Probability Seminar
Monday, September 30, 2024 - 4:00pm
Malott 406
The output of a language model like ChatGPT is a probability distribution over the next “token” (a short word or fragment of a long word). Mathematically, what is the model actually doing when it generates these probabilities? I’ll narrate how the model translates its input text into a sequence of vectors, which pass through a succession of alternating linear and nonlinear layers. I’ll examine this architecture with a view toward some big-picture questions: Do language models “plan ahead” for future words, phrases, and sentences? And if so, how does training the model to predict just one token at a time incentivize long-term planning? Based on joint work with Wilson Wu and John X. Morris (https://arxiv.org/abs/2404.00859).
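To fix ideas, here is a minimal, purely illustrative sketch (not the speakers’ code or any real model) of the pipeline the abstract describes: token ids become vectors, the vectors pass through alternating linear and nonlinear layers, and a softmax turns the final vector into a probability distribution over the next token. All names, dimensions, and the choice of ReLU as the nonlinearity are assumptions made only for illustration.

```python
import numpy as np

# Toy "language model" sketch: embed tokens, apply alternating linear and
# nonlinear layers, and read out a next-token probability distribution.
rng = np.random.default_rng(0)

vocab_size, d_model, n_layers, seq_len = 50, 16, 3, 5  # hypothetical sizes

# Hypothetical parameters: embedding table, a stack of weight matrices,
# and an unembedding matrix mapping back to the vocabulary.
embedding = rng.normal(size=(vocab_size, d_model))
layers = [rng.normal(scale=d_model**-0.5, size=(d_model, d_model))
          for _ in range(n_layers)]
unembedding = rng.normal(size=(d_model, vocab_size))

def softmax(z):
    z = z - z.max()           # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def next_token_distribution(token_ids):
    """Map a sequence of token ids to a probability distribution over the next token."""
    x = embedding[token_ids]           # input text as a sequence of vectors, shape (seq_len, d_model)
    for W in layers:
        x = np.maximum(x @ W, 0.0)     # linear layer followed by a nonlinearity (ReLU)
    logits = x[-1] @ unembedding       # read out from the last position
    return softmax(logits)             # probabilities for the next token

probs = next_token_distribution(rng.integers(0, vocab_size, size=seq_len))
print(probs.sum())  # 1.0: a valid probability distribution over the vocabulary
```

The actual architecture discussed in the talk is of course far richer (attention layers, residual connections, learned parameters), but this toy version shows the basic shape: alternating linear and nonlinear maps ending in a softmax over the vocabulary.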