class: middle, center, title-slide
Lecture 8: GPT and Large Language Models
Prof. Gilles Louppe
[email protected]
???
R: refresh with Foundation Models
- BabyGPT
- Large language models
class: middle
.center[See code/gpt/.]
class: middle
class: middle
.center[(March 2023)]
.footnote[Credits: lifearchitect.ai/models, 2023.]
class: middle
The decoder-only transformer has become the de facto architecture for large language models.
These models are trained with self-supervised learning, where the target sequence is the same as the input sequence, but shifted by one token to the right.
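A minimal sketch of this objective, with a toy stand-in for the transformer (the actual decoder-only model, with causal self-attention, is implemented in code/gpt/):

```python
import torch
import torch.nn.functional as F

vocab_size = 100

# Toy stand-in for a decoder-only transformer: embedding + linear head.
# A real model stacks masked (causal) self-attention blocks in between.
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)

# Toy batch of token ids: batch size 2, sequence length 6.
tokens = torch.randint(0, vocab_size, (2, 6))

# Self-supervised targets: the input sequence shifted one token to the right.
inputs = tokens[:, :-1]    # positions 0 .. T-2
targets = tokens[:, 1:]    # the "next token" at each position

logits = model(inputs)     # (batch, T-1, vocab_size)

# Next-token prediction loss, averaged over all positions.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```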
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
Historically, GPT-1 was first pre-trained and then fine-tuned on downstream tasks.
.footnote[Credits: Radford et al., Improving Language Understanding by Generative Pre-Training, 2018.]
class: middle
Transformer language model performance improves smoothly as we increase the model size, the dataset size, and the amount of compute used for training.
For optimal performance, all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
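For illustration, a sketch of this power-law form using the approximate fitted constants reported by Kaplan et al. (2020); the numbers are indicative only:

```python
# Approximate power-law fits from Kaplan et al. (2020), each valid when the
# other two factors are not the bottleneck (constants are indicative).
def loss_vs_params(N):    # N: non-embedding parameters
    return (8.8e13 / N) ** 0.076

def loss_vs_tokens(D):    # D: dataset size, in tokens
    return (5.4e13 / D) ** 0.095

def loss_vs_compute(C):   # C: training compute, in PF-days
    return (3.1e8 / C) ** 0.050

# A 10x larger model only reduces the loss by a constant factor (~16% here).
for N in [1e8, 1e9, 1e10, 1e11]:
    print(f"N = {N:.0e}: L ≈ {loss_vs_params(N):.2f}")
```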
.footnote[Credits: Kaplan et al, 2020.]
class: middle
Large models also enjoy better sample efficiency than small models.
- Larger models require less data to achieve the same performance.
- The optimal model size grows smoothly with the amount of compute available for training.
.center.width-100[![](./figures/lec8/scaling-sample-conv.png)]
.footnote[Credits: Kaplan et al, 2020.]
class: middle
GPT-2 and its successors demonstrated the potential of using the same language model for multiple tasks, .bold[without updating the model weights].
Zero-shot, one-shot, and few-shot learning consist in prompting the model with zero, one, or a few examples of the target task and letting it infer what to do from the prompt alone. This paradigm is called in-context learning.
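A few-shot prompt is just text: the "training examples" live in the prompt itself while the weights stay frozen. A minimal sketch, here using the Hugging Face transformers library with the small GPT-2 checkpoint (completion quality depends heavily on the model size):

```python
from transformers import pipeline

# Few-shot prompt: the task is specified by in-context examples only.
prompt = (
    "Translate English to French.\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)

generator = pipeline("text-generation", model="gpt2")
print(generator(prompt, max_new_tokens=5)[0]["generated_text"])
```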
class: middle
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle, center
(demo)
class: middle
As language models grow in size, they start to exhibit emergent abilities that are not present in smaller models.
A (few-shot) prompted task is .bold[emergent] if it achieves random performance for small models and then (suddenly) improves as the model size increases.
class: middle
.footnote[Credits: Wei et al, 2022.]
class: middle
Notably, chain-of-thought reasoning is an emergent ability of large language models. It improves performance on a wide range of arithmetic, commonsense, and symbolic reasoning tasks.
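For concreteness, the canonical exemplar from Wei et al. (2022b): the prompt demonstrates the intermediate reasoning steps, not just the final answer.

```python
# A few-shot chain-of-thought prompt: the exemplar spells out the reasoning,
# which the model then imitates before giving its own final answer.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis "
    "balls. 5 + 6 = 11. The answer is 11.\n"
    "\n"
    "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?\n"
    "A:"
)
```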
.footnote[Credits: Wei et al, 2022b.]
class: middle
.footnote[Credits: Wei et al, 2022b.]
Increasing the model size does not inherently make models follow a user's intent better, despite their emergent abilities.
Worse, scaling up the model may increase the likelihood of undesirable behaviors, including those that are harmful, unethical, or biased.
class: middle
Human feedback can be used to better align language models with user intent, as demonstrated by InstructGPT.
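A minimal sketch of the reward-modeling step at the core of this pipeline. The reward values below are dummy tensors; in InstructGPT they come from a learned scalar-output model scoring prompt/response pairs, and the fitted reward model is then used to fine-tune the policy with PPO.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected) pushes the
    # human-preferred response's reward above the rejected one's.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy rewards for a batch of 4 human-labeled comparison pairs.
r_chosen = torch.randn(4, requires_grad=True)
r_rejected = torch.randn(4, requires_grad=True)

loss = preference_loss(r_chosen, r_rejected)
loss.backward()
```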
.footnote[Credits: Ouyang et al, 2022.]
class: middle
class: end-slide, center
count: false
The end.