class: middle, center, title-slide
Lecture 7: Attention and transformers
Prof. Gilles Louppe
[email protected]
Attention is all you need!
- Encoder-decoder
- Bahdanau attention
- Attention layers
- Transformers
???
Mission: learn about a novel and fundamental building block in modern neural networks. This building block can replace both fully connected and convolutional layers.
class: middle
class: middle
Many real-world problems require processing a signal with a sequence structure.
- Sequence classification:
- sentiment analysis in text
- activity/action recognition in videos
- DNA sequence classification
- Sequence synthesis:
- text synthesis
- music synthesis
- motion synthesis
- Sequence-to-sequence translation:
- speech recognition
- text translation
- part-of-speech tagging
.footnote[Credits: Francois Fleuret, 14x050/EE559 Deep Learning, EPFL.]
???
Draw all 3 setups.
class: middle
Given a set
.grid.center[ .kol-1-2.bold[Sequence classification] .kol-1-2[$f: S(\mathcal{X}) \to \bigtriangleup^C$] ] .grid.center[ .kol-1-2.bold[Sequence synthesis] .kol-1-2[$f: \mathbb{R}^d \to S(\mathcal{X})$] ] .grid.center[ .kol-1-2.bold[Sequence-to-sequence translation] .kol-1-2[$f: S(\mathcal{X}) \to S(\mathcal{Y})$] ]
In the rest of the slides, we consider only time-indexed signals, although the discussion generalizes to arbitrary sequences.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
When the input is a sequence
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
Recurrent encoder-decoder models compress an input sequence
.footnote[Credits: Dive Into Deep Learning, 2023.]
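A minimal sketch of such a recurrent encoder-decoder, assuming a GRU encoder whose final hidden state initializes a GRU decoder (vocabulary sizes and dimensions are illustrative, not those of any specific model):

```python
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    # A toy recurrent encoder-decoder: the encoder compresses the whole
    # input sequence into its final hidden state, which conditions the decoder.
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.readout = nn.Linear(d, tgt_vocab)

    def forward(self, src, tgt):
        _, h = self.encoder(self.src_embed(src))     # h: (1, B, d), the single "thought vector"
        y, _ = self.decoder(self.tgt_embed(tgt), h)  # decoder conditioned only on h
        return self.readout(y)                       # (B, T_tgt, tgt_vocab) logits

x = torch.randint(0, 1000, (8, 12))        # batch of source token ids
y = torch.randint(0, 1000, (8, 10))        # batch of target token ids
logits = EncoderDecoder(1000, 1000)(x, y)  # (8, 10, 1000)
```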
???
Blackboard: translate to French the following sentence.
"The animal didn't cross the street because it was too tired."
->
"L'animal n'a pas traversé la rue car il était trop fatigué."
Imitate how the RNN would translate this sentence.
class: middle
This architecture assumes that the sole vector
???
There are no direct "channels" to transport local information from the input sequence to the place where it is useful in the resulting sequence.
The problem is similar to the one seen last time with FCNs without skip connections: the information is bottlenecked in the single vector
class: middle
class: black-slide
background-image: url(figures/lec7/vision.png)
background-size: cover
class: middle
Using the nonvolitional cue based on saliency (red cup, non-paper), attention is involuntarily directed to the coffee.
.footnote[Credits: Dive Into Deep Learning, 2023.]
???
Volitional: Related to the faculty or power of using one's will.
class: middle
Using the volitional cue (want to read a book) that is task-dependent, attention is directed to the book under volitional control.
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
Attention mechanisms can transport information from parts of the input signal to parts of the output .bold[specified dynamically].
Under the assumption that each output token comes from one or a handful of input tokens, the decoder should attend to only those tokens that are relevant for producing the next output token.
???
Blackboard: translate to French the following sentence.
"The animal didn't cross the street because it was too tired."
->
"L'animal n'a pas traversé la rue car il était trop fatigué."
class: middle
.footnote[Credits: Dive Into Deep Learning, 2023.]
???
Same RNN-based encoder-decoder architecture, but with an attention mechanism in between.
class: middle
Following Bahdanau et al. (2014), the encoder is specified as a bidirectional recurrent neural network (RNN) that computes an annotation vector for each input token,
From this, they compute a new process
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
Given
Then, compute the context vector from the weighted
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
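A minimal sketch of this step, assuming the additive (MLP) scoring of Bahdanau et al.; `h` holds the encoder annotations and `s` the current decoder state (the names and dimensions are illustrative, not the slides' notation):

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    # Scores the encoder annotations h_1..h_T against the current decoder
    # state s, then returns the resulting context vector.
    def __init__(self, d_enc, d_dec, d_att=128):
        super().__init__()
        self.W = nn.Linear(d_dec, d_att, bias=False)
        self.U = nn.Linear(d_enc, d_att, bias=False)
        self.v = nn.Linear(d_att, 1, bias=False)

    def forward(self, s, h):
        # s: (B, d_dec), h: (B, T, d_enc)
        e = self.v(torch.tanh(self.W(s).unsqueeze(1) + self.U(h)))  # (B, T, 1) scores
        a = torch.softmax(e, dim=1)                                 # attention weights over T
        c = (a * h).sum(dim=1)                                      # (B, d_enc) context vector
        return c, a.squeeze(-1)

c, a = BahdanauAttention(512, 256)(torch.randn(4, 256), torch.randn(4, 7, 512))
```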
???
Note that the attention weights depend on the content, rather than on the position in sentence. This means they act as a form of content-based addressing.
class: middle
The model can now make the prediction
This is context attention, where
???
Do a blackboard example.
class: middle
???
- Source = English
- Target = French
class: middle
class: middle
Attention mechanisms can be defined generically as follows.
Given a context or query vector
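Schematically, the output is a convex combination of the values, weighted by a softmax over the scores between the query and the keys. A minimal sketch with a pluggable scoring function (a sketch, not the slides' notation):

```python
import torch

def attention(q, K, V, score):
    # q: (d_q,), K: (m, d_k), V: (m, d_v); score maps a (query, key) pair to a scalar.
    s = torch.stack([score(q, k) for k in K])  # (m,) attention scores
    a = torch.softmax(s, dim=0)                # (m,) attention weights, sum to 1
    return a @ V                               # (d_v,) weighted average of the values

q, K, V = torch.randn(8), torch.randn(5, 8), torch.randn(5, 16)
y = attention(q, K, V, score=lambda q, k: q @ k)  # dot-product scoring as an example
```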
class: middle
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
When queries and keys are vectors of different lengths, we can use additive attention as the scoring function.
Given
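A minimal sketch of the additive scoring function $a(\mathbf{q}, \mathbf{k}) = \mathbf{w}_v^T \tanh(\mathbf{W}_q \mathbf{q} + \mathbf{W}_k \mathbf{k})$, assuming this standard parameterization (sizes are arbitrary):

```python
import torch
import torch.nn as nn

d_q, d_k, h = 20, 12, 8   # query, key and hidden sizes (arbitrary)
W_q, W_k = nn.Linear(d_q, h, bias=False), nn.Linear(d_k, h, bias=False)
w_v = nn.Linear(h, 1, bias=False)

def additive_score(q, k):
    # Projects q and k into a common h-dimensional space before scoring,
    # so queries and keys may have different lengths.
    return w_v(torch.tanh(W_q(q) + W_k(k))).squeeze(-1)

scores = additive_score(torch.randn(5, d_q), torch.randn(5, d_k))  # (5,) one score per pair
```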
class: middle
When queries and keys are vectors of the same length
Given
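A minimal batched sketch of scaled dot-product attention, $\text{softmax}(\mathbf{Q}\mathbf{K}^T/\sqrt{d})\mathbf{V}$, under the usual conventions (a sketch, not a reference implementation):

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Q: (B, n, d), K: (B, m, d), V: (B, m, d_v)
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d)  # (B, n, m)
    weights = torch.softmax(scores, dim=-1)          # each row sums to 1
    return weights @ V                               # (B, n, d_v)

Q, K, V = torch.randn(2, 4, 64), torch.randn(2, 6, 64), torch.randn(2, 6, 64)
Y = scaled_dot_product_attention(Q, K, V)            # (2, 4, 64)
```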
class: middle
For
class: middle
Recall that the dot product is simply an unnormalised cosine similarity, which tells us about the alignment of two vectors.
Therefore, the
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
In the currently standard models for sequences, the queries, keys and values are linear functions of the inputs.
Given the learnable matrices
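Concretely, with the input tokens stacked in a matrix `X`, the projections are just matrix products (a short sketch; the weight names are mine):

```python
import torch

d_in, d = 32, 64
X = torch.randn(10, d_in)                         # 10 input tokens, one per row
W_q, W_k, W_v = (torch.randn(d_in, d) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v               # queries, keys and values, one row per token
```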
class: middle
When the queries, keys and values are derived from the same inputs, the attention mechanism is called self-attention.
For the scaled dot-product attention, the self-attention layer is obtained when
Therefore, self-attention can be used as a regular feedforward-style layer, similar to fully connected or convolutional layers.
.center.width-60[![](figures/lec7/self-attention-layer.svg)]
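A minimal single-head self-attention layer combining the previous sketches, with queries, keys and values all derived from the same input (an illustration under these assumptions, not the exact layer of the figure):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    # Maps a sequence (B, T, d_in) to a sequence (B, T, d_out), like an FC
    # or convolutional layer would, but with content-based mixing across T.
    def __init__(self, d_in, d_out):
        super().__init__()
        self.q = nn.Linear(d_in, d_out, bias=False)
        self.k = nn.Linear(d_in, d_out, bias=False)
        self.v = nn.Linear(d_in, d_out, bias=False)

    def forward(self, x):
        Q, K, V = self.q(x), self.k(x), self.v(x)
        A = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(Q.shape[-1]), dim=-1)
        return A @ V

y = SelfAttention(32, 64)(torch.randn(2, 10, 32))  # (2, 10, 64)
```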
class: middle
.footnote[Credits: Dive Into Deep Learning, 2023.]
???
Compare visually on the blackboard and show the similarities and differences.
class: middle
where
???
As noted in Table 1 of Vaswani et al. (2017), a self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires $O(n)$ sequential operations. In terms of computational complexity, self-attention layers are faster than recurrent layers when the sequence length $n$ is smaller than the representation dimensionality $d$.
A single convolutional layer with kernel width $k < n$ does not connect all pairs of input and output positions.
class: middle
To illustrate the behavior of the attention mechanism, we consider a toy problem with 1d sequences composed of two triangular and two rectangular patterns. The target sequence averages the heights in each pair of shapes.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
We can modify the toy problem to consider targets where the pairs to average are the two leftmost and the two rightmost shapes.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
The performance is expected to be poor given the inability of the self-attention layer to take into account absolute or relative positions. Indeed, self-attention is permutation-invariant:
$$\begin{aligned}
\mathbf{y} &= \sum_{i=1}^m \text{softmax}_i\left(\frac{\mathbf{q}^T{\mathbf{K}^T_{i}}}{\sqrt{d}}\right) \mathbf{V}_{i}\\
&= \sum_{i=1}^m \text{softmax}_{i}\left(\frac{\mathbf{q}^T{\mathbf{K}^T_{\sigma(i)}}}{\sqrt{d}}\right) \mathbf{V}_{\sigma(i)}
\end{aligned}$$
for any permutation
(It is also permutation-equivariant with permutation
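This can be checked numerically: for a fixed query, permuting the key/value pairs leaves the output unchanged (a small sanity check in the same spirit as the equation above):

```python
import math
import torch

q, K, V = torch.randn(16), torch.randn(6, 16), torch.randn(6, 8)
perm = torch.randperm(6)                       # a random permutation of the key/value pairs

def attend(q, K, V):
    a = torch.softmax(K @ q / math.sqrt(q.shape[0]), dim=0)
    return a @ V

print(torch.allclose(attend(q, K, V), attend(q, K[perm], V[perm])))  # True
```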
class: middle
However, this problem can be fixed by providing positional encodings explicitly to the attention layer.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
class: middle
Vaswani et al. (2017) proposed to go one step further: instead of using attention mechanisms as a supplement to standard convolutional and recurrent layers, they designed a model, the .bold[transformer], combining only attention layers.
The transformer was designed for a sequence-to-sequence translation task, but it is currently key to state-of-the-art approaches across NLP tasks.
.footnote[Credits: Francois Fleuret, Deep Learning, UNIGE/EPFL.]
class: middle
The first building block of the transformer architecture is a scaled dot-product attention module
class: middle
The transformer projects the queries, keys and values
.footnote[Credits: Dive Into Deep Learning, 2023.]
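A minimal multi-head attention sketch: project the inputs several times with different learned projections, attend independently in each head, then concatenate and project back (dimensions are illustrative; the optional mask argument anticipates the decoder):

```python
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def split(self, x):  # (B, T, d_model) -> (B, h, T, d_head)
        B, T, _ = x.shape
        return x.view(B, T, self.h, self.d_head).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        Q, K, V = self.split(self.q(q)), self.split(self.k(k)), self.split(self.v(v))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, h, T_q, T_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))  # forbid masked positions
        Y = torch.softmax(scores, dim=-1) @ V                      # (B, h, T_q, d_head)
        B, _, T, _ = Y.shape
        return self.out(Y.transpose(1, 2).reshape(B, T, -1))       # concatenate the heads

y = MultiHeadAttention()(torch.randn(2, 5, 512), torch.randn(2, 7, 512), torch.randn(2, 7, 512))
```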
class: middle
The transformer model is composed of:
- An encoder that combines $N=6$ modules, each composed of a multi-head attention sub-module and a (per-component) one-hidden-layer MLP, with residual pass-through and layer normalization. All sub-modules and embedding layers produce outputs of dimension $d_\text{model}=512$.
- A decoder that combines $N=6$ modules similar to the encoder's, but using masked self-attention to prevent positions from attending to subsequent positions. In addition, the decoder inserts a third sub-module which performs multi-head attention over the output of the encoder stack.
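A minimal sketch of one such encoder module, using PyTorch's built-in `nn.MultiheadAttention` for brevity (post-norm placement as in the original paper; $d_\text{ff}=2048$ is the paper's MLP width):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    # One encoder module: multi-head self-attention, then a per-token MLP,
    # each wrapped with a residual connection and layer normalization.
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(x + self.mlp(x))

encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])   # N = 6 stacked modules
y = encoder(torch.randn(2, 10, 512))                           # (2, 10, 512)
```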
class: middle
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
The encoders start by processing the input sequence. The output of the top encoder is then transformed into a set of attention vectors
.footnote[Credits: Jay Alammar, The Illustrated Transformer.]
class: middle
Each step in the decoding phase produces an output token, until a special symbol is reached indicating the completion of the transformer decoder's output.
The output of each step is fed to the bottom decoder in the next time step, and the decoders bubble up their decoding results just like the encoders did.
.footnote[Credits: Jay Alammar, The Illustrated Transformer.]
class: middle
In the decoder:
- The first masked self-attention sub-module is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions.
- The second multi-head attention sub-module works just like multi-head self-attention, except it creates its query matrix from the layer below it, and takes the keys and values matrices from the output of the encoder stack.
.footnote[Credits: Jay Alammar, The Illustrated Transformer.]
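A sketch of the two decoder sub-modules described in the bullets above, again using PyTorch's built-in `nn.MultiheadAttention`; the causal mask marks as forbidden every pair where a position would attend to a later one:

```python
import torch
import torch.nn as nn

T, d_model = 5, 512
# Boolean mask: True marks pairs that must NOT attend, i.e. all future positions.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

self_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)

x = torch.randn(2, T, d_model)                       # decoder inputs so far
memory = torch.randn(2, 7, d_model)                  # output of the encoder stack
h, _ = self_attn(x, x, x, attn_mask=causal_mask)     # masked self-attention
y, _ = cross_attn(h, memory, memory)                 # queries from below, keys/values from encoder
```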
class: middle
Since all words in a sentence flow through the encoder/decoder stack .italic[simultaneously], the model itself has no sense of the position or order of each word.
Positional information is provided through an additive positional encoding of the same dimension
After adding the positional encoding, words will be closer to each other based on the similarity of their meaning and their relative position in the sentence, in the
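A minimal sketch of the sinusoidal positional encoding of Vaswani et al. (2017), $PE_{(pos, 2i)} = \sin(pos/10000^{2i/d})$ and $PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d})$, added to the token embeddings:

```python
import torch

def positional_encoding(T, d):
    pos = torch.arange(T, dtype=torch.float).unsqueeze(1)  # (T, 1) positions
    i = torch.arange(0, d, 2, dtype=torch.float)           # even dimension indices
    angles = pos / (10000.0 ** (i / d))                    # (T, d/2)
    pe = torch.zeros(T, d)
    pe[:, 0::2] = torch.sin(angles)                        # sine on even dimensions
    pe[:, 1::2] = torch.cos(angles)                        # cosine on odd dimensions
    return pe

embeddings = torch.randn(1, 50, 128)                       # (B, T, d_model)
x = embeddings + positional_encoding(50, 128)              # broadcast over the batch
```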
???
All words of the input sequence are fed to the network with no special order or position; in contrast, in an RNN architecture, the $n$-th word is fed at step $n$, and in a ConvNet, it is fed to specific input indices. Therefore, the proposed model has no idea how the words are ordered.
Draw https://datascience.stackexchange.com/questions/51065/what-is-the-positional-encoding-in-the-transformer-model on black board.
class: middle
.center[128-dimensional positional encoding for a sentence with a maximum length of 50. Each row represents one positional encoding vector.]
class: middle
The transformer architecture was first designed for machine translation and tested on English-to-German and English-to-French translation tasks.
.center[
Self-attention layers learned that "it" could refer
to different entities, in different contexts.
]
.footnote[Credits: Transformer: A Novel Neural Network Architecture for Language Understanding, 2017.]
class: middle
.center[
Attention maps extracted from the multi-head attention modules
show how input tokens relate to output tokens.
]
.footnote[Credits: Transformer model for language understanding.]
class: middle
The decoder-only transformer has become the de facto architecture for large language models
These models are trained with self-supervised learning: the target sequence is the input sequence shifted by one token, so that the model learns to predict the next token at every position.
.footnote[Credits: Dive Into Deep Learning, 2023.]
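Concretely, each training example supplies its own targets: the prediction at position $t$ is scored against token $t+1$. A minimal sketch of how inputs and targets are derived from one token sequence (the logits are a stand-in for a decoder-only transformer's output):

```python
import torch
import torch.nn.functional as F

tokens = torch.randint(0, 50000, (1, 129))       # one tokenized text of length 129
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # targets are the inputs shifted by one token

logits = torch.randn(1, 128, 50000)              # stand-in for the model's per-position logits
loss = F.cross_entropy(logits.reshape(-1, 50000), targets.reshape(-1))
```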
class: middle
Historically, GPT-1 was first pre-trained and then fine-tuned on downstream tasks.
.footnote[Credits: Radford et al., Improving Language Understanding by Generative Pre-Training, 2018.]
class: middle
Transformer language model performance improves smoothly as we increase the model size, the dataset size, and the amount of compute used for training.
For optimal performance, all three factors must be scaled up in tandem. Empirical performance has a power-law relationship with each individual factor when not bottlenecked by the other two.
.footnote[Credits: Kaplan et al, 2020.]
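Schematically, Kaplan et al. (2020) report power laws of the form

$$L(X) \approx \left(\frac{X_c}{X}\right)^{\alpha_X}$$

for the test loss, where $X$ is the model size $N$, the dataset size $D$, or the compute $C$ (each in the regime where the other two are not bottlenecks), and $X_c$ and $\alpha_X$ are fitted constants.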
class: middle
Large models also enjoy better sample efficiency than small models.
- Larger models require less data to achieve the same performance.
- The optimal model size grows smoothly with the amount of compute available for training.
.center.width-100[![](./figures/lec7/scaling-sample-conv.png)]
.footnote[Credits: Kaplan et al, 2020.]
class: middle
All modern conversational agents are based on the same transformer models, scaled up to billions of parameters, trillions of training tokens, and thousands of petaflop/s-days of compute.
class: middle
count: false
class: middle
The transformer architecture was first designed for sequences, but it can be adapted to process images.
The key idea is to reshape the input image into a sequence of patches, which are then processed by a transformer encoder. This architecture is known as the .bold[vision transformer] (ViT).
class: middle
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
- The input image is divided into non-overlapping patches, which are then linearly embedded into a sequence of vectors.
- The sequence of vectors is then processed by a transformer encoder, which outputs a sequence of vectors.
- Training the vision transformer can be done with supervised or self-supervised learning.
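A minimal sketch of the patch embedding described in the first bullet above, assuming $16\times 16$ patches embedded with a strided convolution (a common implementation trick, not necessarily the one in the figure):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    # Splits an image into non-overlapping P x P patches and linearly embeds each one;
    # a Conv2d with kernel_size = stride = P is equivalent to a per-patch linear map.
    def __init__(self, d_model=768, patch=16, channels=3):
        super().__init__()
        self.proj = nn.Conv2d(channels, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, d_model, H/16, W/16)
        return x.flatten(2).transpose(1, 2)     # (B, num_patches, d_model)

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))   # (2, 196, 768)
```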
class: middle
Just like text transformers, vision transformers learn representations of the input image that can be used for various tasks, such as image classification, object detection, and image generation.
class: middle
.center[Segment anything (Kirillov et al., 2023) combines a vision transformer with a prompt encoder to produce masks with a transformer-based decoder.]
class: middle, center, black-slide
<iframe width="600" height="450" src="https://www.youtube.com/embed/oYUcl_cqKcs" frameborder="0" allowfullscreen></iframe>

Segment anything (Kirillov et al., 2023)
class: end-slide, center
count: false
The end.