A diagram of the Transformer architecture
In this article, I don't plan to explain the Transformer architecture in depth, since there are already several great tutorials on the topic (here, here, and here). Instead, I want to focus on one specific part of the Transformer's architecture: the positional encoding.
The position and order of words are essential parts of any language. They define the grammar, and thus the actual semantics, of a sentence. Recurrent Neural Networks (RNNs) inherently take word order into account: they parse an input sentence word by word, sequentially, which bakes the words' order into the backbone of the model.
The Transformer architecture, however, dropped the recurrence mechanism in favor of multi-head self-attention. Avoiding recurrence yields a massive speed-up in training time, and, in theory, the model can capture longer dependencies in a sentence.
Since every word in a sentence flows through the Transformer's encoder/decoder stack simultaneously, the model itself has no sense of the position or order of each word. Consequently, we still need a way to incorporate word order into the model.
The formula for positional encoding used in the Transformer paper, "Attention Is All You Need" by Vaswani et al., is defined as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:
- pos is the position of the token in the sequence.
- i is the dimension index; even dimensions (2i) use the sine formula and odd dimensions (2i+1) use the cosine formula.
- d_model is the dimension of the model (the number of dimensions in the embedding vector).
If we take the example sentence "I love cats", the sentence first has to be tokenized and mapped to a numerical vocabulary.
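A minimal sketch of this step, assuming a naive whitespace tokenizer and a toy vocabulary built from the sentence itself (real tokenizers such as BPE are more involved):

```python
sentence = "I love cats"

# Naive whitespace tokenization.
tokens = sentence.split()

# Build a toy vocabulary that maps each token to an integer ID.
vocab = {token: idx for idx, token in enumerate(tokens)}
token_ids = [vocab[token] for token in tokens]

print(tokens)
print(token_ids)
```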
Output:
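```
['I', 'love', 'cats']
[0, 1, 2]
```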
So, our input sentence "I love cats" was tokenized into a list of tokens ["I", "love", "cats"] and then mapped to the numerical values [0, 1, 2].
Relating our sentence to the positional encoding formula, pos represents the position of the word. For example, the position (pos) of "I" is 0, the position of "love" is 1, and the position of "cats" is 2.
Now, for the sake of simplicity, let's assume we want our model dimension (d_model), i.e., the dimension of our positional encoding vector, to be 4. This means each position will have a corresponding 4-dimensional positional encoding vector.
For position 0 ("I"), position 1 ("love"), and position 2 ("cats"), the (pos, i) pairs follow the pattern:

Position 0: [(0, 0), (0, 1), (0, 2), (0, 3)]
Position 1: [(1, 0), (1, 1), (1, 2), (1, 3)]
Position 2: [(2, 0), (2, 1), (2, 2), (2, 3)]

where the first element of each tuple represents the position (pos), which stays the same as the word's position in the sentence, and the second element represents the dimension index (i) within the d_model dimensions.
When I say that even dimensions use the sine function and odd dimensions use the cosine function, this refers to the i-th dimension of d_model, i.e., the second element of the tuples above.
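Concretely, for d_model = 4, each position's encoding vector expands to:

PE(pos) = [sin(pos / 10000^(0/4)), cos(pos / 10000^(0/4)), sin(pos / 10000^(2/4)), cos(pos / 10000^(2/4))]
        = [sin(pos), cos(pos), sin(pos / 100), cos(pos / 100)]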
So, here are the positional encoding vectors of the sentence "I love cats" with d_model = 4:
Position 0: [0, 1, 0, 1]
Position 1: [0.8415, 0.5403, 0.01, 0.99995]
Position 2: [0.9093, -0.4161, 0.02, 0.9998]
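Here is a minimal Python sketch that reproduces these vectors from the formula above (the function name positional_encoding is mine, for illustration):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: one d_model-sized vector per position."""
    encodings = []
    for pos in range(seq_len):
        vec = [0.0] * d_model
        for i in range(0, d_model, 2):          # i walks over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            vec[i] = math.sin(angle)            # even dimension -> sine
            if i + 1 < d_model:
                vec[i + 1] = math.cos(angle)    # odd dimension -> cosine
        encodings.append(vec)
    return encodings

for pos, vec in enumerate(positional_encoding(seq_len=3, d_model=4)):
    print(f"Position {pos}: {[round(v, 5) for v in vec]}")
```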
Now, the positional encoding vector for each token is added to its word embedding vector. This element-wise addition produces a new input embedding vector, which is what is actually fed into our Transformer model.
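A sketch of this step, with made-up word-embedding values (in a real model these are learned during training, not hand-picked):

```python
# Made-up 4-dimensional word embeddings for "I", "love", "cats".
word_embeddings = [
    [0.2, 0.1, 0.4, 0.3],  # "I"
    [0.5, 0.9, 0.1, 0.7],  # "love"
    [0.8, 0.3, 0.6, 0.2],  # "cats"
]

# The positional encoding vectors computed above for positions 0, 1, 2.
positional_encodings = [
    [0.0, 1.0, 0.0, 1.0],
    [0.8415, 0.5403, 0.01, 0.99995],
    [0.9093, -0.4161, 0.02, 0.9998],
]

# Element-wise addition produces the input embeddings fed to the Transformer.
input_embeddings = [
    [w + p for w, p in zip(word_vec, pos_vec)]
    for word_vec, pos_vec in zip(word_embeddings, positional_encodings)
]

for token, vec in zip(["I", "love", "cats"], input_embeddings):
    print(token, [round(v, 4) for v in vec])
```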
By adding the positional encodings to the word embeddings, each input embedding now carries both the semantic meaning of the word and its positional context. Here's how this addition helps:
- Semantic meaning: the word embedding part of the input embedding vector retains the semantic information about the word.
- Positional context: the positional encoding part of the input embedding vector encodes the word's position within the sequence.
This combination allows the Transformer model to distinguish between identical words appearing in different positions and to understand their roles in different contexts. The self-attention mechanism can then use these enriched input embeddings to attend to the relevant parts of the sequence, taking into account both the meaning and the position of each word.
References:
The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
LLMs from Scratch: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb