A diagram of the Transformer architecture
In this article, I don't plan to explain the Transformer architecture in depth, since there are already several great tutorials on the topic (here, here, and here). Instead, I want to focus on one specific part of the Transformer's architecture: the positional encoding.
The position and order of words are essential parts of any language. They define the grammar, and thus the actual semantics, of a sentence. Recurrent Neural Networks (RNNs) inherently take word order into account: they parse an input sentence word by word, sequentially, which bakes the words' order into the backbone of the model.
The Transformer architecture, however, dropped the recurrence mechanism in favor of multi-head self-attention. Avoiding recurrence yields a massive speed-up in training time, and, in theory, the model can capture longer dependencies in a sentence.
Since every word in a sentence flows through the Transformer's encoder/decoder stack simultaneously, the model itself has no sense of the position or order of each word. Consequently, we still need a way to incorporate word order into the model.
The formula for positional encoding used in the Transformer paper, "Attention Is All You Need" by Vaswani et al., is defined as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

where:
- pos is the position of the token in the sequence.
- i is the dimension index; even dimensions (2i) use the sine formula and odd dimensions (2i+1) use the cosine formula.
- d_model is the dimension of the model (the number of dimensions in the embedding vector).
If we take the example sentence "I love cats", the sentence first has to be tokenized and mapped to a numerical vocabulary.
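A minimal sketch of this step, assuming a naive whitespace tokenizer and a toy vocabulary built from the sentence itself (real tokenizers such as BPE are more involved):

```python
sentence = "I love cats"

# Naive whitespace tokenization.
tokens = sentence.split()

# Build a toy vocabulary that maps each token to an integer ID.
vocab = {token: idx for idx, token in enumerate(tokens)}
token_ids = [vocab[token] for token in tokens]

print(tokens)
print(token_ids)
```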
Output:
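```
['I', 'love', 'cats']
[0, 1, 2]
```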
So, our input sentence "I love cats" was tokenized into a list of tokens ["I", "love", "cats"] and then mapped to the numerical values [0, 1, 2].
Relating our sentence to the positional encoding formula, pos represents the position of the word. For example, the position (pos) of "I" is 0, the position of "love" is 1, and the position of "cats" is 2.
Now, for the sake of simplicity, let's assume we want our model dimension (d_model), i.e., the dimension of our positional encoding vector, to be 4. This means each position will have a corresponding 4-dimensional positional encoding vector.
For position 0 ("I"), position 1 ("love"), and position 2 ("cats"), the (pos, i) pairs follow the pattern:

Position 0: [(0, 0), (0, 1), (0, 2), (0, 3)]
Position 1: [(1, 0), (1, 1), (1, 2), (1, 3)]
Position 2: [(2, 0), (2, 1), (2, 2), (2, 3)]

where the first element of each tuple represents the position (pos), which stays the same as the word's position in the sentence, and the second element represents the dimension index (i) within the d_model dimensions.
When I say that even dimensions use the sine function and odd dimensions use the cosine function, this refers to the i-th dimension of d_model, i.e., the second element of the tuples above.
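Concretely, for d_model = 4, each position's encoding vector expands to:

PE(pos) = [sin(pos / 10000^(0/4)), cos(pos / 10000^(0/4)), sin(pos / 10000^(2/4)), cos(pos / 10000^(2/4))]
        = [sin(pos), cos(pos), sin(pos / 100), cos(pos / 100)]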
So, here are the positional encoding vectors of the sentence "I love cats" with d_model = 4:
Position 0: [0, 1, 0, 1]
Position 1: [0.8415, 0.5403, 0.01, 0.99995]
Position 2: [0.9093, -0.4161, 0.02, 0.9998]
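Here is a minimal Python sketch that reproduces these vectors from the formula above (the function name positional_encoding is mine, for illustration):

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: one d_model-sized vector per position."""
    encodings = []
    for pos in range(seq_len):
        vec = [0.0] * d_model
        for i in range(0, d_model, 2):          # i walks over the even dimensions
            angle = pos / (10000 ** (i / d_model))
            vec[i] = math.sin(angle)            # even dimension -> sine
            if i + 1 < d_model:
                vec[i + 1] = math.cos(angle)    # odd dimension -> cosine
        encodings.append(vec)
    return encodings

for pos, vec in enumerate(positional_encoding(seq_len=3, d_model=4)):
    print(f"Position {pos}: {[round(v, 5) for v in vec]}")
```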
Now, the positional encoding vector for each token is added to its word embedding vector. This element-wise addition produces a new input embedding vector, which is what is actually fed into our Transformer model.
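A sketch of this step, with made-up word-embedding values (in a real model these are learned during training, not hand-picked):

```python
# Made-up 4-dimensional word embeddings for "I", "love", "cats".
word_embeddings = [
    [0.2, 0.1, 0.4, 0.3],  # "I"
    [0.5, 0.9, 0.1, 0.7],  # "love"
    [0.8, 0.3, 0.6, 0.2],  # "cats"
]

# The positional encoding vectors computed above for positions 0, 1, 2.
positional_encodings = [
    [0.0, 1.0, 0.0, 1.0],
    [0.8415, 0.5403, 0.01, 0.99995],
    [0.9093, -0.4161, 0.02, 0.9998],
]

# Element-wise addition produces the input embeddings fed to the Transformer.
input_embeddings = [
    [w + p for w, p in zip(word_vec, pos_vec)]
    for word_vec, pos_vec in zip(word_embeddings, positional_encodings)
]

for token, vec in zip(["I", "love", "cats"], input_embeddings):
    print(token, [round(v, 4) for v in vec])
```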
By adding the positional encodings to the word embeddings, each input embedding now carries both the semantic meaning of the word and its positional context. Here's how this addition helps:
- Semantic meaning: the word embedding part of the input embedding vector retains the semantic information about the word.
- Positional context: the positional encoding part of the input embedding vector encodes the word's position within the sequence.
This combination allows the Transformer model to distinguish between identical words appearing in different positions and to understand their roles in different contexts. The self-attention mechanism can then use these enriched input embeddings to attend to the relevant parts of the sequence, taking into account both the meaning and the position of each word.
References:
The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
LLMs from Scratch: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb