Understanding Transformer



In 2022, I wrote an article titled The Future of Humanity, which carried quite an ambitious title. In the article, I outlined five areas, each representing fields I was deeply eager to explore and contribute to, with the hope of witnessing real progress for humanity.

But I was stuck in the grind of daily work and couldn’t jump out—until last year when I chose to leave Scroll and finally started focusing on the work I was truly passionate about (btw, Scroll is still a great company and has the #1 TVL among all zkEVM Rollups). I launched a small experiment called D-Day, aiming to bring LLM-powered AI agents into the blockchain space. In this experiment, each EOA and smart contract acted as a finely trained AI agent capable of performing any operation on the blockchain just like a human (and yes, I do mean any). These AI agents would compete against the smartest humans to see who could capture the most alpha. The basic idea was quite similar to Vitalik’s concept in his article, ‘AI as a player in a game’.


At that time, there were very few examples of the intersection between crypto and AI. But now, we’re starting to see more fun applications, from Coinbase’s AI API to cases where AI agents use crypto as payment. AI is poised to be a huge new growth area for crypto, and in the future, the scale of on-chain activity and transaction volume generated by AI could surpass that of real human participants.

As for D-Day, the project exceeded my expectations in ways I hadn’t imagined, and I would say it was a great success! I plan to write a dedicated article explaining how D-Day works in the next few months. But back then, this project nearly killed me. I was too naive to understand the challenges I was facing. To complete D-Day, I had to build an extensive infrastructure from scratch, including data processing, blockchain interactions, Solidity contracts, model fine-tuning, and interfaces that enabled AI agents to handle large-scale blockchain transactions. In the end, I even implemented a distributed load system across multiple machines. Ultimately, with the help of LLMs, I wrote between 200,000 and 250,000 lines of code across various libraries, with over 10,000 commits on GitHub.



Core Codebase Metrics and Development Trajectory

I started to get curious about why AI is so powerful. With its capabilities, I was able to create AI agents that performed at a human level of intelligence in on-chain activities. As data quality improves and models themselves advance—especially with the development of dedicated models for crypto use cases based on on-chain data—this process will accelerate, allowing AI to surpass humans in the on-chain world and positioning crypto as a primary economic layer and payment network for AI.

Driven by my curiosity and the experience of Project D-Day, I have grown increasingly obsessed with understanding “intelligence”. I’m increasingly fascinated by what’s inside the “mystery box” of AI (now LLMs)—whether it holds a snake or a cake—and the impact it will have on human society. Therefore, I have started digging into the origins of it all.

The story begins with the Transformer—the foundation of all things LLM. This article is my attempt to articulate my understanding of it.


Background

The Transformer is fundamentally crucial to the development of both NLP and LLMs, as nearly all of today’s most powerful LLM products are built on the Transformer architecture. For instance, GPT-like models such as Grok, Llama, Gemini, and Claude are trained on vast human knowledge resources, enabling them to generate diverse responses for users on a daily basis.


A fun meme

The Transformer was first introduced in 2017 in the groundbreaking paper “Attention Is All You Need”. At the time, the Transformer was primarily proposed as an improvement over Seq2Seq models (RNN/LSTM) to address the issue of long-distance dependencies in machine translation. Long-distance dependencies refer to the challenge of understanding relationships between words that are far apart in long sentences. For instance, in the sentence The cat that chased the mouse ran away, if there are hundreds of words between the cat and ran away, traditional models may struggle to maintain the connection between these two phrases, leading to inaccurate translations.

The first major improvement of the Transformer is the self-attention mechanism. While the Transformer shares a similar structure with Seq2Seq models, as both use an encoder and a decoder, the Transformer discards the sequential processing nature of RNN/LSTM and instead introduces the self-attention mechanism. This mechanism allows the model to compute the relevance of each word to every other word in the sequence, rather than being limited to capturing dependencies only between adjacent words, as is the case with RNN/LSTM models. This significantly improves performance in tasks like machine translation, especially when handling long texts, as the Transformer can better capture global context.

The second major improvement of the Transformer is its ability to perform parallel computation, which greatly enhances computational efficiency. In traditional RNN/LSTM models, input sequences must be processed in a strict time order, meaning each time step’s computation depends on the output of the previous step. This inherently prevents parallelization, leading to inefficiencies, particularly when processing long sequences. However, the Transformer uses positional encoding to assign each word a unique positional embedding, allowing the model to abandon the sequential nature of traditional models. Combined with the self-attention mechanism, this allows every word to be processed simultaneously, enabling true parallel computation. This dramatically improves efficiency and makes the Transformer more suitable for running on GPUs and TPUs, laying the foundation for large-scale training.


The Transformer achieved the highest BLEU score

As you can see, the Transformer was originally proposed for the specific area of machine translation and was not initially applied to large language models. It wasn’t until later, with the advancements made by BERT and GPT-1, and the release of GPT-3, which offered end-user products, that the LLM field truly began to flourish.

Today, the Transformer is not only used for machine translation and text generation but has also been widely applied to multimodal fields such as images, audio, and video.

How Transformer Works

How it roughly works

As shown in the diagram below, the fundamental architecture of the Transformer consists of two primary modules: the Encoder and the Decoder.


Transformer Architecture and Workflow

The Encoder and Decoder can be utilized independently or in conjunction. An Encoder-only architecture can serve as a semantic analysis model. For instance, I developed a sentiment analysis model utilizing an Encoder-only framework, trained on IMDB review data to classify the sentiment of input text as either positive or negative. Conversely, a Decoder-only architecture is utilized for autoregressive text generation tasks. Models such as GPT and other large language models (LLMs) predominantly depend on the Decoder structure, as exemplified by my local implementation of GPT-2. The Encoder-Decoder architecture is well-suited for translation tasks, where the Encoder interprets the meaning and content of the source language, while the Decoder generates the output in the target language. You can view the simple architecture I implemented.

The primary function of the Encoder is “understanding”, which entails helping the model grasp the relationships between words. However, computers cannot directly interpret human language; therefore, we first convert natural language into high-dimensional vector representations through Token Embedding, making it suitable for computational processing. Typically, this involves some standard data preprocessing, which is significantly streamlined by tools provided by PyTorch and TensorFlow. After completing the Token Embedding, the model applies Positional Encoding to add unique positional information to each word. This ensures that even during parallel computation, the model can recognize the order of words, thereby preserving the sequence.

Once the human language input is converted into vectors, it enters the Multi-Head Attention (MHA) layer. At this stage, the core attention mechanism of the Transformer evaluates the relevance of each word in relation to all others, allowing the model to capture semantic relationships between words. After calculating these relevance weights through MHA, the data is forwarded to the Feed-Forward Network (FFN) for further processing.

You may have noticed the “Nx” notation in the diagram, which represents the number of layers in the Encoder (and likewise in the Decoder). Within the Encoder, the first layer takes the embedded input sequence, and the output of each layer becomes the input for the next. In an Encoder-Decoder architecture, the output from the final layer of the Encoder also serves as one of the inputs to the Decoder.



The Decoder is responsible for generating the next word by using the relevance weights calculated earlier. While its structure closely mirrors that of the Encoder, it includes a crucial addition: Masked Multi-Head Attention (MHA). This modification is essential in autoregressive models, where the model can only rely on previously generated words to predict the next one, ensuring that future words remain inaccessible during attention calculations.

With this design, the Transformer efficiently processes both semantic and sequential information in natural language, while also enabling large-scale parallel computation. In the sections that follow, we will delve into the details of each component. Let’s start with how machines interpret human language.

Embedding Process

The Embedding Layer is primarily composed of two parts: Token Embedding and Positional Embedding. The primary role of Token Embedding is to convert human language (such as English or Japanese) into vector representations that computers can process and compute.

Let’s illustrate this with the sentence 'I love Ethereum and Transformer.' and examine how Token Embedding is performed. First, the sentence is broken down into smaller units, known as Tokens, resulting in ['I', 'love', 'Ethereum', 'and', 'Transformer', '.']. However, in practice, tokenization results can vary depending on the Tokenization Algorithm used. For example, the Byte Pair Encoding (BPE) algorithm used by models like GPT and Llama3 might further split these Tokens into ['I', 'love', 'Eth', 'ereum', 'and', 'Trans', 'former', '.']. I wrote a simple tokenization algorithm to demonstrate this effect.

Once tokenization is complete, each Token is mapped to a unique Token ID from the vocabulary. This ID is derived from the vocabulary built during the pre-training phase, ensuring that every word or subword has a corresponding number. You can run this code to check the tokenized output and its corresponding IDs (which should match the table below).

| Token | Token ID |
| --- | --- |
| i | 1045 |
| love | 2293 |
| eth | 6832 |
| ##ereum | 24932 |
| and | 199 |
| trans | 3230 |
| ##former | 14458 |
| . | 1012 |
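
If you want to reproduce this mapping yourself, here is a minimal sketch using the Hugging Face transformers library, assuming a WordPiece vocabulary such as bert-base-uncased (which produces the lowercase tokens and “##” subword prefixes shown above). The exact splits and IDs depend on which tokenizer you load, so your output may differ slightly from the table.

```python
# A minimal sketch, assuming the Hugging Face `transformers` package and a
# WordPiece vocabulary (bert-base-uncased). Exact token splits and IDs depend
# on the tokenizer, so the output may differ from the table above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

sentence = "I love Ethereum and Transformer."
tokens = tokenizer.tokenize(sentence)            # e.g. ['i', 'love', 'eth', '##ereum', ...]
token_ids = tokenizer.convert_tokens_to_ids(tokens)

for tok, tid in zip(tokens, token_ids):
    print(f"{tok:10s} -> {tid}")
```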

This step converts natural language into a sequence of integers that machines can process. However, these IDs don’t carry any semantic meaning. To help the model understand the significance of each word, these Token IDs are mapped to high-dimensional vectors, known as Token Embeddings. By looking them up in the Embedding Matrix, each Token ID is matched with a vector that captures the Token’s semantic properties, allowing the model to interpret meaning during language processing.

The table below shows the mapping of Token IDs to their respective Token Embeddings. For simplicity, we set $d_{model}=4$.

| Token | Token ID | Token Embedding |
| --- | --- | --- |
| i | 1045 | [0.12, -0.34, 0.56, 0.78] |
| love | 2293 | [0.22, 0.33, -0.11, 0.44] |
| eth | 6832 | [-0.12, 0.56, -0.44, 0.33] |
| ##ereum | 24932 | [0.65, -0.34, 0.89, -0.12] |
| and | 199 | [0.14, -0.45, 0.33, -0.67] |
| trans | 3230 | [0.22, 0.49, 0.55, -0.44] |
| ##former | 14458 | [-0.12, 0.56, 0.77, -0.11] |
| . | 1012 | [0.44, 0.11, 0.33, -0.22] |

Note: The Token Embeddings shown here are randomly generated from a randomized Embedding Matrix.

The dimension of Token Embedding is typically defined as part of the model architecture. Higher dimensions can capture richer semantic information, but they also increase computational complexity and resource consumption. For example, in the original Transformer paper, the embedding dimension is $d_{model}=512$, while in larger models such as the 6.7B-parameter version of GPT-3, the embedding dimension reaches $d_{model} = 4096$ to handle more intricate language tasks. When we train models, a balance between expressiveness and computational resources is often necessary.
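
To make the lookup concrete, here is a minimal PyTorch sketch of the Embedding Matrix: an nn.Embedding table that maps each Token ID to a $d_{model}$-dimensional vector. As in the note above, the matrix starts out randomly initialized; the vocabulary size used here is only illustrative.

```python
import torch
import torch.nn as nn

d_model = 4            # embedding dimension used in the toy tables above
vocab_size = 30522     # illustrative vocabulary size

# The Embedding Matrix: one d_model-dimensional row per Token ID.
# It is randomly initialized here; training makes the rows semantically meaningful.
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([1045, 2293, 6832, 24932, 199, 3230, 14458, 1012])
token_embeddings = embedding(token_ids)   # shape: (8, 4)
print(token_embeddings)
```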

And since the Transformer model utilizes parallel computation, it cannot convey sequential information recurrently as RNNs do. Therefore, it introduces Positional Encoding to allow the model to recognize the position of each Token in the sentence. In the original Transformer paper, Positional Encodings are generated using sine and cosine functions. This design ensures that the variations across different dimensions are distinct, helping the model capture both local and global positional information.

Specifically, even dimensions use $sin(x)$ functions:

\[PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)\]

While odd dimensions use $cos(x)$ functions:

\[PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}}\right)\]

Based on $pos$ (the position of the Token in the sentence), $i$ (the dimension index), and $d_{model}$, we can calculate the corresponding $PE$ (Positional Encoding) for each dimension. The rate of variation of $PE$ along the position axis is determined by the wavelength term $10000^{\frac{2i}{d_{\text{model}}}}$. Because this term grows with the dimension index, different dimensions oscillate at different frequencies, which ensures that each dimension captures distinct aspects of the position information. For example, as shown in the figure below:



In this way, we can construct the Positional Embedding to match the dimensions of the Token Embedding. Matching the dimensions of the vectors is crucial for performing element-wise addition between the two vectors. To aid understanding, we will continue using the previous example with $d_{model}=4$.

| Token | Position | pos | i | Positional Embedding |
| --- | --- | --- | --- | --- |
| i | no.1 | 0 | [0,1,2,3] | [0.0, 1.0, 0.0, 1.0] |
| love | no.2 | 1 | [0,1,2,3] | [0.84147102, 0.54030228, 0.00999983, 0.99994999] |
| eth | no.3 | 2 | [0,1,2,3] | [0.90929741, -0.41614681, 0.01999867, 0.99980003] |
| ##ereum | no.4 | 3 | [0,1,2,3] | [0.14112, -0.9899925, 0.0299955, 0.99955004] |
| and | no.5 | 4 | [0,1,2,3] | [-0.7568025, -0.65364361, 0.03998933, 0.99920011] |
| trans | no.6 | 5 | [0,1,2,3] | [-0.95892429, 0.28366217, 0.04997917, 0.99875027] |
| ##former | no.7 | 6 | [0,1,2,3] | [-0.27941549, 0.96017027, 0.059964, 0.99820054] |
| . | no.8 | 7 | [0,1,2,3] | [0.65698669, 0.75390226, 0.06994285, 0.99755102] |

Patch: Previously, I generated a randomized set of Positional Embeddings, but we actually need to use the PE formulas to calculate the real PE values here. You can check the calculation code here.

A big thanks to Vitalik for catching the calculation error! If it hadn’t been fixed, it could have caused some misunderstandings among readers. I’m once again amazed by how quickly he learns things 😯.
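
For reference, here is a minimal NumPy sketch (assuming $d_{model}=4$, as above) that reproduces the sinusoidal values in the table, up to floating-point precision:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encodings from the original Transformer paper."""
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, np.newaxis]               # pos
    div = np.power(10000.0, np.arange(0, d_model, 2) / d_model)  # wavelength term
    pe[:, 0::2] = np.sin(positions / div)                        # even dimensions
    pe[:, 1::2] = np.cos(positions / div)                        # odd dimensions
    return pe

pe = positional_encoding(seq_len=8, d_model=4)
print(pe[1])   # position of "love": [0.84147098, 0.54030231, 0.00999983, 0.99995]
```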

Ultimately, by performing element-wise addition of the Token Embedding and the Positional Embedding, we obtain the final Input Embedding. This Embedding contains not only the semantic information of each Token but also integrates the positional information of that Token within the sentence, enabling the model to simultaneously process the meaning of the vocabulary and its contextual positional relationships.

| Token | Token ID | Token Embedding | Positional Embedding | Final Input Embedding |
| --- | --- | --- | --- | --- |
| i | 1045 | [0.12, -0.34, 0.56, 0.78] | [0.0, 1.0, 0.0, 1.0] | [0.12, 0.66, 0.56, 1.78] |
| love | 2293 | [0.22, 0.33, -0.11, 0.44] | [0.84147102, 0.54030228, 0.00999983, 0.99994999] | [1.06147102, 0.87030228, -0.10000017, 1.43994999] |
| eth | 6832 | [-0.12, 0.56, -0.44, 0.33] | [0.90929741, -0.41614681, 0.01999867, 0.99980003] | [0.78929741, 0.14385319, -0.42000133, 1.32980003] |
| ##ereum | 24932 | [0.65, -0.34, 0.89, -0.12] | [0.14112, -0.9899925, 0.0299955, 0.99955004] | [0.79112, -1.3299925, 0.9199955, 0.87955004] |
| and | 199 | [0.14, -0.45, 0.33, -0.67] | [-0.7568025, -0.65364361, 0.03998933, 0.99920011] | [-0.6168025, -1.10364361, 0.36998933, 0.32920011] |
| trans | 3230 | [0.22, 0.49, 0.55, -0.44] | [-0.95892429, 0.28366217, 0.04997917, 0.99875027] | [-0.73892429, 0.77366217, 0.59997917, 0.55875027] |
| ##former | 14458 | [-0.12, 0.56, 0.77, -0.11] | [-0.27941549, 0.96017027, 0.059964, 0.99820054] | [-0.39941549, 1.52017027, 0.829964, 0.88820054] |
| . | 1012 | [0.44, 0.11, 0.33, -0.22] | [0.65698669, 0.75390226, 0.06994285, 0.99755102] | [1.09698669, 0.86390226, 0.39994285, 0.77755102] |

You can check the Input Calculation code here.

We can express this with the formula:

\[f(x_i) = E(x_i) + P(i)\]

where $E(x_i)$ is the Token Embedding for the $i$th word, and $P(i)$ is the Positional Embedding for the $i$th word.

Scaled Dot-Product Attention

Once the model processes the Input Embedding, each word is converted into a vector representation that combines both semantic and positional information. However, at this point, the model only knows the content we have inputted but cannot comprehend the relationships and contextual connections between these contents. To enable the model to effectively capture dependencies and interactions within the sentence, we introduce the Self-Attention Mechanism.

Traditional Attention Mechanisms are typically used to handle relationships between different sequences, especially in tasks like machine translation, where the model computes the mapping between the source and target languages. In contrast, Self-Attention focuses on the relationships among words within the same sequence. By calculating the relevance weights of each word to other words in the sentence, the model is able to establish global contextual dependencies. This allows it to effectively capture the intricate connections between words, enabling a deeper understanding of their relationships throughout the entire sequence.

For the self-attention mechanism, the most common implementation is the Scaled Dot-Product Attention proposed in the paper, which computes the relevance weights between Query and Key using dot products and scales the results to keep gradients stable. To better understand the computation process of Scaled Dot-Product Attention, let’s use the sentence I love Transformer as an example to explore how the model understands the dependencies between these three words.

First, we obtain the Input Embedding for each word through the embedding process, and these Input Embeddings will serve as the input for the Self-Attention mechanism.

| Token | Token Embedding | Positional Encoding | Input Embedding |
| --- | --- | --- | --- |
| I | [1, 0, 1] | [0.1, 0.2, 0.3] | [1.1, 0.2, 1.3] |
| love | [0, 1, 0] | [0.2, 0.1, 0.2] | [0.2, 1.1, 0.2] |
| Transformer | [1, 0, 0] | [0.3, 0.1, 0.1] | [1.3, 0.1, 0.1] |

To simplify matrix operations, the Token Embedding and Positional Embedding values are predefined.

In the Self-Attention mechanism, each word generates a corresponding set of Query, Key, and Value. For example, in the case of the word "I", the Query is a vector that represents how the word seeks information or “queries” other words in the sentence. The Key, on the other hand, is a vector that encodes the identifying features of the word itself. When other words interact with “I”, they compare their Query to the Key of “I” to assess how relevant “I” is to them. The Value holds the semantic information of a word, ensuring that its original meaning is preserved and effectively transmitted through the attention mechanism.

The Q, K, and V are generated by multiplying the word’s Input Embedding by the respective weight matrices. These weight matrices are crucial, with dimensions of $d_{model} \times d_k$, where $d_{model}$ is the dimension of the Input Embedding and $d_k$ is the dimension of the Query and Key. For simplification, we set $d_{model} = d_k = 3$. Now, I will use three example weight matrices to illustrate the calculation.

Example Weight Matrix of Query:

\[W^Q = \begin{bmatrix} 0.5 & 0.2 & 0.1 \\ 0.4 & 0.3 & 0.7 \\ 0.2 & 0.5 & 0.6 \end{bmatrix}\]

Example Weight Matrix of Key:

\[W^K = \begin{bmatrix} 0.6 & 0.1 & 0.3 \\ 0.5 & 0.7 & 0.2 \\ 0.3 & 0.4 & 0.8 \end{bmatrix}\]

Example Weight Matrix of Value:

\[W^V = \begin{bmatrix} 0.1 & 0.3 & 0.5 \\ 0.4 & 0.5 & 0.6 \\ 0.2 & 0.7 & 0.8 \end{bmatrix}\]

In real-world applications, these weight matrices are usually more complex. At model initialization, they are assigned random values, and as training progresses, these matrices are gradually optimized through backpropagation. This process allows the model to learn how to generate more effective Query, Key, and Value, thereby improving its performance and capacity to capture complex relationships.

Next, we perform matrix multiplication with the given weight matrices $W^Q$, $W^K$, and $W^V$ to compute the $Q$, $K$, and $V$ for the word "I".

\[Q_I = W^Q \cdot \begin{bmatrix} 1.1 \\ 0.2 \\ 1.3 \end{bmatrix} = \begin{bmatrix} 0.5 \cdot 1.1 + 0.2 \cdot 0.2 + 0.1 \cdot 1.3 \\ 0.4 \cdot 1.1 + 0.3 \cdot 0.2 + 0.7 \cdot 1.3 \\ 0.2 \cdot 1.1 + 0.5 \cdot 0.2 + 0.6 \cdot 1.3 \end{bmatrix} = \begin{bmatrix} 0.72 \\ 1.41 \\ 1.10 \end{bmatrix}\] \[K_I = W^K \cdot \begin{bmatrix} 1.1 \\ 0.2 \\ 1.3 \end{bmatrix} = \begin{bmatrix} 0.6 \cdot 1.1 + 0.1 \cdot 0.2 + 0.3 \cdot 1.3 \\ 0.5 \cdot 1.1 + 0.7 \cdot 0.2 + 0.2 \cdot 1.3 \\ 0.3 \cdot 1.1 + 0.4 \cdot 0.2 + 0.8 \cdot 1.3 \end{bmatrix} = \begin{bmatrix} 1.07 \\ 0.95 \\ 1.45 \end{bmatrix}\] \[V_I = W^V \cdot \begin{bmatrix} 1.1 \\ 0.2 \\ 1.3 \end{bmatrix} = \begin{bmatrix} 0.1 \cdot 1.1 + 0.3 \cdot 0.2 + 0.5 \cdot 1.3 \\ 0.4 \cdot 1.1 + 0.5 \cdot 0.2 + 0.6 \cdot 1.3 \\ 0.2 \cdot 1.1 + 0.7 \cdot 0.2 + 0.8 \cdot 1.3 \end{bmatrix} = \begin{bmatrix} 0.82 \\ 1.32 \\ 1.40 \end{bmatrix}\]

The same computational method can be applied to obtain the $Q$, $K$, and $V$ for the words love and Transformer.

\[Q_{Love} = \begin{bmatrix} 0.34 \\ 0.55 \\ 0.71 \end{bmatrix}, \quad K_{Love} = \begin{bmatrix} 0.29 \\ 0.91 \\ 0.66 \end{bmatrix}, \quad V_{Love} = \begin{bmatrix} 0.45 \\ 0.75 \\ 0.97 \end{bmatrix}\] \[Q_{Transformer} = \begin{bmatrix} 0.68 \\ 0.62 \\ 0.37 \end{bmatrix}, \quad K_{Transformer} = \begin{bmatrix} 0.82 \\ 0.74 \\ 0.51 \end{bmatrix}, \quad V_{Transformer} = \begin{bmatrix} 0.21 \\ 0.63 \\ 0.41 \end{bmatrix}\]

Once we have the $Q$, $K$, and $V$, we can formally introduce the Scaled Dot-Product Attention formula. The core operation involves computing the dot product of the Query and Key, ${QK}^T$, scaling the result by $\frac{1}{\sqrt{d_k}}$ (where $d_k$ is the dimension of the Key), and then applying the softmax function to normalize the results. This yields attention weights, where each weight represents the relative focus on different words. These weights range between $0$ and $1$, and their sum equals $1$. A higher weight indicates greater attention to the corresponding word, and vice versa.

\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]

The ${\sqrt{d_k}}$ in the formula serves as a scaling factor to prevent the dot product value ${QK}^T$ from becoming excessively large, ensuring gradient stability. As illustrated, without scaling, the dot product values (represented by the green line) would grow rapidly with increasing input values, potentially causing the softmax function to convert all large values into values close to 1, while driving smaller values towards 0, thus affecting the final attention distribution. Therefore, the scaling factor is essential.



After computing $\frac{Q \cdot K^T}{\sqrt{d_k}}$, we obtain the attention scores for the word “I”: (2.139, 1.281, 1.267). This indicates that “I” has the highest relevance to itself, followed by the second word, “love”, and finally the third word, “Transformer”.

Relevance between “I” and “I”:

\[\frac{Q_I \cdot K_I^T}{\sqrt{d_k}} = \frac{(0.72 \cdot 1.07 + 1.41 \cdot 0.95 + 1.10 \cdot 1.45)}{\sqrt{3}} = \frac{3.7049}{1.732} \approx 2.139\]

Relevance between “I” and “Love”:

\[\frac{Q_I \cdot K_{love}^T}{\sqrt{d_k}} = \frac{(0.72 \cdot 0.29 + 1.41 \cdot 0.91 + 1.10 \cdot 0.66)}{\sqrt{3}} = \frac{2.2179}{1.732} \approx 1.281\]

Relevance between “I” and “Transformer”:

\[\frac{Q_I \cdot K_{Transformer}^T}{\sqrt{d_k}} = \frac{(0.72 \cdot 0.82 + 1.41 \cdot 0.74 + 1.10 \cdot 0.51)}{\sqrt{3}} = \frac{2.1948}{1.732} \approx 1.267\]

Similarly, we can calculate the relevance of “love” and “Transformer” with the other words:

\[Love = (1.106, 0.616, 0.605)\] \[Transformer = (1.070, 0.581, 0.696)\]

Next, we apply the softmax operation, which turns the scores into an intuitive probability distribution. You may notice that before applying softmax, the values vary in magnitude, but after the softmax operation, they are transformed into probabilities that sum to 1, offering a clear and intuitive way to compare the relevance of each word.

\[\text{Softmax}(I) = (0.543, 0.230, 0.227)\] \[\text{Softmax}(love) = (0.451, 0.276, 0.273)\] \[\text{Softmax}(Transformer) = (0.435, 0.266, 0.299)\]

Finally, we use the attention weights calculated through softmax to weight the corresponding word’s $V$, and then sum them to obtain a final output vector that contains contextual information. This output vector not only captures the relevance between words but also retains each word’s semantic information, ensuring the model understands the meaning of each word within the broader context.

Using Attention(I) to show the calculation process, where $\text{Softmax}(I)_k$ denotes the $k$th component of $\text{Softmax}(I)$:

\[\text{Attention}_I = \text{Softmax}(I)_1 \cdot V_I + \text{Softmax}(I)_2 \cdot V_{\text{Love}} + \text{Softmax}(I)_3 \cdot V_{\text{Transformer}}\] \[= 0.543 \cdot \begin{bmatrix} 0.82 \\ 1.32 \\ 1.40 \end{bmatrix} + 0.230 \cdot \begin{bmatrix} 0.45 \\ 0.75 \\ 0.97 \end{bmatrix} + 0.227 \cdot \begin{bmatrix} 0.21 \\ 0.63 \\ 0.41 \end{bmatrix}\]

In the end, we get:

\[Attention_I = (0.596, 1.032, 1.076)\] \[Attention_{Love} = (0.551, 0.974, 1.011)\] \[Attention_{Transformer} = (0.539, 0.962, 0.990)\]
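
The entire worked example can be reproduced in a few lines of NumPy. The sketch below follows the example’s convention of computing $Q_i = W^Q \cdot x_i$ for each token, hence the transposes when working with row vectors:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Input Embeddings for "I", "love", "Transformer" (one row per token)
X = np.array([[1.1, 0.2, 1.3],
              [0.2, 1.1, 0.2],
              [1.3, 0.1, 0.1]])

W_Q = np.array([[0.5, 0.2, 0.1], [0.4, 0.3, 0.7], [0.2, 0.5, 0.6]])
W_K = np.array([[0.6, 0.1, 0.3], [0.5, 0.7, 0.2], [0.3, 0.4, 0.8]])
W_V = np.array([[0.1, 0.3, 0.5], [0.4, 0.5, 0.6], [0.2, 0.7, 0.8]])

# The example computes Q_i = W^Q · x_i on column vectors,
# which for row vectors of X is X @ W^T.
Q, K, V = X @ W_Q.T, X @ W_K.T, X @ W_V.T

d_k = K.shape[-1]
scores = Q @ K.T / np.sqrt(d_k)       # first row ≈ [2.139, 1.281, 1.267]
weights = softmax(scores, axis=-1)    # first row ≈ [0.543, 0.230, 0.227]
output = weights @ V                  # first row ≈ [0.596, 1.032, 1.076]
print(output)
```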

Multi-Head Attention

Understanding Scaled Dot-Product Attention makes it much easier to grasp Multi-Head Attention (MHA). In single-head attention, relevance weights between words are captured using just one set of Query, Key, and Value. However, this approach can only capture one type of contextual relationship. Since the semantic information in language is often complex and layered, we need the model to capture relationships from multiple perspectives. This is where the Multi-Head Attention mechanism becomes essential.

The core idea of Multi-Head Attention is to execute multiple Scaled Dot-Product Attention computations in parallel. Multi-Head Attention allows the model to perform attention operations using different Query, Key, and Value weight matrices in multiple distinct subspaces. This enables the model to understand the relationships between words from various angles and to capture more contextual information across different semantic spaces.

In terms of formulas, the specific process of Multi-Head Attention involves linearly transforming the input Query, Key, and Value through different weight matrices to create multiple heads. For each head, Scaled Dot-Product Attention is applied using the respective Query, Key, and Value. Finally, the outputs from all heads are concatenated and passed through a linear transformation to produce the final output vector.



The Multi-Head Attention (MHA) mechanism can be formally expressed as:

\[\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h) W^O\]

This formula shows that MHA concatenates the output from multiple attention heads and applies a linear transformation using a weight matrix $W^O$. By processing information from different heads simultaneously, MHA allows the model to capture various types of relationships within the data, helping it better understand complex dependencies and express nuanced contextual information.

Each individual head is computed as:

\[\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)\]

Here, the Query, Key, and Value are linearly transformed by weight matrices $W_i^Q$, $W_i^K$, and $W_i^V$ for each head, and then the Scaled Dot-Product Attention is applied. This ensures that each head captures different aspects of the input relationships.
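
Here is a minimal NumPy sketch of this idea, with an illustrative setup of $h = 2$ heads and $d_k = d_v = 2$. A library implementation such as PyTorch’s nn.MultiheadAttention fuses these projections for efficiency, but the logic is the same:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled Dot-Product Attention from the previous section."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(X, heads, W_O):
    """heads: list of (W_Q, W_K, W_V) triples, one per head."""
    head_outputs = [attention(X @ W_Q, X @ W_K, X @ W_V) for W_Q, W_K, W_V in heads]
    return np.concatenate(head_outputs, axis=-1) @ W_O   # Concat(head_1, ..., head_h) W^O

# Illustrative setup: d_model = 4, h = 2 heads, d_k = d_v = 2, random weights
rng = np.random.default_rng(0)
d_model, h, d_k = 4, 2, 2
X = rng.normal(size=(3, d_model))                       # 3 tokens, one row each
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

print(multi_head_attention(X, heads, W_O).shape)        # (3, 4)
```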

Masked Multi-Head Attention

For models that primarily utilize the Decoder module (such as LLMs like GPT), the model autoregressively predicts the next word based on all previously input words. However, whether using Scaled Dot-Product Attention or Multi-Head Attention, the model calculates the relevance weights among all words during training, meaning it can effectively “see” all words in the entire sentence.

To prevent information leakage, we need to ensure that at time step $i$, the model can only infer based on words up to $i−1$ and cannot access words from time steps $i$ and beyond. This mechanism guarantees that when generating the next word, the model does not rely on future data, thereby simulating a realistic generation process.

To achieve this masking of future words, we introduce a masking mechanism in Multi-Head Attention. Specifically, after computing the dot product of the Query and Key $QK^T$, the model adds a masking matrix $M$. The role of this matrix is to mark the positions of all future words as negative infinity ($-∞$), thereby preventing their relevance weights from being included in the attention allocation.

The resulting operations can be expressed as:

\[QK^T + M\]

And the softmax normalization is applied as follows:

\[\text{softmax}(QK^T + M)\]

Since the softmax function exponentiates the input values, any value that is added to $-∞$ will converge to 0 after the exponential function. Therefore, the masking matrix $M$ ensures that the relevance weight of those future words, after the softmax operation, will be close to 0, meaning the model will assign almost no attention to these words. The final attention mechanism is expressed as:

\[\text{Masked Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}} + M \right)V\]

By employing the masking mechanism, Masked Multi-Head Attention effectively ensures that the model focuses solely on the current word and its preceding words, thus avoiding the issue of future information leakage. This guarantees that in generation tasks, the model can rely only on past contextual information to predict the next word without “sneaking a peek” at subsequent words.
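
Here is a minimal NumPy sketch of how such a mask can be built and applied; the random scores below stand in for $\frac{QK^T}{\sqrt{d_k}}$:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
# M[i][j] = -inf for j > i: position i may not attend to future positions.
M = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))  # stand-in for QK^T / sqrt(d_k)
weights = softmax(scores + M, axis=-1)

print(np.round(weights, 3))
# Upper-triangular entries are 0: each row attends only to itself and earlier tokens.
```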

Feed-Forward Network

In addition to the attention mechanism, both the Encoder and Decoder modules contain a Feed-Forward Network (FFN) layer.



The FFN can be simply understood as a “feature enhancer,” which processes the input for each word individually through two fully connected neural network layers. Unlike the Attention mechanism, the FFN does not consider the relationships between words; instead, it performs deeper computations on each word independently, helping the model better understand the characteristics of the word in its current context.

The basic workflow is as follows:

\[z_1 = xW_1 + b_1\] \[a_1 = \text{max}(0, z_1)\] \[FFN(x) = a_1W_2 + b_2\]

The final result is generated after two linear transformations and one non-linear processing step. This process strengthens and adjusts the features of each word, enabling the model to better understand their meanings within the context.

\[FFN(x) = \text{max}(0, xW_1 + b_1)W_2 + b_2\]
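
As a minimal PyTorch sketch, using the original paper’s default sizes ($d_{model}=512$ and an inner dimension of 2048):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, applied to each token independently."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear2(self.relu(self.linear1(x)))

x = torch.randn(2, 8, 512)        # (batch, sequence length, d_model)
print(FeedForward()(x).shape)     # torch.Size([2, 8, 512])
```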

Residual Connections and Layer Normalization

From the architecture of Transformer, you can observe that after completing Multi-Head Attention (MHA), Masked MHA, and Feed-Forward Network (FFN), an Add & Norm operation is performed on the output. Here, Add refers to Residual Connections, which sums the output of the attention mechanism with the original input, while Norm represents Layer Normalization. After the Residual Connections, the result undergoes normalization to ensure that the output maintains the same mean and variance across each feature dimension, thereby accelerating model convergence and ensuring stability during training.



The introduction of Residual Connections addresses the issues of vanishing and exploding gradients in deep neural networks. As the number of layers in the network increases, gradients may become unstable during backpropagation, making model training challenging. Residual Connections effectively mitigate this problem by allowing the input to bypass the layer and be added to the layer’s output.

Specifically, suppose the input to a certain layer is $x$ and the output of that layer is $\text{SubLayer}(x)$. The formula for Residual Connections can be expressed as:

\[y_1 = x + \text{SubLayer}(x)\]

Where $y_1$ is the output after the residual connection. This approach allows the model to retain the original input information while incorporating the transformation output from the current layer. Even in networks with many layers, the input information can still be smoothly propagated to deeper levels, effectively preventing the vanishing gradient problem.

Layer Normalization, on the other hand, is a regularization technique primarily used to enhance model training effectiveness. Normalization is applied after the output of each sublayer, ensuring that the output maintains the same mean and variance across all feature dimensions. This contributes to model stability and accelerates convergence.

The formula can be represented as:

\[y_2 = \text{LayerNorm}(y_1)\]

Which is equivalent to:

\[y_2 = \text{LayerNorm}(x + \text{SubLayer}(x))\]

This combination ensures that the inputs and outputs at each layer of the model can be propagated stably, addressing gradient issues in model training.
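
In code, the Add & Norm step simply wraps whichever sub-layer precedes it (attention or FFN). A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class AddNorm(nn.Module):
    """y = LayerNorm(x + SubLayer(x)): residual connection followed by layer normalization."""
    def __init__(self, d_model: int = 512):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, sublayer: nn.Module) -> torch.Tensor:
        return self.norm(x + sublayer(x))

d_model = 512
add_norm = AddNorm(d_model)
# Any sub-layer works here; an FFN is used as the example.
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

x = torch.randn(2, 8, d_model)
y = add_norm(x, ffn)              # same shape as x, with normalized statistics
print(y.shape)
```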

Encoder-Decoder Architecture

In previous discussions, we mentioned that the Encoder and Decoder modules in the Transformer architecture can be used independently. However, in tasks such as machine translation, the Encoder and Decoder modules are typically used together.

In this model architecture, the Encoder first receives the input in the source language and converts each word into a high-dimensional vector representation through Input Embedding. These embeddings then pass through multiple layers of the Encoder, with the output of each layer serving as the input for the next, ultimately generating an encoded representation that captures the global context. This encoded representation contains the relative relationships and global dependencies of all words in the source language, providing the necessary contextual information for the subsequent Decoder.

Next, the Decoder module begins its work. When generating the target sequence, the Decoder first receives the previously generated words (initially empty or represented by a special token, such as <START>), and these inputs are fed into the Masked Multi-Head Attention (Masked MHA) layer. In the Masked MHA, the model can only focus on the current word and the previously generated words, while future words are masked, ensuring that the model does not “see” subsequent words during the generation process.

After the computation of the Masked MHA, the results are combined with the output from the Encoder module. At this point, the second attention module in the Decoder, Multi-Head Attention (MHA), begins to operate. It receives the global context representation from the Encoder and helps generate a more accurate translation by calculating the dependencies between the target language and the source language. In this process, MHA combines the context information generated by the Encoder as Key and Value with the output of the Masked MHA (Query) to form more precise predictions for the target words.

The collaboration between Masked MHA and MHA is crucial:

By combining these two attention mechanisms, the Decoder can consider both the already generated portion (through Masked MHA) and the global context information from the source language (through MHA) at each step of generating the target sequence, ultimately producing high-quality output.

The output of each step in the Decoder is then passed to the subsequent Feed-Forward Network (FFN), followed by Residual Connections and Layer Normalization, providing input for the next layer of the Decoder and continuing the prediction of the next word. This process repeats until the complete target sequence is generated.
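
This data flow can be sketched with PyTorch’s built-in nn.Transformer module. The snippet below only illustrates how the pieces connect; a real translation model would add token embeddings, positional encodings, and an output projection instead of the random tensors used here:

```python
import torch
import torch.nn as nn

d_model = 512
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 10, d_model)   # embedded source sentence (10 tokens)
tgt = torch.randn(1, 4, d_model)    # embedded target tokens generated so far

# Causal mask so the decoder's Masked MHA cannot see future target positions.
tgt_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

# The encoder output serves as Key/Value for the decoder's second attention block.
out = model(src, tgt, tgt_mask=tgt_mask)   # shape: (1, 4, d_model)
print(out.shape)
```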

By understanding this example of machine translation, you can clearly grasp the workflow within the multi-layer architecture of the Transformer. Whether it’s how the Encoder captures global context through multi-layer self-attention or how the Masked MHA and MHA in the Decoder work together to utilize past generated words and the context of the source language, the entire process demonstrates the powerful capabilities of the Transformer in handling sequence-to-sequence tasks. Once you master this workflow, you’ll find it easy to understand the applications and extensions of the Transformer in other tasks.

Conclusion

Good news! You’ve now gained a solid understanding of the entire Transformer architecture. However, the Transformer was introduced seven years ago, and this foundational knowledge alone is not enough to satisfy our curiosity.

I’m particularly intrigued by the optimizations that BERT and the GPT family have made to the Transformer architecture, as well as the latest advancements in LLM research. These improvements have pushed the boundaries of what Transformers can achieve. Additionally, I’m eager to explore the applications of Transformers in areas like audio and video processing, as well as projects such as AlphaStar, AlphaCode, and AlphaFold2. While these projects go well beyond the vanilla Transformer architecture, they represent some of the most groundbreaking work in AI. My research into these topics might lead to the release of version 2.0 and 3.0 of the “Understanding Transformer” series.

Furthermore, after D-Day, I’ve developed several ideas for experimental projects, and I plan to begin building proof-of-concept versions. I believe we are still underestimating the profound and comprehensive impact AI will have on society. Over the next two years, this trend will likely accelerate, bringing even more powerful AI technologies. This will be especially true as AI-enabled robotics begin entering consumer hardware and AI is integrated with space technology. The future promises to be incredibly exciting, so let’s ride this wild rodeo and see how far it goes.