Transformers
- The Transformer is a sequence-to-sequence model: given an input sequence, it produces an output sequence.
- Architecture: an Encoder processes the input; a Decoder generates the output autoregressively.
- Autoregressive: the decoder generates one token at a time, conditioning on all previously generated tokens.
Table of Contents
- Input Text Sequence Representation
- Encoders
- From MLP to Attention
- Self-Attention & Multi-Head Attention
- Decoders
- Masked Attention
- Encoder-Decoder Cross Attention
- Appendix
Input Text Sequence Representation

Tokenization
Two approaches to representing input text:
1. One-hot encoding
- No semantic similarity or meaning of words encoded.
2. Token embedding
- Encodes semantic similarity between words.
- The embedding matrix is learned (a lookup table).
- Each token embedding is stored as a column vector; see the sketch after this list.
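A minimal NumPy sketch of both representations, assuming a toy vocabulary and a randomly initialized embedding matrix (in a real model $W_E$ is learned):

```python
import numpy as np

vocab = {"i": 0, "bought": 1, "an": 2, "apple": 3, "orange": 4}
vocab_size, d_model = len(vocab), 8

# 1. One-hot encoding: each token is a sparse vector with a single 1.
#    Every pair of distinct tokens is orthogonal, so no similarity is encoded.
one_hot = np.eye(vocab_size)
apple_onehot = one_hot[vocab["apple"]]

# 2. Token embedding: a learned lookup table W_E (random here, for illustration).
#    Each column is the dense embedding of one token.
rng = np.random.default_rng(0)
W_E = rng.normal(size=(d_model, vocab_size))
apple_embed = W_E[:, vocab["apple"]]          # column vector for "apple"

# Looking up an embedding is equivalent to multiplying W_E by the one-hot vector.
assert np.allclose(apple_embed, W_E @ apple_onehot)
```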

Why We Need Context
- Many words have different meanings in different contexts:
- “I bought an apple & an orange”
- “I bought an apple watch”
- We need to rely on context to resolve the ambiguity.
Encoders
The encoder pipeline:
\[\text{Input} \to \text{token} + \text{POS embed} \to \text{Norm} \to \text{MHA(self)} \to \text{Add} \to \text{Norm} \to \text{FFN} \to \text{Add} \to \text{Output}\]
- Input tokens are embedded using $W_E$ and combined with positional encodings to produce the input matrix $X$.
- The architecture stacks: Multi-Head Self-Attention + residual, then FeedForward Network + residual.
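As a concrete illustration of the order of operations, here is a minimal NumPy sketch of one pre-norm encoder block; the single-head `self_attn` stands in for the multi-head self-attention defined below, and all weights are random placeholders:

```python
import numpy as np
rng = np.random.default_rng(0)

seq_len, d_model, d_ff = 4, 8, 32
X = rng.normal(size=(seq_len, d_model))            # token + positional embeddings

def norm(x, eps=1e-5):                             # LayerNorm over each token vector
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def self_attn(x):                                  # single-head stand-in for MHA(self)
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

def ffn(x):                                        # position-wise feed-forward network
    W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
    return np.maximum(0, x @ W1) @ W2

# Pre-norm encoder block: Norm -> MHA(self) -> Add, then Norm -> FFN -> Add
H = X + self_attn(norm(X))
out = H + ffn(norm(H))
print(out.shape)                                   # (4, 8): same shape as the input
```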

From MLP to Attention
MLP only
- MLP stands for Multilayer Perceptron
- No contextual information; each token is processed independently.
Concatenation of nearby token embeddings before MLP
- Need a sufficiently large window to cover the entire input sequence.
- Cannot handle variable sequence lengths.
- Requires many model parameters.
Attention
- Compute dot products between token embeddings to measure how relevant each token is to every other token.
- This lets the model dynamically weight which parts of the input matter for each position (see the sketch below).
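A tiny NumPy sketch of the dot-product idea, using made-up embeddings for three tokens; the softmaxed scores are the weights each token assigns to every other token:

```python
import numpy as np

# Toy 4-dimensional embeddings (invented for illustration only).
tokens = ["apple", "orange", "watch"]
X = np.array([[1.0, 0.2, 0.0, 0.5],
              [0.9, 0.3, 0.1, 0.4],
              [0.1, 0.8, 0.9, 0.0]])

scores = X @ X.T                 # pairwise dot products: similarity of every token pair
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax per row

for t, w in zip(tokens, weights):
    print(t, np.round(w, 2))     # each row: how strongly that token attends to the others
```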
Self-Attention & Multi-Head Attention
Self-Attention
\[Q = XW_Q, \quad K = XW_K, \quad V = XW_V\]
\[\text{head}_i = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V = \text{Attention}(Q, K, V)\]
Multi-Head Attention (MHA)
In practice each head $h_i$ is computed with its own projection matrices $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$, so different heads can attend to different aspects of the sequence.
\[\text{MHA}(X) = \text{multi-head}(Q, K, V) = \text{Concat}(h_1, h_2, \ldots, h_H)\, W_O = Z\]
\[\begin{aligned} \text{where} \\ h_i &\equiv i\text{-th attention head} \\ W_O &\equiv \text{output projection matrix} \\ Z &\equiv \text{final output} \end{aligned}\]
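A minimal NumPy sketch of multi-head self-attention following the formulas above, with random projection matrices and a separate $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ per head:

```python
import numpy as np
rng = np.random.default_rng(0)

seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads                      # per-head dimension
X = rng.normal(size=(seq_len, d_model))       # token + positional embeddings

def softmax(s):
    e = np.exp(s - s.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

heads = []
for _ in range(n_heads):
    # Each head has its own projection matrices W_Q^(i), W_K^(i), W_V^(i).
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

W_O = rng.normal(size=(n_heads * d_k, d_model))   # output projection matrix
Z = np.concatenate(heads, axis=-1) @ W_O          # MHA(X): shape (seq_len, d_model)
print(Z.shape)
```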
Decoders
The decoder pipeline:
\[\text{Masked MHA} \to \text{Cross-Attn} \to \text{FFN} \to \text{Linear} \to \text{Softmax} \to \text{Output}\]
The decoder takes as input the previously generated tokens $(z_i)$ along with their positional encodings $(p_i)$, and at each step attends both to itself (masked) and to the encoder output (cross attention).
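A sketch of the autoregressive loop, assuming a hypothetical `decoder_step` function that stands in for the full decoder stack and returns next-token logits given the encoder output and the tokens generated so far:

```python
import numpy as np
rng = np.random.default_rng(0)
vocab_size, bos_id, eos_id = 100, 0, 1

def decoder_step(encoder_out, generated):
    # Hypothetical stand-in for the decoder stack: returns logits over the vocabulary
    # for the next token (random here, for illustration).
    return rng.normal(size=vocab_size)

encoder_out = rng.normal(size=(6, 8))        # pretend encoder output for a 6-token source
generated = [bos_id]                         # start from a beginning-of-sequence token
for _ in range(20):                          # cap the output length
    logits = decoder_step(encoder_out, generated)
    next_id = int(np.argmax(logits))         # greedy choice; sampling is also common
    generated.append(next_id)
    if next_id == eos_id:                    # stop once end-of-sequence is produced
        break
print(generated)
```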
Masked Attention
In the decoder, we use masked (causal) self-attention to prevent the decoder from attending to future tokens:
\[\text{Masked Attn}(Q, K, V) = \text{Softmax}\!\left(\frac{QK^T}{\sqrt{d_k}} + M\right)V\]
\[\begin{aligned} \text{where} \\ M &\equiv \text{lookahead mask} \end{aligned}\]
\[M = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{bmatrix}\]
Adding $-\infty$ to future positions drives their softmax weights to zero, ensuring position $i$ can only attend to positions $\leq i$.
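A small NumPy sketch of the lookahead mask and its effect on the attention weights, assuming random $Q$, $K$, $V$:

```python
import numpy as np
rng = np.random.default_rng(0)

seq_len, d_k = 4, 8
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

# Lookahead mask M: 0 on and below the diagonal, -inf above (future positions).
M = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = Q @ K.T / np.sqrt(d_k) + M
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)

# The upper triangle is exactly zero: position i only attends to positions <= i.
print(np.round(weights, 2))
out = weights @ V
```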
Encoder-Decoder Cross Attention
The decoder queries the encoder output $E$ to incorporate source context:
\[\begin{aligned} Y' &= Y + \text{Masked MHA}(\text{Norm}(Y)) \\[6pt] Q &= Y' W_Q^{\text{dec}} \\ K^{\text{enc}} &= E W_K^{\text{enc}} \\ V^{\text{enc}} &= E W_V^{\text{enc}} \end{aligned}\]
\[\begin{aligned} \text{where} \\ E &\equiv \text{encoder output} \end{aligned}\]
\[\text{Cross Attn}(Y', E) = \text{Softmax}\!\left(\frac{Q (K^{\text{enc}})^T}{\sqrt{d_k}}\right)V^{\text{enc}}\]
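A NumPy sketch of the cross-attention step with random weights; the queries come from the decoder states $Y'$ and the keys and values from the encoder output $E$ (a single full-width head here, so $d_k = d_{\text{model}}$):

```python
import numpy as np
rng = np.random.default_rng(0)

src_len, tgt_len, d_model = 6, 4, 8
E  = rng.normal(size=(src_len, d_model))       # encoder output (source sequence)
Yp = rng.normal(size=(tgt_len, d_model))       # decoder states Y' after masked MHA

W_Q_dec, W_K_enc, W_V_enc = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q     = Yp @ W_Q_dec                           # queries from the decoder
K_enc = E  @ W_K_enc                           # keys from the encoder output
V_enc = E  @ W_V_enc                           # values from the encoder output

scores = Q @ K_enc.T / np.sqrt(d_model)        # (tgt_len, src_len): each target position
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)      # attends over all source positions
cross = weights @ V_enc                        # (tgt_len, d_model)
print(cross.shape)
```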
Then the rest of the decoder:
\[\begin{aligned} Y'' &= Y' + \text{Cross Attn}(\text{Norm}(Y'),\ E) \\ Y^O &= Y'' + \text{FFN}(\text{Norm}(Y'')) \\ D &= Y^O \\ \text{logits} &= DW^{\text{out}} + b \\ P(y_t) &= \text{Softmax}(\text{logits}_t) \end{aligned}\]
\[\begin{aligned} \text{where} \\ D &\equiv \text{decoder output} \\ W^{\text{out}} &\equiv \text{output projection to vocabulary} \\ P(y_t) &\equiv \text{probability distribution over next token} \end{aligned}\]
Citation
If you found this blog post helpful, please consider citing it:
@article{obasi2026transformers,
title = "Transformers",
author = "Obasi, Chizoba",
journal = "chizkidd.github.io",
year = "2026",
month = "Apr",
url = "https://chizkidd.github.io/2026/04/05/transformers/"
}