1. The Breaking Point: When RNNs Hit the Wall
For years, sequence modeling was ruled by RNNs and LSTMs. They were the go-to models for text, speech, and time-series data: anything where order mattered.
The idea behind them was simple but clever: process data one step at a time, and pass information forward through a hidden state. This way, the model could "remember" previous inputs as it read new ones.
It worked well for short sequences. But the cracks appeared quickly.
The Real Problems
1. Vanishing/Exploding Gradients - the famous one everyone talks about. But here's what matters practically: even with gradient clipping and LSTMs, you're still fighting an uphill battle. Information from token 1 has to survive 100+ sequential transformations to influence token 100. That's a game of telephone with exponential decay.
2. Sequential Bottleneck - this is the killer. Every step waits for the previous one. Your GPU sits there, mostly idle, processing one token at a time. It's like having a 100-lane highway but being forced to drive single-file.
3. The Hidden State Compression Problem - here's the intuition nobody tells you:
Imagine I tell you a story and ask: "Now summarize everything important in exactly 512 numbers." Then I add more story. "Okay, still 512 numbers. Don't forget the beginning!"
That's what we asked RNNs to do.
LSTMs added "gates" - like giving you permission to forget certain things. Better, but still fundamentally a lossy compression game.
The Insight That Changed Everything
In 2014, Bahdanau introduced attention for neural machine translation. The key insight wasn't the math - it was the question:
"Why compress the entire source sentence into one vector when the decoder can just look back and grab what it needs?"
It's the difference between:
Taking notes on a book, then writing an essay from memory (RNN)
Writing an essay with the book open, referencing specific passages (Attention)
But they still used RNNs to process the sequence sequentially.
In 2017, Vaswani et al. asked the radical question:
"What if we throw out recurrence entirely and use only attention?"
That paper, "Attention Is All You Need", became one of the most cited AI papers of the decade.
2. Architecture: Self-Attention Under the Hood
Let me show you what actually happens inside a Transformer, with the intuition first, math second.
2.1 The Core Idea: Attention as Database Lookup
Think of self-attention as a differentiable database query.
Every token in your sequence is simultaneously:
A query asking: "What information do I need?"
A key announcing: "I contain this type of information"
A value holding: "Here's my actual content"
When processing the word "bank" in "I withdrew money from the bank", the token:
Queries for context about transactions, finance
Keys from nearby tokens like "money" and "withdrew" light up
Values from those tokens flow into "bank"'s new representation
The genius: every token queries every other token simultaneously.
2.2 The Math (Now That You Get It)
For each token, we create three vectors via learned projections:
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I carry?
Compute relevance scores between all query-key pairs:
Score(Q_i, K_j) = Q_i · K_j
Scale to prevent saturation (critical for training stability):
Scaled Score = (Q_i · K_j) / √d_k
Why divide by √d_k? Because dot products grow with dimensionality. Without scaling, softmax gets extreme values (0.00001, 0.00001, 0.99998) instead of smooth distributions. This kills gradient flow.
Apply softmax to get attention distribution:
Attention Weights = softmax(QK^T / √d_k)
Compute weighted sum of values:
Self-Attention(Q, K, V) = softmax(QK^T / √d_k)V
All tokens processed in parallel, one massive matrix multiplication.
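To make this concrete, here's a minimal NumPy sketch of scaled dot-product attention. This is my own illustrative implementation, not code from the paper; the function names, shapes, and toy data are all assumptions.

import numpy as np

def softmax(x, axis=-1):
    # subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: (n_tokens, d_k) arrays produced by learned projections (assumed given)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (n, n) relevance scores
    weights = softmax(scores, axis=-1)    # each row is a distribution over tokens
    return weights @ V, weights           # (n, d_k) new representations

# toy example: 6 tokens, 8-dimensional projections
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 6, 8))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)  # (6, 8) (6, 6)

Notice that the whole thing is two matrix multiplications and a softmax, which is exactly what the parallelization argument in section 3 leans on.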
2.3 Visual: What Attention Actually Looks Like
Input: "The cat sat on the mat"
Token: "sat"
├─ High attention to: "cat" (subject), "mat" (location)
├─ Medium attention to: "on", "the"
└─ Low attention to: "The" (first token)
Token: "mat"
├─ High attention to: "sat" (action), "on" (relation)
├─ Medium attention to: "the" (determiner)
└─ Low attention to: "The", "cat"
Each token builds a new representation by pulling information from relevant tokens, weighted by attention scores.
2.4 Multi-Head Attention: Why One Attention Isn't Enough
Here's the non-obvious insight: different types of relationships matter simultaneously.
Consider "The chef who runs the restaurant cooked the meal"
You need to track:
Syntactic structure: "who" refers to "chef", not "restaurant"
Semantic roles: "chef" is the agent, "meal" is the object
Long-range dependencies: "cooked" connects to "chef" across 5 words
Local context: "the restaurant" is a noun phrase unit
Single attention can't capture all these patterns optimally.
Solution: Run h attention operations in parallel (typically 8-16 heads).
MultiHead(Q,K,V) = Concat(head_1, ..., head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
Each head learns different relationship patterns:
Head 1: Subject-verb relationships
Head 2: Noun-modifier pairs
Head 3: Long-range dependencies
Head 4: Positional/sequential patterns
...and so on
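Here's a rough, self-contained NumPy sketch of multi-head attention; the head count, dimensions, and random weights are arbitrary illustrative choices, not the paper's exact configuration.

import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # X: (n, d_model). Project, split into heads, attend per head, concatenate, project out.
    n, d_model = X.shape
    d_head = d_model // n_heads
    # project, then reshape to (n_heads, n, d_head)
    Q = (X @ W_q).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax within each head
    heads = weights @ V                                    # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads
    return concat @ W_o                                    # final output projection

rng = np.random.default_rng(0)
d_model, n_heads = 64, 8
X = rng.normal(size=(10, d_model))
W_q, W_k, W_v, W_o = rng.normal(size=(4, d_model, d_model)) / np.sqrt(d_model)
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads).shape)  # (10, 64)

Each head sees only a d_model / n_heads slice of the projection, so the total cost is roughly the same as one full-width attention; what you gain is several attention patterns learned in parallel.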
2.5 Positional Encoding: Teaching Order Without Recurrence
Problem: Self-attention is permutation-invariant. "Dog bites man" and "Man bites dog" produce identical attention patterns.
Solution: Inject position information directly into embeddings.
The original paper used sinusoidal encodings:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Why sinusoids? Two clever properties:
Relative positions: PE(pos+k) can be expressed as a linear function of PE(pos)
Unbounded length: Works for any sequence length, no training needed
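For concreteness, here's a short NumPy sketch of this sinusoidal encoding; the sequence length and model dimension are arbitrary illustrative values.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]                   # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                        # even dimensions
    pe[:, 1::2] = np.cos(angles)                        # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
print(pe.shape)  # (128, 64); these vectors are simply added to the token embeddings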
Modern models often use learned positional embeddings (GPT) or rotary embeddings (RoPE in LLaMA) which have better extrapolation properties.
3. Why This Architecture Won
Let me tell you what actually mattered for Transformers' success, and it's not what most people emphasize.
Parallelization: The GPU Unlock
RNN/LSTM:
Step 1: Process token 1 [GPU: 5% utilized]
Step 2: Process token 2 [GPU: 5% utilized]
Step 3: Process token 3 [GPU: 5% utilized]
...
Step 512: Process token 512 [GPU: 5% utilized]
Transformer:
Step 1: Process ALL 512 tokens simultaneously [GPU: 95% utilized]
This isn't just faster; it's 2-3 orders of magnitude faster for long sequences. This is what made GPT-3 (175B parameters) feasible to train.
Global Context: See Everything, Attend to What Matters
RNNs forced information through a bottleneck. Transformers let every token directly access every other token.
In "The trophy doesn't fit in the suitcase because it's too big":
LSTM struggles to connect "it" → "trophy" across 7 tokens
Transformer directly computes attention between "it" and both "trophy" and "suitcase"
The model learns "big" + "doesn't fit" → probably referring to the trophy, not the suitcase.
Engineering Beauty: Why Systems Engineers Love Transformers
Stateless: No hidden state to serialize/deserialize between steps
Cacheable: In autoregressive generation, previous token representations are cached (KV cache)
Analyzable: Attention weights are interpretable - you can visualize what the model "looks at"
Modular: Easy to swap encoders/decoders, add/remove layers, change attention patterns
4. The Complexity Trade-off (And Why We Accept It)
The O(n²) Elephant in the Room
Self-attention computes interactions between all pairs of tokens:
Sequence length 512: 262,144 interactions
Sequence length 2048: 4,194,304 interactions
Sequence length 8192: 67,108,864 interactions
Complexity: O(n² · d) time, O(n²) memory
For context: RNN is O(n · d²) - linear in sequence length, quadratic in dimension.
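A quick back-of-the-envelope comparison in code; d = 768 is just an assumed typical model dimension.

# rough per-layer operation counts, ignoring constant factors
d = 768                          # assumed model dimension
for n in (512, 2048, 8192):
    attention_ops = n * n * d    # O(n^2 * d): score every pair of tokens
    rnn_ops = n * d * d          # O(n * d^2): one recurrent matmul per step
    print(f"n={n:5d}  attention ~{attention_ops:.2e}  rnn ~{rnn_ops:.2e}")
# in raw operations, attention only overtakes the RNN once n grows past roughly d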
So why did we accept quadratic complexity?
Three reasons:
GPUs love matrix multiplication: O(n²) on a GPU is often faster than O(n) on a CPU
Most NLP tasks used short sequences (≤512 tokens) where n² wasn't prohibitive
The performance gain was massive - quadratic cost, 10x better accuracy
Modern Solutions
When quadratic became a problem (long documents, DNA sequences, code):
Sparse Attention (Longformer, BigBird): Only attend to local neighbors + global tokens + random samples
Reduces complexity to O(n · k) where k << n
Loses some global context
Linear Attention (Performer, Linformer):
Approximate softmax(QK^T)V with lower-rank operations
O(n) complexity
Slight accuracy drop
FlashAttention (2022): Don't change the algorithm, optimize GPU memory access patterns
Same O(n²) complexity
3x faster, 10x less memory
This is what powers 100K+ context windows today
5. Interview Deep-Dive: Questions That Matter
Q1. Why did RNNs struggle with long-term dependencies?
Surface answer: Vanishing gradients.
Deep answer: Sequential processing creates a gradient path of length n. Even with careful initialization and gating (LSTM), each step multiplies by a matrix. After 100+ steps, either:
Products converge to zero (vanishing)
Products explode (unbounded)
The gradient w.r.t. token 1 has to flow through 100+ matrix multiplications. Attention creates direct paths - gradient flows in O(1) steps regardless of distance.
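A tiny illustrative experiment (random matrices with a chosen spectral radius, not an actual RNN) showing why a chain of 100 multiplications either crushes or blows up a signal:

import numpy as np

rng = np.random.default_rng(0)
d, steps = 64, 100
for radius in (0.9, 1.1):                    # slightly "contracting" vs slightly "expanding"
    W = rng.normal(size=(d, d))
    W *= radius / np.abs(np.linalg.eigvals(W)).max()   # rescale so the spectral radius equals `radius`
    g = rng.normal(size=d)                   # stand-in for a gradient signal
    for _ in range(steps):
        g = W @ g                            # one Jacobian-like multiplication per time step
    print(radius, np.linalg.norm(g))         # shrinks toward zero for 0.9, grows by orders of magnitude for 1.1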
Q2. What's the intuition behind Q, K, V?
Analogy: Search engine.
Query (Q): Your search terms - what you're looking for
Key (K): Document titles/metadata - what each document is about
Value (V): Document content - the actual information you retrieve
You compute relevance (Q·K), rank results (softmax), and retrieve content (weighted V).
Every token is simultaneously searching and being searched.
Q3. Why divide by √d_k in scaled dot-product attention?
Surface answer: To prevent large dot products.
The real reason: Dot product magnitude grows with dimensionality.
If Q and K are unit-variance, Q·K has variance d_k. For d_k = 512, typical dot products are in the range [-50, 50]. After softmax, you get extreme distributions: (0.00001, 0.99998, 0.00001)
This creates two problems:
Saturation: Softmax derivatives → 0, killing gradients
Instability: Small input changes cause massive output swings
Dividing by √d_k normalizes the variance back to 1, keeping softmax in the "soft" regime where gradients are healthy.
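You can sanity-check this with a toy simulation; the vector dimension and sample count are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(10000, d_k))              # unit-variance components
k = rng.normal(size=(10000, d_k))
dots = (q * k).sum(axis=1)                     # 10000 raw dot products
print(dots.var())                              # ~512, i.e. roughly d_k
print((dots / np.sqrt(d_k)).var())             # ~1 after scaling

scores = rng.normal(size=5) * np.sqrt(d_k)     # logits at unscaled magnitude
p = np.exp(scores - scores.max()); p /= p.sum()
print(np.round(p, 5))                          # typically close to one-hot: softmax has saturated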
Q4. How do Transformers enable parallel computation?
Key insight: Attention is a three-matrix multiplication problem.
Attention = softmax(QK^T / √d_k) · V
QK^T: (n × d) · (d × n) → (n × n) attention matrix
softmax: applied row by row, fully parallelizable
Attention · V: (n × n) · (n × d) → (n × d) output
All token interactions computed in one batched operation. RNNs required n sequential steps.
Modern GPUs do matrix multiplication at 200+ TFLOPS. Transformers exploit this perfectly.
Q5. What's the difference between encoder-only and decoder-only Transformers?
Encoder-only (BERT):
Bidirectional attention - each token sees past AND future
Good for: classification, NER, Q&A (understanding tasks)
Training: Masked language modeling (predict random masked tokens)
Decoder-only (GPT):
Causal attention - token i can only see tokens 1...i (via attention mask)
Good for: text generation, completion (generative tasks)
Training: Next token prediction (autoregressive language modeling)
Encoder-Decoder (T5, BART):
Encoder: bidirectional on input
Decoder: causal, cross-attends to encoder outputs
Good for: translation, summarization (seq2seq tasks)
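Mechanically, the only difference between bidirectional and causal attention is a mask added to the scores before the softmax. A minimal NumPy sketch (names and shapes are illustrative):

import numpy as np

def attention(Q, K, V, causal=False):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    if causal:
        # forbid token i from attending to tokens j > i
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -1e9, scores)   # effectively -inf
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 4, 8))
_, w = attention(Q, K, V, causal=True)
print(np.round(w, 2))   # lower-triangular: row i puts zero weight beyond column i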
Q6. What's the main bottleneck of Transformers?
Training: compute (O(n² · d) attention + O(n · d²) FFN).
Inference: memory for the KV cache.
At inference, we cache K and V for all previous tokens. For 8K context, 32 layers, d=4096, that's on the order of 2-4 GB per request, depending on precision. This is why "context length" is expensive - it's mostly a memory problem.
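The arithmetic behind that estimate, as a rough sketch (assuming fp16 storage and no grouped-query attention; real deployments often shrink this with quantization or shared KV heads):

n_ctx, n_layers, d_model = 8192, 32, 4096
bytes_per_value = 2                                     # fp16
kv_tensors = 2                                          # one K and one V per layer
cache_bytes = n_ctx * n_layers * d_model * kv_tensors * bytes_per_value
print(cache_bytes / 1e9, "GB per sequence")             # ~4.3 GB at fp16, ~2 GB at 8 bits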
Q7. Why do we need positional encoding?
Self-attention is a set operation - order-invariant.
Without positional info:
"Dog bites man" = "Man bites dog"
"Not bad" = "Bad not"
Positional encoding adds order signal directly to embeddings, so the model can learn position-dependent patterns.
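You can see the order-invariance directly: feed a self-attention layer the same tokens in a different order (with no positional encoding) and you get exactly the same output vectors, just reordered. A toy check, where X is a random stand-in for token embeddings:

import numpy as np

def self_attention(X):
    # use X as Q, K and V for simplicity; no positional information anywhere
    d = X.shape[-1]
    s = X @ X.T / np.sqrt(d)
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                     # 6 "tokens"
perm = rng.permutation(6)
out, out_perm = self_attention(X), self_attention(X[perm])
print(np.allclose(out[perm], out_perm))          # True: shuffling inputs just shuffles outputs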
Why not just use token position as a feature? Because:
Absolute position isn't what matters - "third word" means nothing on its own
Relative position matters more - the distance and direction between tokens
Sinusoidal encoding captures relative position implicitly via phase relationships
Q8. How do you handle sequences longer than training length?
Problem: Train on 512 tokens, inference on 2048 tokens.
Solutions:
Sinusoidal PE: Extrapolates naturally (original Transformer)
Learned PE: Interpolate embeddings (okay but degraded)
ALiBi: Bias attention by relative distance (no explicit encoding)
RoPE: Rotate Q,K based on position (used in LLaMA, best extrapolation)
Modern long-context models (32K, 100K+) use RoPE + careful finetuning on longer sequences.
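A simplified sketch of the RoPE idea: rotate each (even, odd) pair of query/key dimensions by an angle proportional to the token's position, so that dot products depend only on relative offsets. This is a toy version with assumed shapes, not LLaMA's exact implementation.

import numpy as np

def rope(x, positions, base=10000.0):
    # x: (n, d) with d even; rotate each (even, odd) dimension pair by positions * theta_i
    n, d = x.shape
    theta = base ** (-np.arange(0, d, 2) / d)        # (d/2,) per-pair frequencies
    ang = positions[:, None] * theta[None, :]        # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, 8))
# the attention score between rotated q and k depends only on their relative offset:
a = rope(q[None, :], np.array([3])) @ rope(k[None, :], np.array([7])).T
b = rope(q[None, :], np.array([103])) @ rope(k[None, :], np.array([107])).T
print(np.allclose(a, b))   # True: offsets (3, 7) and (103, 107) give the same score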
The Bigger Picture
Transformers didn't just improve NLP - they unified sequence modeling across domains.
Same architecture, different data:
Text → GPT, BERT, T5
Images → Vision Transformer (ViT)
Audio → Whisper, AudioLM
Video → VideoGPT, Phenaki
Molecules → AlphaFold (protein structures)
Code → Codex, GitHub Copilot
Multimodal → CLIP, Flamingo, GPT-4
The insight: Everything can be tokenized into sequences. And attention is a universal way to model relationships.
References & Further Reading
Here are some high-quality papers, articles, and visual guides to explore if you want to go deeper:
Foundational Papers
Vaswani et al. (2017), "Attention Is All You Need", NeurIPS 2017
Bahdanau et al. (2014), "Neural Machine Translation by Jointly Learning to Align and Translate"
Hochreiter & Schmidhuber (1997), "Long Short-Term Memory"
https://www.bioinf.jku.at/publications/older/2604.pdf
Technical Deep Dives
Videos & Talks
Yannic Kilcher, "Attention Is All You Need - Paper Explained" (YouTube)
Andrej Karpathy, "Let's build GPT from scratch" (YouTube, 2023)
DeepLearning.AI, "Transformers Explained" short course by Andrew Ng
What's Next?
This post covered why Transformers emerged and what makes them tick.
Next in the series:
Post 2: Deep dive into attention mechanisms - visualizing heads, understanding learned patterns
Post 3: Scaling laws and emergent abilities - why bigger models suddenly get qualitatively smarter
Post 4: From Transformers to LLMs - training objectives, instruction tuning, RLHF
Question for you: What was the "aha!" moment that made Transformers click for you? Drop a comment. I read every one.
If you found this valuable, share it with someone learning ML. This series is my attempt to document everything I wish I knew when I started building with Transformers.



