AI/ML
Understanding Transformer Architecture from Scratch
A visual walkthrough of attention mechanisms, positional encoding, and why transformers changed everything
Why Transformers Matter
The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), fundamentally changed how we think about sequence modeling. Before transformers, RNNs and LSTMs dominated — sequential by nature, slow to train, and prone to forgetting long-range dependencies.
Transformers replaced all of that with a single, elegant idea: self-attention.
The Core Intuition
Imagine reading a sentence:
"The animal didn't cross the street because it was too tired."
When you read "it", your brain instantly knows it refers to "animal", not "street". Self-attention gives neural networks this same ability — every token can attend to every other token, weighted by relevance.
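This relevance weighting can be sketched with made-up numbers: hypothetical similarity scores between "it" and every other token, pushed through a softmax to turn them into weights that sum to 1. The scores below are illustrative, not from a trained model:

```python
import torch
import torch.nn.functional as F

tokens = ["The", "animal", "didn't", "cross", "the", "street",
          "because", "it", "was", "too", "tired"]
# Hypothetical similarity scores between "it" and each token
scores = torch.tensor([0.1, 4.0, 0.2, 0.3, 0.1, 1.5, 0.2, 0.5, 0.1, 0.1, 2.0])

weights = F.softmax(scores, dim=-1)  # normalize scores into attention weights
for tok, w in zip(tokens, weights.tolist()):
    print(f"{tok:8s} {w:.3f}")
```

"animal" ends up with the largest weight, matching the intuition that "it" refers back to the animal.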
Architecture Overview
graph TB
    A[Input Embedding] --> B[Positional Encoding]
    B --> C[Multi-Head Attention]
    C --> D[Add & Norm]
    D --> E[Feed Forward]
    E --> F[Add & Norm]
    F --> G[Output]
    subgraph "Encoder Block (×N)"
        C
        D
        E
        F
    end

Self-Attention: Step by Step
For each token, we compute three vectors:
- Query (Q): "What am I looking for?"
- Key (K): "What do I contain?"
- Value (V): "What information do I provide?"
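In practice, Q, K, and V are just three learned linear projections of the same token embeddings. A minimal sketch (d_model = 8 and the sequence length are arbitrary toy values):

```python
import torch

d_model = 8                      # toy embedding size
x = torch.randn(1, 5, d_model)   # batch of 1, sequence of 5 token embeddings

# One learned projection per role, all reading the same embeddings
W_q = torch.nn.Linear(d_model, d_model)
W_k = torch.nn.Linear(d_model, d_model)
W_v = torch.nn.Linear(d_model, d_model)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # each is torch.Size([1, 5, 8])
```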
The attention score between tokens is:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V

Here d_k is the dimensionality of the key vectors; dividing by sqrt(d_k) keeps the dot products from growing so large that the softmax saturates and gradients vanish.

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Scaled dot products: (..., seq_q, d_k) x (..., d_k, seq_k) -> (..., seq_q, seq_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / d_k ** 0.5
    if mask is not None:
        # Masked positions get -inf so they receive zero attention weight
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

Multi-Head Attention
Instead of computing attention once, we compute it h times in parallel with different learned projections. This lets the model attend to information from different representation subspaces:
class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        # Linear projections, then split into heads: (batch, n_heads, seq, d_k)
        Q = self.W_q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        # Attention runs over every head in parallel
        out, weights = scaled_dot_product_attention(Q, K, V, mask)
        # Concatenate heads back to (batch, seq, d_model)
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        return self.W_o(out)

Why This Matters for Practitioners
If you're building with LLMs today — fine-tuning, RAG pipelines, or agent architectures — understanding transformers isn't optional. It's the difference between:
- Blindly using an API and having no idea why your outputs are inconsistent
- Understanding context windows, embedding dimensions, and why chunking strategies matter for retrieval
The architecture is the foundation. Everything else — GPT, BERT, LLaMA, Claude — is a variation on this theme.
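One low-cost way to build that intuition is to poke at the dimensions yourself. PyTorch ships its own torch.nn.MultiheadAttention module, equivalent in spirit to the class above; a quick sketch of inspecting output shapes and attention weights (the sizes are arbitrary examples):

```python
import torch

d_model, n_heads, seq_len = 64, 8, 10
mha = torch.nn.MultiheadAttention(d_model, n_heads, batch_first=True)

x = torch.randn(2, seq_len, d_model)  # batch of 2 sequences
out, attn = mha(x, x, x)              # self-attention: Q = K = V = x

print(out.shape)   # torch.Size([2, 10, 64]) -- same shape in, same shape out
print(attn.shape)  # torch.Size([2, 10, 10]) -- one weight per token pair
```

Note that the output has the same shape as the input, which is what lets encoder blocks stack ×N times.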
Next in this series: Positional encoding deep-dive, and why RoPE changed the game for long-context models.
Ayoush Chourasia · 2026-04-08