
AI/ML

Understanding Transformer Architecture from Scratch

A visual walkthrough of attention mechanisms, positional encoding, and why transformers changed everything

2026-04-08
transformers · attention · deep-learning · architecture

Why Transformers Matter

The transformer architecture, introduced in "Attention Is All You Need" (Vaswani et al., 2017), fundamentally changed how we think about sequence modeling. Before transformers, RNNs and LSTMs dominated — sequential by nature, slow to train, and prone to forgetting long-range dependencies.

Transformers replaced all of that with a single, elegant idea: self-attention.

The Core Intuition

Imagine reading a sentence:

"The animal didn't cross the street because it was too tired."

When you read "it", your brain instantly knows it refers to "animal", not "street". Self-attention gives neural networks this same ability — every token can attend to every other token, weighted by relevance.

Architecture Overview

graph TB
    A[Input Embedding] --> B[Positional Encoding]
    B --> C[Multi-Head Attention]
    C --> D[Add & Norm]
    D --> E[Feed Forward]
    E --> F[Add & Norm]
    F --> G[Output]
    
    subgraph "Encoder Block (×N)"
        C
        D
        E
        F
    end

Self-Attention: Step by Step

For each token, we compute three vectors:

  • Query (Q): "What am I looking for?"
  • Key (K): "What do I contain?"
  • Value (V): "What information do I provide?"
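Each of these vectors is produced by multiplying the token's embedding with a learned weight matrix. A minimal sketch of just the projection step (toy dimensions, no batching, weights chosen at random for illustration):

```python
import torch

torch.manual_seed(0)
d_model = 8   # embedding dimension (toy size)
seq_len = 4   # four tokens

x = torch.randn(seq_len, d_model)  # token embeddings

# One learned linear projection per role
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)

Q, K, V = W_q(x), W_k(x), W_v(x)
print(Q.shape, K.shape, V.shape)  # each is (4, 8)
```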

The attention score between tokens is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) × V

In PyTorch:

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Scale by sqrt(d_k) so dot products don't grow with dimension
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    
    if mask is not None:
        # Masked positions get -inf so softmax assigns them zero weight
        scores = scores.masked_fill(mask == 0, float('-inf'))
    
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights
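A quick sanity check of this function (restated compactly here so the snippet runs standalone): the attention weights for each query should sum to 1, and a causal mask should zero out attention to future positions.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, V), weights

torch.manual_seed(0)
Q = K = V = torch.randn(1, 4, 8)  # batch of 1, 4 tokens, d_k = 8

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)      # (1, 4, 8): same shape as V
print(w.sum(dim=-1))  # each query's weights sum to 1

# Causal mask: token i may only attend to tokens j <= i
mask = torch.tril(torch.ones(4, 4))
_, w_causal = scaled_dot_product_attention(Q, K, V, mask)
print(w_causal[0, 0])  # token 0 attends only to itself
```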

Multi-Head Attention

Instead of computing attention once, we compute it h times in parallel with different learned projections. This lets the model attend to information from different representation subspaces:

class MultiHeadAttention(torch.nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        
        self.W_q = torch.nn.Linear(d_model, d_model)
        self.W_k = torch.nn.Linear(d_model, d_model)
        self.W_v = torch.nn.Linear(d_model, d_model)
        self.W_o = torch.nn.Linear(d_model, d_model)
    
    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)
        
        # Linear projections and reshape
        Q = self.W_q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        K = self.W_k(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        V = self.W_v(V).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)
        
        # Attention
        out, weights = scaled_dot_product_attention(Q, K, V, mask)
        
        # Concatenate heads
        out = out.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_k)
        return self.W_o(out)
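PyTorch ships an equivalent built-in, torch.nn.MultiheadAttention, which is handy for checking that the shapes above line up. A quick shape check of self-attention (Q = K = V), with arbitrary toy dimensions:

```python
import torch

torch.manual_seed(0)
mha = torch.nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 64)   # (batch, seq_len, d_model)
out, weights = mha(x, x, x)  # self-attention: Q = K = V = x

print(out.shape)      # torch.Size([2, 10, 64])
print(weights.shape)  # torch.Size([2, 10, 10]), averaged over heads by default
```

Note that the output has the same shape as the input, which is what lets you stack N encoder blocks back to back.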

Why This Matters for Practitioners

If you're building with LLMs today — fine-tuning, RAG pipelines, or agent architectures — understanding transformers isn't optional. It's the difference between:

  • Blindly using an API and having no idea why your outputs are inconsistent
  • Understanding context windows, embedding dimensions, and why chunking strategies matter for retrieval

The architecture is the foundation. Everything else — GPT, BERT, LLaMA, Claude — is a variation on this theme.


Next in this series: Positional encoding deep-dive, and why RoPE changed the game for long-context models.


Ayoush Chourasia · 2026-04-08