Attention Is All You Need: The Transformer

Paper Overview

Title: Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (Google Brain / Google Research / University of Toronto)
Published: NeurIPS 2017
Context: Proposed for machine translation; became the substrate of essentially all modern AI — and therefore of two sections of this fieldbook

TL;DR

The Transformer removed recurrence from sequence models and replaced it with self-attention: every token computes its representation by directly attending to every other token, in parallel. That one move traded an O(n) sequential dependency for an O(n²) parallelizable computation, which is exactly the trade GPUs wanted, and unlocked the scaling era: bigger models, bigger data, predictable returns (scaling laws). For a systems audience, the paper matters because its architecture is today's workload: the n² attention term and the KV cache that grows with every token dictate why LLM decoding is memory-bandwidth-bound, why prefill and decode are different regimes, and why context windows cost what they cost. Nearly a decade of systems work (FlashAttention, PagedAttention, MQA/GQA) is essentially a campaign against those two cost terms, with MoE fighting the parallel battle against the cost of the feed-forward layers.

What It Replaced, and Why That Mattered to Hardware

Pre-2017 sequence models (RNNs/LSTMs) processed tokens one at a time — step t needs step t−1's hidden state. That serial chain caps GPU utilization regardless of model size, and information from distant tokens must survive a long chain of state updates (vanishing context). The Transformer's bet:

Property	RNN/LSTM	Transformer
Long-range dependency path	O(n) steps of state decay	O(1): direct attention edge
Training parallelism over sequence	None (sequential)	Full (every position at once)
Cost per layer (n = sequence length, d = model width)	O(n·d²)	O(n²·d) attention + O(n·d²) FFN
Hardware fit	Poor (dependent ops)	Excellent (dense matmuls)

The paper's deepest insight is hardware-shaped: an asymptotically worse sequence cost (n²) won because it converts into dense matrix multiplications that saturate accelerators. Architecture–hardware co-fit beats FLOP counting — a lesson that has repeated through the whole accelerator era.
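The cost row of the comparison can be made concrete with a back-of-the-envelope sketch. The functions below are illustrative helpers, not anything from the paper: they keep only the leading terms (n·d² for a recurrent layer, n²·d for the attention mixing step), drop all constant factors, and assume a model width of d = 4096 purely for illustration.

```python
def recurrent_layer_cost(n: int, d: int) -> int:
    """Leading-order cost of one recurrent layer: n steps, each a d x d matmul."""
    return n * d * d


def attention_mixing_cost(n: int, d: int) -> int:
    """Leading-order cost of self-attention mixing: n x n scores over d dimensions."""
    return n * n * d


def serial_steps(n: int, recurrent: bool) -> int:
    """How many operations must run one after another along the sequence."""
    return n if recurrent else 1


D_MODEL = 4096  # assumed model width, for illustration only

for n in (512, 4096, 32768):
    ratio = attention_mixing_cost(n, D_MODEL) / recurrent_layer_cost(n, D_MODEL)
    print(
        f"n={n:>6}: attention/recurrent cost ratio = {ratio:.3f}, "
        f"serial steps = {serial_steps(n, True)} (RNN) vs {serial_steps(n, False)} (attention)"
    )
```

Under these assumptions the ratio is simply n/d: attention is cheaper per layer while the sequence is shorter than the model width and more expensive beyond it. The serial-step count is what the hardware cares about, since a recurrent layer has to run n dependent steps where attention runs one.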

The Mechanism in One Pass

Each token is projected into a query (what am I looking for?), a key (what do I contain?), and a value (what do I contribute?). The attention weights are a softmax over the query·key dot products, scaled by 1/√d_k so the softmax does not saturate as the key dimension grows; the output is the correspondingly weighted mix of values. Multiple heads run this in parallel in separate subspaces (one head might track syntax, another coreference); residual connections and layer normalization keep stacks of 100+ layers trainable. The original was an encoder-decoder for translation; the lineage that conquered everything is the decoder-only variant (GPT-style): predict the next token with causally masked attention, which makes every position a training example and the objective fully self-supervised.
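The mechanism can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the paper's implementation: it assumes the query, key, and value vectors have already been produced by the learned projections, uses lists instead of tensors, and applies the causal mask of the decoder-only variant by only scoring positions 0..i for token i. The function names and the toy inputs are made up for the example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k: lists of n vectors of length d_k; v: list of n vectors of length d_v.
    Returns a list of n output vectors of length d_v.
    """
    d_k = len(q[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for i, qi in enumerate(q):
        # Causal mask: position i only scores keys at positions 0..i.
        scores = [scale * sum(a * b for a, b in zip(qi, k[j])) for j in range(i + 1)]
        weights = softmax(scores)
        # Output is the attention-weighted mix of the visible value vectors.
        mixed = [sum(w * v[j][c] for j, w in enumerate(weights)) for c in range(len(v[0]))]
        out.append(mixed)
    return out


# Toy inputs (already "projected"); three tokens, two dimensions.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

out = causal_attention(q, k, v)
for i, row in enumerate(out):
    print(i, [round(x, 3) for x in row])
```

The first token can only attend to itself, so its output equals its own value vector. A real implementation computes all positions at once as dense matrix products, which is what makes the operation accelerator-friendly; the explicit loops here are only for readability.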

Why This Paper Is a Systems Paper in 2026

Every operational property of LLM infrastructure traces to the architecture:

Prefill vs decode asymmetry. Processing the prompt is one big parallel matmul pass (compute-bound); generating runs the whole stack once per output token (memory-bandwidth-bound, sequential again — autoregression reintroduced the serial chain, but only at inference). This is the two-regime split that disaggregated serving exists to exploit.
The KV cache is the paper's data structure made operational. Causal attention lets each generated token reuse all previous keys/values; caching them avoids quadratic re-computation but costs 2 × layers × KV heads × head dimension × seq_len elements per sequence (multiply by bytes per element for the footprint). That is the memory object that PagedAttention virtualizes, prefix caching shares, MQA/GQA shrink (fewer K/V heads), and context-management budgets exist to contain.
Context length pricing is the n² + cache term. Long-context pricing and prompt-caching discounts are downstream of how attention cost and memory scale with sequence length; quality effects such as "lost in the middle" are a separate reminder that a long window is not used uniformly. FlashAttention's contribution was IO-aware exact attention (tiling to keep the n² intermediate out of HBM): a systems fix, not a model change.
Scaling laws made capacity planning possible. Because the architecture scales smoothly, loss vs (params, data, compute) became predictable (Kaplan et al.'s scaling laws, then Chinchilla's compute-optimal correction) — turning model training into an engineering discipline with budgets, and inference fleets into unit-economics problems.
The FFN is where most of the parameters live, which is why Mixture-of-Experts (revived in sparsely gated form by Shazeer's other 2017 paper) sparsifies exactly that block; modern MoE serving (expert parallelism, all-to-all routing) is the sibling of attention's cost battle.
Even the agentic stack inherits its shape: tokens-in/tokens-out autoregression is why harness engineering obsesses over context budgets, append-only prompts, and cache-friendly prefixes.
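The KV cache term is easy to put numbers on. The sketch below is a rough sizing helper, not a formula taken from any serving engine: it counts keys plus values for every layer, KV head, and cached position, and assumes 16-bit elements. The model shape (32 layers, 32 query heads, head dimension 128, 8192-token context) is hypothetical and chosen only to show how MQA/GQA change the footprint.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of KV cache for one sequence.

    The factor of 2 covers keys and values; every layer, KV head, and cached
    position stores one head_dim-sized vector of each.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical model shape, for illustration only.
LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 8192

variants = (
    ("MHA (32 KV heads)", QUERY_HEADS),  # one K/V head per query head
    ("GQA (8 KV heads)", 8),             # groups of query heads share a K/V head
    ("MQA (1 KV head)", 1),              # all query heads share one K/V head
)

for name, kv_heads in variants:
    gib = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN) / 2**30
    print(f"{name:>18}: {gib:.3f} GiB per sequence")
```

With these assumed numbers, full multi-head attention needs 4 GiB of cache per sequence, the 8-head GQA variant 1 GiB, and MQA 0.125 GiB. The footprint is linear in both sequence length and the number of concurrent sequences, which is why it dominates serving memory budgets.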

Influence on System Design

One architecture, every modality: language (GPT/Claude/Gemini lineages), vision (ViT), audio, code, protein folding — the consolidation onto a single workload is why an entire hardware-software stack (accelerators, serving engines, attention kernels) could co-evolve around it.
It created the workload class this book's LLM Systems section covers — the first new first-class datacenter workload since web serving and MapReduce-style analytics, with its own storage hierarchy (HBM/KV/prefix caches), schedulers (continuous batching), and failure modes.
The bitter-lesson vindication: general architecture + scale + data beat task-specific cleverness; the paper is the strongest single data point for designing systems that ride compute curves rather than fight them.
Eight authors and one of the highest citation counts of any paper this century; the most consequential sentence remains the title's claim that the previously auxiliary mechanism was, alone, enough.

References

Attention Is All You Need (NeurIPS 2017)
The Illustrated Transformer — Jay Alammar; the canonical visual walkthrough
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, and Efficient Memory Management for Large Language Model Serving with PagedAttention: the systems campaign against the n² and KV terms
Scaling Laws for Neural Language Models / Training Compute-Optimal Large Language Models (Chinchilla)
LLM Infrastructure — where this paper's cost terms become your pager
