*Published on SynaiTech Blog | Category: AI Technical Deep-Dive*
Introduction
GPT—Generative Pre-trained Transformer—has become synonymous with the AI revolution. From ChatGPT’s viral launch to GPT-4’s multimodal capabilities, this architecture has defined a new era in artificial intelligence. Yet despite its ubiquity, most discussions of GPT remain superficial, treating it as a black box that magically produces human-like text.
This comprehensive technical deep-dive explains how GPT actually works—from the fundamental mathematics of attention to the engineering that enables models with hundreds of billions of parameters. Whether you’re a machine learning engineer seeking deeper understanding, a developer building on GPT APIs, or a technical leader making AI decisions, this article provides the detailed understanding necessary for informed work in the age of large language models.
GPT in Context
The Transformer Lineage
GPT builds on the Transformer architecture introduced in 2017:
Key Predecessors:
- Attention mechanisms in sequence-to-sequence models
- The original “Attention Is All You Need” paper
- BERT (bidirectional encoder representations)
- Various encoder-decoder architectures
GPT’s Distinctive Approach:
- Decoder-only architecture (no encoder)
- Unidirectional (causal) attention
- Next-token prediction objective
- Large-scale pretraining
Evolution:
- GPT-1 (2018): 117M parameters, proof of concept
- GPT-2 (2019): 1.5B parameters, impressive generation
- GPT-3 (2020): 175B parameters, few-shot learning
- GPT-4 (2023): Multimodal, significant capability jump
- GPT-4o/4.5 (2024-2025): Continued improvements
Core Design Principles
Autoregressive Generation:
GPT predicts the next token given all previous tokens:
```
P(x_1, x_2, ..., x_n) = ∏ P(x_i | x_1, ..., x_{i-1})
```
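This factorization can be sanity-checked on made-up numbers: the joint probability of a sequence is just the product of its next-token conditionals.

```python
# Toy check of the autoregressive factorization. The conditional
# probabilities below are invented for illustration.
conditionals = [0.5, 0.4, 0.9]  # P(x1), P(x2|x1), P(x3|x1,x2)

joint = 1.0
for p in conditionals:
    joint *= p  # chain rule: multiply each next-token conditional

print(joint)  # product of the three conditionals, 0.18
```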
Unidirectional Attention:
Each position can only attend to earlier positions—it cannot "see" future tokens. This enables generation and prevents information leakage.
Scale as Capability:
GPT's power comes largely from scale: more parameters, more data, more compute. This "scaling hypothesis" has been validated repeatedly.
Tokenization: From Text to Numbers
Why Tokenization Matters
Neural networks operate on numbers, not text. Tokenization bridges this gap.
Challenges:
- Vocabulary size tradeoffs (a larger vocabulary represents text more compactly but means a larger embedding table and output softmax)
- Out-of-vocabulary handling
- Multilingual support
- Efficiency for generation
Byte Pair Encoding (BPE)
GPT uses BPE tokenization:
Algorithm:
- Start with character-level vocabulary
- Count all adjacent pairs in training corpus
- Merge most frequent pair into new token
- Repeat until desired vocabulary size
Example:
```
Vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't']
After merging 'l'+'o':  ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo']
After merging 'lo'+'w': ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo', 'low']
Continue until the target vocabulary size is reached...
```
Properties:
- Common words become single tokens: "the" → one token
- Rare words split into subwords: "tokenization" → [token, ization]
- Any text can be encoded (no OOV)
- Typical vocabulary: 50,000-100,000 tokens
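The merge loop described above can be sketched in a few lines of Python. This is a toy trainer on a made-up word-frequency corpus, not a production tokenizer:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a tiny corpus.
    `words` maps a word (as a tuple of symbols) to its frequency."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a new token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
print(bpe_merges(corpus, 3))  # first learned merge is ('w', 'e') for this corpus
```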
Tokenization in Practice
Example (GPT-2-family tokenizer; exact IDs differ across tokenizer versions):
```
"Hello, world!" → [15496, 11, 995, 0]
"Tokenization is fascinating" → [10389, 2065, 338, 27594, 310]
```
Considerations:
- Tokenization affects context length (tokens ≠ words)
- Some languages tokenize more efficiently
- Numbers and code have specific tokenization patterns
- Whitespace handling varies
The Embedding Layer
Token Embeddings
Each token maps to a learned vector:
```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len)
        return self.embedding(token_ids)  # (batch_size, seq_len, d_model)
```
Dimensionality:
- GPT-3: d_model = 12,288
- GPT-4: Estimated d_model = 16,384 or higher
- Higher dimensions capture more nuance
Positional Encoding
Transformers have no inherent notion of order. Position must be added.
Learned Position Embeddings (GPT-2/3):
```python
class PositionEmbedding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super().__init__()
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, seq_len):
        positions = torch.arange(seq_len)
        return self.position_embedding(positions)
```
Combined Input:
```python
def prepare_input(token_ids):
    token_emb = token_embedding(token_ids)
    pos_emb = position_embedding(token_ids.shape[1])
    return token_emb + pos_emb
```
Modern Alternatives:
- RoPE (Rotary Position Embeddings): Used in newer models
- ALiBi (Attention with Linear Biases): Alternative approach
- These enable better length generalization
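For intuition, here is a minimal sketch of the RoPE idea: each consecutive pair of dimensions in a query or key vector is rotated by an angle proportional to its position, so dot products between rotated vectors depend on relative position. The base frequency follows the common convention; this is an illustration, not a production implementation:

```python
import torch

def rope(x, base=10000.0):
    """Minimal RoPE sketch. x: (batch, seq, d) with even d."""
    b, seq, d = x.shape
    half = d // 2
    pos = torch.arange(seq, dtype=torch.float32).unsqueeze(-1)         # (seq, 1)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos * freqs                                               # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    # Rotate each (even, odd) dimension pair by its position-dependent angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, vector norms are preserved, and position 0 (angle zero) is left unchanged.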
Self-Attention: The Core Mechanism
Intuition for Attention
Attention answers: "When processing this token, which other tokens are relevant?"
Consider: "The cat sat on the mat because it was tired."
When processing "it":
- Should attend strongly to "cat" (what "it" refers to)
- Might attend to "sat" (what action "it" did)
- Should attend weakly to "mat" (not the referent)
Attention learns these patterns from data.
Query, Key, Value Framework
Each token has three representations:
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I pass forward?
```python
class AttentionQKV(nn.Module):
    def __init__(self, d_model, d_k):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_k)
        self.W_k = nn.Linear(d_model, d_k)
        self.W_v = nn.Linear(d_model, d_k)

    def forward(self, x):
        Q = self.W_q(x)  # (batch, seq, d_k)
        K = self.W_k(x)  # (batch, seq, d_k)
        V = self.W_v(x)  # (batch, seq, d_k)
        return Q, K, V
```
Scaled Dot-Product Attention
The attention calculation:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```
Step by step:
- Compute attention scores:
```python
scores = Q @ K.transpose(-2, -1)  # (batch, seq, seq)
```
Each entry (i, j) measures how much position i should attend to position j.
- Scale:
```python
scores = scores / math.sqrt(d_k)
```
Prevents softmax saturation for large d_k.
- Apply causal mask (GPT-specific):
```python
causal_mask = torch.triu(torch.ones(seq, seq), diagonal=1).bool()
scores.masked_fill_(causal_mask, float('-inf'))
```
Positions cannot attend to future tokens.
- Softmax:
```python
attention_weights = F.softmax(scores, dim=-1)  # (batch, seq, seq)
```
Each row sums to 1—a probability distribution over positions.
- Weighted sum:
```python
output = attention_weights @ V  # (batch, seq, d_k)
```
Each position becomes a weighted combination of value vectors.
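Putting the five steps together, a complete single-head causal attention function might look like this sketch:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    """All five steps in one function. Q, K, V: (batch, seq, d_k)."""
    d_k = Q.size(-1)
    seq = Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # steps 1-2: score, scale
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))        # step 3: causal mask
    weights = F.softmax(scores, dim=-1)                     # step 4: rows sum to 1
    return weights @ V                                      # step 5: weighted sum
```

A useful check: the first position can only attend to itself, so its output is exactly its own value vector.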
Multi-Head Attention
Single attention may be limiting. Multiple "heads" allow different attention patterns:
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape
        # Project to Q, K, V
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        # Reshape to (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Attention scores
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask, float('-inf'))
        attention = F.softmax(scores, dim=-1)
        # Apply to values
        context = attention @ V  # (batch, heads, seq, d_k)
        # Concatenate heads
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        # Final projection
        return self.W_o(context)
```
Typical Configurations:
- GPT-3: 96 heads, d_k = 128
- Each head can learn different patterns (syntax, semantics, coreference)
Feed-Forward Networks
Purpose
After attention aggregates information across positions, the feed-forward network (FFN) processes each position independently:
```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.linear1(x)
        x = F.gelu(x)  # GPT uses the GELU activation
        x = self.dropout(x)
        x = self.linear2(x)
        return x
```
Expansion Factor
The hidden dimension d_ff is typically 4× the model dimension:
- GPT-3: d_model = 12,288, d_ff = 49,152
- This expansion provides representational capacity
FFN as Memory
Research suggests FFN layers store factual knowledge:
- First layer acts as "key" matcher
- Second layer retrieves associated "value"
- Analogous to key-value memory
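A toy construction makes the analogy concrete. Here we hand-build an FFN in which one hidden unit's input weights act as a "key" pattern and its output weights as the stored "value". This is an interpretive lens from interpretability research, not a literal description of trained weights:

```python
import torch

d_model, d_ff = 4, 3
W1 = torch.zeros(d_ff, d_model)  # rows act as "keys"
W2 = torch.zeros(d_model, d_ff)  # columns act as "values"
key = torch.tensor([1.0, 0.0, 0.0, 0.0])    # pattern hidden unit 0 detects
value = torch.tensor([0.0, 0.0, 2.0, 0.0])  # vector hidden unit 0 writes out
W1[0] = key
W2[:, 0] = value

def ffn(x):
    h = torch.relu(x @ W1.T)  # which "keys" match the input?
    return h @ W2.T           # retrieve the matching "values"

print(ffn(key))   # input matching the key retrieves the stored value
print(ffn(-key))  # non-matching input retrieves nothing (all zeros)
```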
The Complete Transformer Block
Layer Normalization
GPT uses layer normalization to stabilize training:
```python
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```
Residual Connections
Skip connections add input to output of each sublayer:
```python
output = layer(x) + x  # Residual connection
```
Benefits:
- Enables training of very deep networks
- Provides gradient highways
- Allows layers to learn residual functions
Pre-Norm vs Post-Norm
Post-Norm (Original Transformer):
```python
x = self.norm(x + self.attention(x))
x = self.norm(x + self.ffn(x))
```
Pre-Norm (GPT-3 and most modern LLMs):
```python
x = x + self.attention(self.norm(x))
x = x + self.ffn(self.norm(x))
```
Pre-norm is more stable for very deep networks.
Complete Block
```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm self-attention
        attn_out = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_out)
        # Pre-norm feed-forward
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_out)
        return x
```
Stacking It All Together
The Full GPT Model
```python
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers,
                 max_seq_len, d_ff, dropout=0.1):
        super().__init__()
        # Embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Final layer norm
        self.final_norm = LayerNorm(d_model)
        # Output projection (often tied to token embedding)
        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)
        self.output_projection.weight = self.token_embedding.weight  # Weight tying

    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        # Embeddings
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.token_embedding(token_ids) + self.position_embedding(positions)
        x = self.dropout(x)
        # Causal mask
        mask = torch.triu(torch.ones(seq_len, seq_len, device=token_ids.device),
                          diagonal=1).bool()
        # Through all transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Final norm and output
        x = self.final_norm(x)
        logits = self.output_projection(x)  # (batch, seq_len, vocab_size)
        return logits
```
Model Sizes
GPT-3 Configuration:
- Parameters: 175 billion
- d_model: 12,288
- num_heads: 96
- num_layers: 96
- d_ff: 49,152
- max_seq_len: 2,048
- vocab_size: 50,257
GPT-4 (Estimated):
- Parameters: ~1.8 trillion (rumored)
- Mixture of Experts architecture
- Longer context window
- Multimodal capabilities
Training GPT
Pretraining Objective
Next-token prediction (causal language modeling):
```python
def compute_loss(model, batch):
    token_ids = batch  # (batch_size, seq_len)
    # Input: all tokens except the last
    inputs = token_ids[:, :-1]
    # Target: all tokens except the first
    targets = token_ids[:, 1:]
    # Forward pass
    logits = model(inputs)  # (batch_size, seq_len - 1, vocab_size)
    # Cross-entropy loss over every position
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1)
    )
    return loss
```
Training Data
GPT-3 Training Data:
- Common Crawl (filtered): 410B tokens
- WebText2: 19B tokens
- Books1 & Books2: 67B tokens
- Wikipedia: 3B tokens
- Total: ~500B tokens
Training Details:
- Single training pass over data
- Careful deduplication
- Quality filtering crucial
- Extensive preprocessing
Training at Scale
Distributed Training:
- Data parallelism: Same model on multiple GPUs, different data
- Model parallelism: Model split across GPUs
- Pipeline parallelism: Layers on different GPUs
- Tensor parallelism: Operations split across GPUs
Optimization:
- Adam optimizer with specific hyperparameters
- Learning rate warmup then decay
- Gradient clipping for stability
- Mixed precision training (fp16/bf16)
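These optimization pieces might be wired together as in the sketch below: AdamW, linear warmup followed by cosine decay, and gradient clipping. The model and hyperparameters here are illustrative stand-ins, not GPT-3's actual training configuration:

```python
import math
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the full GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 100, 1000

def lr_scale(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps               # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))    # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

def train_step(batch_x, batch_y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)
    loss.backward()
    # Clip the global gradient norm for stability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```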
Compute Requirements:
- GPT-3: ~3,640 petaflop/s-days
- GPT-4: Estimated 100× more
- Thousands of GPUs for months
Inference and Generation
Next-Token Prediction
At inference, generate one token at a time:
```python
@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=1.0):
    generated = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # Get predictions for the last position
        logits = model(generated)[:, -1, :]  # (batch, vocab)
        # Apply temperature
        logits = logits / temperature
        # Sample from the resulting distribution
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Append and continue
        generated = torch.cat([generated, next_token], dim=1)
    return generated
```
Sampling Strategies
Temperature:
- Lower (0.1-0.5): More deterministic, focused
- Higher (0.8-1.2): More diverse, creative
- Temperature = 0: greedy decoding (implemented as argmax rather than a literal division by zero)
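The effect is easy to see on a toy three-token distribution: lower temperature sharpens it toward the most likely token, higher temperature flattens it toward uniform.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])  # made-up logits for three tokens

for t in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / t, dim=-1)
    print(f"T={t}: {probs.tolist()}")  # sharper at low T, flatter at high T
```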
Top-k Sampling:
Only consider the k most likely tokens:
```python
top_k_values, top_k_indices = torch.topk(logits, k, dim=-1)
probs = F.softmax(top_k_values, dim=-1)
idx = torch.multinomial(probs, num_samples=1)
next_token = torch.gather(top_k_indices, -1, idx)
```
Top-p (Nucleus) Sampling:
Consider tokens until cumulative probability reaches p:
```python
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
mask = cumulative > p
mask[..., 1:] = mask[..., :-1].clone()  # shift so the token that crosses p survives
mask[..., 0] = False                    # always keep the most likely token
sorted_probs[mask] = 0.0
sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
idx = torch.multinomial(sorted_probs, num_samples=1)
next_token = torch.gather(sorted_indices, -1, idx)
```
KV Caching
Naive generation recomputes all attention for each new token. KV caching stores previous key-value computations:
```python
class CachedMultiHeadAttention(nn.Module):
    def forward(self, x, past_kv=None):
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        if past_kv is not None:
            past_K, past_V = past_kv
            K = torch.cat([past_K, K], dim=1)
            V = torch.cat([past_V, V], dim=1)
        # Compute attention with the full K, V
        # ... attention computation ...
        return output, (K, V)  # Return updated cache
```
This reduces generation from O(n²) to O(n) per token.
Inside GPT: What the Model Learns
Attention Patterns
Research has identified interpretable attention patterns:
Syntactic Heads:
- Subject-verb agreement
- Modifier-noun relationships
- Clause boundaries
Semantic Heads:
- Coreference resolution
- Named entity relationships
- Topical connections
Position Heads:
- Previous token attention
- Fixed offset patterns
- Beginning/end of sequence
Layer Specialization
Different layers perform different functions:
Early Layers:
- Basic syntactic processing
- Local relationships
- Feature extraction
Middle Layers:
- Semantic integration
- Longer-range dependencies
- Factual retrieval
Late Layers:
- Task-specific processing
- Output preparation
- Next-token refinement
Emergent Capabilities
Large models exhibit capabilities not explicitly trained:
Few-Shot Learning:
Models can learn from examples in the prompt.
Chain of Thought:
Step-by-step reasoning improves accuracy.
In-Context Learning:
New capabilities from context alone.
These emerge from scale—smaller models lack them.
Conclusion
The GPT architecture, while conceptually elegant, achieves remarkable capabilities through the combination of self-attention, feed-forward networks, and massive scale. Each component serves a specific purpose: attention enables flexible information routing, FFN layers provide computational depth and knowledge storage, and residual connections enable training of very deep networks.
Understanding this architecture is increasingly important for AI practitioners. Whether you’re building applications on GPT APIs, fine-tuning models for specific tasks, or developing the next generation of architectures, grasping these fundamentals enables more effective work.
The key insights:
- Self-attention allows each token to gather information from any other position
- Causal masking enables autoregressive generation
- Scale (parameters, data, compute) unlocks emergent capabilities
- Engineering innovations (KV caching, efficient attention) make deployment practical
GPT has defined an era in AI. While future architectures will certainly improve upon it, the principles it established—attention-based processing, large-scale pretraining, and generative modeling—will likely remain influential for years to come.
---
*Found this technical deep-dive valuable? Subscribe to SynaiTech Blog for more explorations of AI architectures and technologies. From fundamentals to cutting-edge research, we help practitioners understand and build with modern AI. Join our community of engineers and researchers.*