*Published on SynaiTech Blog | Category: AI Technical Deep-Dive*
Introduction
GPT—Generative Pre-trained Transformer—has become synonymous with the AI revolution. From ChatGPT’s viral launch to GPT-4’s multimodal capabilities, this architecture has defined a new era in artificial intelligence. Yet despite its ubiquity, most discussions of GPT remain superficial, treating it as a black box that magically produces human-like text.
This comprehensive technical deep-dive explains how GPT actually works—from the fundamental mathematics of attention to the engineering that enables models with hundreds of billions of parameters. Whether you’re a machine learning engineer seeking deeper understanding, a developer building on GPT APIs, or a technical leader making AI decisions, this article provides the detailed understanding necessary for informed work in the age of large language models.
GPT in Context
The Transformer Lineage
GPT builds on the Transformer architecture introduced in 2017:
Key Predecessors:
- Attention mechanisms in sequence-to-sequence models
- The original “Attention Is All You Need” paper
- BERT (bidirectional encoder representations)
- Various encoder-decoder architectures
GPT’s Distinctive Approach:
- Decoder-only architecture (no encoder)
- Unidirectional (causal) attention
- Next-token prediction objective
- Large-scale pretraining
Evolution:
- GPT-1 (2018): 117M parameters, proof of concept
- GPT-2 (2019): 1.5B parameters, impressive generation
- GPT-3 (2020): 175B parameters, few-shot learning
- GPT-4 (2023): Multimodal, significant capability jump
- GPT-4o/4.5 (2024-2025): Continued improvements
Core Design Principles
Autoregressive Generation:
GPT predicts the next token given all previous tokens:
```
P(x_1, x_2, ..., x_n) = ∏ P(x_i | x_1, ..., x_{i-1})
```
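This factorization can be sanity-checked on made-up numbers: the joint probability of a sequence is just the product of its next-token conditionals.

```python
# Toy check of the autoregressive factorization. The conditional
# probabilities below are invented for illustration.
conditionals = [0.5, 0.4, 0.9]  # P(x1), P(x2|x1), P(x3|x1,x2)

joint = 1.0
for p in conditionals:
    joint *= p  # chain rule: multiply each next-token conditional

print(joint)  # product of the three conditionals, 0.18
```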
Unidirectional Attention:
Each position can only attend to earlier positions—it cannot "see" future tokens. This enables generation and prevents information leakage.
Scale as Capability:
GPT's power comes largely from scale: more parameters, more data, more compute. This "scaling hypothesis" has been validated repeatedly.
Tokenization: From Text to Numbers
Why Tokenization Matters
Neural networks operate on numbers, not text. Tokenization bridges this gap.
Challenges:
- Vocabulary size tradeoffs (a larger vocabulary represents text more compactly but means a larger embedding table and output softmax)
- Out-of-vocabulary handling
- Multilingual support
- Efficiency for generation
Byte Pair Encoding (BPE)
GPT uses BPE tokenization:
Algorithm:
- Start with character-level vocabulary
- Count all adjacent pairs in training corpus
- Merge most frequent pair into new token
- Repeat until desired vocabulary size
Example:
```
Vocabulary: ['l', 'o', 'w', 'e', 'r', 'n', 's', 't']
After merging 'l'+'o':  ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo']
After merging 'lo'+'w': ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo', 'low']
Continue until the target vocabulary size is reached...
```
Properties:
- Common words become single tokens: "the" → one token
- Rare words split into subwords: "tokenization" → [token, ization]
- Any text can be encoded (no OOV)
- Typical vocabulary: 50,000-100,000 tokens
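The merge loop described above can be sketched in a few lines of Python. This is a toy trainer on a made-up word-frequency corpus, not a production tokenizer:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a tiny corpus.
    `words` maps a word (as a tuple of symbols) to its frequency."""
    vocab = dict(words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a new token
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = freq
        vocab = merged
    return merges

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6}
print(bpe_merges(corpus, 3))  # first learned merge is ('w', 'e') for this corpus
```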
Tokenization in Practice
Example (GPT-2-family tokenizer; exact IDs differ across tokenizer versions):
```
"Hello, world!" → [15496, 11, 995, 0]
"Tokenization is fascinating" → [10389, 2065, 338, 27594, 310]
```
Considerations:
- Tokenization affects context length (tokens ≠ words)
- Some languages tokenize more efficiently
- Numbers and code have specific tokenization patterns
- Whitespace handling varies
The Embedding Layer
Token Embeddings
Each token maps to a learned vector:
```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len)
        return self.embedding(token_ids)  # (batch_size, seq_len, d_model)
```
Dimensionality:
- GPT-3: d_model = 12,288
- GPT-4: Estimated d_model = 16,384 or higher
- Higher dimensions capture more nuance
Positional Encoding
Transformers have no inherent notion of order. Position must be added.
Learned Position Embeddings (GPT-2/3):
```python
class PositionEmbedding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super().__init__()
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, seq_len):
        positions = torch.arange(seq_len)
        return self.position_embedding(positions)
```
Combined Input:
```python
def prepare_input(token_ids):
    token_emb = token_embedding(token_ids)
    pos_emb = position_embedding(token_ids.shape[1])
    return token_emb + pos_emb
```
Modern Alternatives:
- RoPE (Rotary Position Embeddings): Used in newer models
- ALiBi (Attention with Linear Biases): Alternative approach
- These enable better length generalization
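For intuition, here is a minimal sketch of the RoPE idea: each consecutive pair of dimensions in a query or key vector is rotated by an angle proportional to its position, so dot products between rotated vectors depend on relative position. The base frequency follows the common convention; this is an illustration, not a production implementation:

```python
import torch

def rope(x, base=10000.0):
    """Minimal RoPE sketch. x: (batch, seq, d) with even d."""
    b, seq, d = x.shape
    half = d // 2
    pos = torch.arange(seq, dtype=torch.float32).unsqueeze(-1)         # (seq, 1)
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos * freqs                                               # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    # Rotate each (even, odd) dimension pair by its position-dependent angle
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, vector norms are preserved, and position 0 (angle zero) is left unchanged.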
Self-Attention: The Core Mechanism
Intuition for Attention
Attention answers: "When processing this token, which other tokens are relevant?"
Consider: "The cat sat on the mat because it was tired."
When processing "it":
- Should attend strongly to "cat" (what "it" refers to)
- Might attend to "sat" (what action "it" did)
- Should attend weakly to "mat" (not the referent)
Attention learns these patterns from data.
Query, Key, Value Framework
Each token has three representations:
Query (Q): What am I looking for?
Key (K): What do I contain?
Value (V): What information do I pass forward?
```python
class AttentionQKV(nn.Module):
    def __init__(self, d_model, d_k):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_k)
        self.W_k = nn.Linear(d_model, d_k)
        self.W_v = nn.Linear(d_model, d_k)

    def forward(self, x):
        Q = self.W_q(x)  # (batch, seq, d_k)
        K = self.W_k(x)  # (batch, seq, d_k)
        V = self.W_v(x)  # (batch, seq, d_k)
        return Q, K, V
```
Scaled Dot-Product Attention
The attention calculation:
```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```
Step by step:
- Compute attention scores:
```python
scores = Q @ K.transpose(-2, -1)  # (batch, seq, seq)
```
Each entry (i, j) measures how much position i should attend to position j.
- Scale:
```python
scores = scores / math.sqrt(d_k)
```
Prevents softmax saturation for large d_k.
- Apply causal mask (GPT-specific):
```python
causal_mask = torch.triu(torch.ones(seq, seq), diagonal=1).bool()
scores.masked_fill_(causal_mask, float('-inf'))
```
Positions cannot attend to future tokens.
- Softmax:
```python
attention_weights = F.softmax(scores, dim=-1)  # (batch, seq, seq)
```
Each row sums to 1—a probability distribution over positions.
- Weighted sum:
```python
output = attention_weights @ V  # (batch, seq, d_k)
```
Each position becomes a weighted combination of value vectors.
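Putting the five steps together, a complete single-head causal attention function might look like this sketch:

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(Q, K, V):
    """All five steps in one function. Q, K, V: (batch, seq, d_k)."""
    d_k = Q.size(-1)
    seq = Q.size(-2)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # steps 1-2: score, scale
    mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float('-inf'))        # step 3: causal mask
    weights = F.softmax(scores, dim=-1)                     # step 4: rows sum to 1
    return weights @ V                                      # step 5: weighted sum
```

A useful check: the first position can only attend to itself, so its output is exactly its own value vector.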
Multi-Head Attention
Single attention may be limiting. Multiple "heads" allow different attention patterns:
```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape
        # Project to Q, K, V
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        # Reshape to (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # Attention scores
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask, float('-inf'))
        attention = F.softmax(scores, dim=-1)
        # Apply to values
        context = attention @ V  # (batch, heads, seq, d_k)
        # Concatenate heads
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        # Final projection
        return self.W_o(context)
```
Typical Configurations:
- GPT-3: 96 heads, d_k = 128
- Each head can learn different patterns (syntax, semantics, coreference)
Feed-Forward Networks
Purpose
After attention aggregates information across positions, the feed-forward network (FFN) processes each position independently:
```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.linear1(x)
        x = F.gelu(x)  # GPT uses the GELU activation
        x = self.dropout(x)
        x = self.linear2(x)
        return x
```
Expansion Factor
The hidden dimension d_ff is typically 4× the model dimension:
- GPT-3: d_model = 12,288, d_ff = 49,152
- This expansion provides representational capacity
FFN as Memory
Research suggests FFN layers store factual knowledge:
- First layer acts as "key" matcher
- Second layer retrieves associated "value"
- Analogous to key-value memory
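A toy construction makes the analogy concrete. Here we hand-build an FFN in which one hidden unit's input weights act as a "key" pattern and its output weights as the stored "value". This is an interpretive lens from interpretability research, not a literal description of trained weights:

```python
import torch

d_model, d_ff = 4, 3
W1 = torch.zeros(d_ff, d_model)  # rows act as "keys"
W2 = torch.zeros(d_model, d_ff)  # columns act as "values"
key = torch.tensor([1.0, 0.0, 0.0, 0.0])    # pattern hidden unit 0 detects
value = torch.tensor([0.0, 0.0, 2.0, 0.0])  # vector hidden unit 0 writes out
W1[0] = key
W2[:, 0] = value

def ffn(x):
    h = torch.relu(x @ W1.T)  # which "keys" match the input?
    return h @ W2.T           # retrieve the matching "values"

print(ffn(key))   # input matching the key retrieves the stored value
print(ffn(-key))  # non-matching input retrieves nothing (all zeros)
```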
The Complete Transformer Block
Layer Normalization
GPT uses layer normalization to stabilize training:
```python
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```
Residual Connections
Skip connections add input to output of each sublayer:
```python
output = layer(x) + x  # Residual connection
```
Benefits:
- Enables training of very deep networks
- Provides gradient highways
- Allows layers to learn residual functions
Pre-Norm vs Post-Norm
Post-Norm (Original Transformer):
```python
x = self.norm(x + self.attention(x))
x = self.norm(x + self.ffn(x))
```
Pre-Norm (GPT-3 and most modern LLMs):
```python
x = x + self.attention(self.norm(x))
x = x + self.ffn(self.norm(x))
```
Pre-norm is more stable for very deep networks.
Complete Block
```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm self-attention
        attn_out = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_out)
        # Pre-norm feed-forward
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_out)
        return x
```
Stacking It All Together
The Full GPT Model
```python
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers,
                 max_seq_len, d_ff, dropout=0.1):
        super().__init__()
        # Embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Final layer norm
        self.final_norm = LayerNorm(d_model)
        # Output projection (often tied to token embedding)
        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)
        self.output_projection.weight = self.token_embedding.weight  # Weight tying

    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        # Embeddings
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.token_embedding(token_ids) + self.position_embedding(positions)
        x = self.dropout(x)
        # Causal mask
        mask = torch.triu(torch.ones(seq_len, seq_len, device=token_ids.device),
                          diagonal=1).bool()
        # Through all transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Final norm and output
        x = self.final_norm(x)
        logits = self.output_projection(x)  # (batch, seq_len, vocab_size)
        return logits
```
Model Sizes
GPT-3 Configuration:
- Parameters: 175 billion
- d_model: 12,288
- num_heads: 96
- num_layers: 96
- d_ff: 49,152
- max_seq_len: 2,048
- vocab_size: 50,257
GPT-4 (Estimated):
- Parameters: ~1.8 trillion (rumored)
- Mixture of Experts architecture
- Longer context window
- Multimodal capabilities
Training GPT
Pretraining Objective
Next-token prediction (causal language modeling):
```python
def compute_loss(model, batch):
    token_ids = batch  # (batch_size, seq_len)
    # Input: all tokens except the last
    inputs = token_ids[:, :-1]
    # Target: all tokens except the first
    targets = token_ids[:, 1:]
    # Forward pass
    logits = model(inputs)  # (batch_size, seq_len - 1, vocab_size)
    # Cross-entropy loss over every position
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1)
    )
    return loss
```
Training Data
GPT-3 Training Data:
- Common Crawl (filtered): 410B tokens
- WebText2: 19B tokens
- Books1 & Books2: 67B tokens
- Wikipedia: 3B tokens
- Total: ~500B tokens
Training Details:
- Single training pass over data
- Careful deduplication
- Quality filtering crucial
- Extensive preprocessing
Training at Scale
Distributed Training:
- Data parallelism: Same model on multiple GPUs, different data
- Model parallelism: Model split across GPUs
- Pipeline parallelism: Layers on different GPUs
- Tensor parallelism: Operations split across GPUs
Optimization:
- Adam optimizer with specific hyperparameters
- Learning rate warmup then decay
- Gradient clipping for stability
- Mixed precision training (fp16/bf16)
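These optimization pieces might be wired together as in the sketch below: AdamW, linear warmup followed by cosine decay, and gradient clipping. The model and hyperparameters here are illustrative stand-ins, not GPT-3's actual training configuration:

```python
import math
import torch

model = torch.nn.Linear(16, 16)  # stand-in for the full GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4,
                              betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 100, 1000

def lr_scale(step):
    if step < warmup_steps:
        return (step + 1) / warmup_steps               # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))    # cosine decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_scale)

def train_step(batch_x, batch_y):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch_x), batch_y)
    loss.backward()
    # Clip the global gradient norm for stability
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```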
Compute Requirements:
- GPT-3: ~3,640 petaflop/s-days
- GPT-4: Estimated 100× more
- Thousands of GPUs for months
Inference and Generation
Next-Token Prediction
At inference, generate one token at a time:
```python
@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=1.0):
    generated = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # Get predictions for the last position
        logits = model(generated)[:, -1, :]  # (batch, vocab)
        # Apply temperature
        logits = logits / temperature
        # Sample from the resulting distribution
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Append and continue
        generated = torch.cat([generated, next_token], dim=1)
    return generated
```
Sampling Strategies
Temperature:
- Lower (0.1-0.5): More deterministic, focused
- Higher (0.8-1.2): More diverse, creative
- Temperature = 0: greedy decoding (implemented as argmax rather than a literal division by zero)
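The effect is easy to see on a toy three-token distribution: lower temperature sharpens it toward the most likely token, higher temperature flattens it toward uniform.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.0])  # made-up logits for three tokens

for t in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / t, dim=-1)
    print(f"T={t}: {probs.tolist()}")  # sharper at low T, flatter at high T
```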
Top-k Sampling:
Only consider the k most likely tokens:
```python
top_k_values, top_k_indices = torch.topk(logits, k, dim=-1)
probs = F.softmax(top_k_values, dim=-1)
idx = torch.multinomial(probs, num_samples=1)
next_token = torch.gather(top_k_indices, -1, idx)
```
Top-p (Nucleus) Sampling:
Consider tokens until cumulative probability reaches p:
```python
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
mask = cumulative > p
mask[..., 1:] = mask[..., :-1].clone()  # shift so the token that crosses p survives
mask[..., 0] = False                    # always keep the most likely token
sorted_probs[mask] = 0.0
sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
idx = torch.multinomial(sorted_probs, num_samples=1)
next_token = torch.gather(sorted_indices, -1, idx)
```
KV Caching
Naive generation recomputes all attention for each new token. KV caching stores previous key-value computations:
```python
class CachedMultiHeadAttention(nn.Module):
    def forward(self, x, past_kv=None):
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        if past_kv is not None:
            past_K, past_V = past_kv
            K = torch.cat([past_K, K], dim=1)
            V = torch.cat([past_V, V], dim=1)
        # Compute attention with the full K, V
        # ... attention computation ...
        return output, (K, V)  # Return updated cache
```
This reduces generation from O(n²) to O(n) per token.
Inside GPT: What the Model Learns
Attention Patterns
Research has identified interpretable attention patterns:
Syntactic Heads:
- Subject-verb agreement
- Modifier-noun relationships
- Clause boundaries
Semantic Heads:
- Coreference resolution
- Named entity relationships
- Topical connections
Position Heads:
- Previous token attention
- Fixed offset patterns
- Beginning/end of sequence
Layer Specialization
Different layers perform different functions:
Early Layers:
- Basic syntactic processing
- Local relationships
- Feature extraction
Middle Layers:
- Semantic integration
- Longer-range dependencies
- Factual retrieval
Late Layers:
- Task-specific processing
- Output preparation
- Next-token refinement
Emergent Capabilities
Large models exhibit capabilities not explicitly trained:
Few-Shot Learning:
Models can learn from examples in the prompt.
Chain of Thought:
Step-by-step reasoning improves accuracy.
In-Context Learning:
New capabilities from context alone.
These emerge from scale—smaller models lack them.
Conclusion
The GPT architecture, while conceptually elegant, achieves remarkable capabilities through the combination of self-attention, feed-forward networks, and massive scale. Each component serves a specific purpose: attention enables flexible information routing, FFN layers provide computational depth and knowledge storage, and residual connections enable training of very deep networks.
Understanding this architecture is increasingly important for AI practitioners. Whether you’re building applications on GPT APIs, fine-tuning models for specific tasks, or developing the next generation of architectures, grasping these fundamentals enables more effective work.
The key insights:
- Self-attention allows each token to gather information from any other position
- Causal masking enables autoregressive generation
- Scale (parameters, data, compute) unlocks emergent capabilities
- Engineering innovations (KV caching, efficient attention) make deployment practical
GPT has defined an era in AI. While future architectures will certainly improve upon it, the principles it established—attention-based processing, large-scale pretraining, and generative modeling—will likely remain influential for years to come.
---
*Found this technical deep-dive valuable? Subscribe to SynaiTech Blog for more explorations of AI architectures and technologies. From fundamentals to cutting-edge research, we help practitioners understand and build with modern AI. Join our community of engineers and researchers.*