*Published on SynaiTech Blog | Category: AI Technical Deep-Dive*

Introduction

GPT—Generative Pre-trained Transformer—has become synonymous with the AI revolution. From ChatGPT’s viral launch to GPT-4’s multimodal capabilities, this architecture has defined a new era in artificial intelligence. Yet despite its ubiquity, most discussions of GPT remain superficial, treating it as a black box that magically produces human-like text.

This comprehensive technical deep-dive explains how GPT actually works—from the fundamental mathematics of attention to the engineering that enables models with hundreds of billions of parameters. Whether you’re a machine learning engineer seeking deeper understanding, a developer building on GPT APIs, or a technical leader making AI decisions, this article provides the detailed understanding necessary for informed work in the age of large language models.

GPT in Context

The Transformer Lineage

GPT builds on the Transformer architecture introduced in 2017:

Key Predecessors:

  • Attention mechanisms in sequence-to-sequence models
  • The original “Attention Is All You Need” paper
  • BERT (bidirectional encoder representations)
  • Various encoder-decoder architectures

GPT’s Distinctive Approach:

  • Decoder-only architecture (no encoder)
  • Unidirectional (causal) attention
  • Next-token prediction objective
  • Large-scale pretraining

Evolution:

  • GPT-1 (2018): 117M parameters, proof of concept
  • GPT-2 (2019): 1.5B parameters, impressive generation
  • GPT-3 (2020): 175B parameters, few-shot learning
  • GPT-4 (2023): Multimodal, significant capability jump
  • GPT-4o/4.5 (2024-2025): Continued improvements

Core Design Principles

Autoregressive Generation:

GPT predicts the next token given all previous tokens:

```
P(x_1, x_2, ..., x_n) = ∏_{i=1}^{n} P(x_i | x_1, ..., x_{i-1})
```
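As a toy illustration of this factorization (the two-word vocabulary and all probabilities here are invented), the joint probability of a sequence is just the product of per-step conditionals:

```python
# Toy autoregressive model over the vocabulary {"a", "b"}.
# Each entry maps a prefix (tuple of tokens) to P(next token | prefix).
cond = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"a": 0.1, "b": 0.9},
    ("a", "b"): {"a": 0.7, "b": 0.3},
}

def sequence_probability(tokens):
    # chain rule: P(x1..xn) = prod_i P(x_i | x_1..x_{i-1})
    p = 1.0
    for i, tok in enumerate(tokens):
        p *= cond[tuple(tokens[:i])][tok]
    return p

print(sequence_probability(["a", "b", "a"]))  # 0.6 * 0.9 * 0.7 ≈ 0.378
```

A real GPT computes each conditional with the full network rather than a lookup table, but the probabilistic structure is exactly this.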

Unidirectional Attention:

Each position can only attend to earlier positions—it cannot "see" future tokens. This enables generation and prevents information leakage.

Scale as Capability:

GPT's power comes largely from scale: more parameters, more data, more compute. This "scaling hypothesis" has been validated repeatedly.

Tokenization: From Text to Numbers

Why Tokenization Matters

Neural networks operate on numbers, not text. Tokenization bridges this gap.

Challenges:

  • Vocabulary size tradeoffs (a larger vocabulary encodes text in fewer tokens but enlarges the embedding and output layers and slows training)
  • Out-of-vocabulary handling
  • Multilingual support
  • Efficiency for generation

Byte Pair Encoding (BPE)

GPT uses BPE tokenization:

Algorithm:

  1. Start with character-level vocabulary
  2. Count all adjacent pairs in training corpus
  3. Merge most frequent pair into new token
  4. Repeat until desired vocabulary size

Example:

```
Starting vocabulary:    ['l', 'o', 'w', 'e', 'r', 'n', 's', 't']
After merging 'l'+'o':  ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo']
After merging 'lo'+'w': ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'lo', 'low']
Continue until the target vocabulary size is reached...
```
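The merge loop above can be sketched in a few lines of framework-free Python (the toy corpus is invented for illustration; production tokenizers add byte-level handling and careful tie-breaking):

```python
from collections import Counter

def most_frequent_pair(words):
    # count adjacent token pairs across the corpus
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    # replace every occurrence of the pair with a single merged token
    a, b = pair
    merged_corpus = []
    for w in words:
        merged, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(w[i])
                i += 1
        merged_corpus.append(merged)
    return merged_corpus

# step 1 of the algorithm: start from characters
corpus = [list("lower"), list("lowest"), list("newer")]
merges = []
for _ in range(3):
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges[0])  # ('w', 'e') — the most frequent pair in this toy corpus
```

Merging never loses information: joining the tokens of any word always reconstructs the original string.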

Properties:

  • Common words become single tokens: "the" → one token ID
  • Rare words split into subwords: "tokenization" → [token, ization]
  • Any text can be encoded (no OOV)
  • Typical vocabulary: 50,000-100,000 tokens

Tokenization in Practice

Example token IDs (illustrative; exact IDs depend on the tokenizer version, and GPT-4 uses a different vocabulary than earlier models):

```
"Hello, world!" → [15496, 11, 995, 0]

"Tokenization is fascinating" → [10389, 2065, 338, 27594, 310]
```

Considerations:

  • Tokenization affects context length (tokens ≠ words)
  • Some languages tokenize more efficiently
  • Numbers and code have specific tokenization patterns
  • Whitespace handling varies

The Embedding Layer

Token Embeddings

Each token maps to a learned vector:

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

# (these imports are assumed by all snippets in this article)

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, token_ids):
        # token_ids: (batch_size, seq_len)
        return self.embedding(token_ids)
        # Output: (batch_size, seq_len, d_model)
```

Dimensionality:

  • GPT-3: d_model = 12,288
  • GPT-4: Estimated d_model = 16,384 or higher
  • Higher dimensions capture more nuance

Positional Encoding

Transformers have no inherent notion of order. Position must be added.

Learned Position Embeddings (GPT-2/3):

```python
class PositionEmbedding(nn.Module):
    def __init__(self, max_seq_len, d_model):
        super().__init__()
        self.position_embedding = nn.Embedding(max_seq_len, d_model)

    def forward(self, seq_len):
        positions = torch.arange(seq_len)
        return self.position_embedding(positions)
```

Combined Input:

```python
def prepare_input(token_ids):
    token_emb = token_embedding(token_ids)
    pos_emb = position_embedding(token_ids.shape[1])
    return token_emb + pos_emb
```

Modern Alternatives:

  • RoPE (Rotary Position Embeddings): Used in newer models
  • ALiBi (Attention with Linear Biases): Alternative approach
  • These enable better length generalization
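To make the RoPE idea concrete, here is a minimal framework-free sketch: each consecutive pair of dimensions is rotated by an angle proportional to the absolute position, so the attention dot product between a rotated query and key depends only on their relative offset. (The `base` constant and pairing convention follow the common formulation; real implementations vectorize this.)

```python
import math

def rope(vec, pos, base=10000.0):
    # rotate each (even, odd) dimension pair by pos * theta_i,
    # where theta_i shrinks geometrically with the pair index
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = base ** (-i / d)
        angle = pos * theta
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.5, 0.8, -0.4, 0.1, 0.9, -0.7]
k = [1.1, 0.2, -0.6, 0.4, 0.3, -0.9, 0.5, 0.2]

# the score depends only on the offset (2 in both cases), not absolute position
near = dot(rope(q, 3), rope(k, 1))
far = dot(rope(q, 103), rope(k, 101))
print(abs(near - far) < 1e-9)  # True
```

This relative-position property is what lets RoPE-based models generalize more gracefully to sequence lengths beyond those seen in training.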

Self-Attention: The Core Mechanism

Intuition for Attention

Attention answers: "When processing this token, which other tokens are relevant?"

Consider: "The cat sat on the mat because it was tired."

When processing "it":

  • Should attend strongly to "cat" (what "it" refers to)
  • Might attend to "sat" (what action "it" did)
  • Should attend weakly to "mat" (not the referent)

Attention learns these patterns from data.

Query, Key, Value Framework

Each token has three representations:

Query (Q): What am I looking for?

Key (K): What do I contain?

Value (V): What information do I pass forward?

```python
class AttentionQKV(nn.Module):
    def __init__(self, d_model, d_k):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_k)
        self.W_k = nn.Linear(d_model, d_k)
        self.W_v = nn.Linear(d_model, d_k)

    def forward(self, x):
        Q = self.W_q(x)  # (batch, seq, d_k)
        K = self.W_k(x)  # (batch, seq, d_k)
        V = self.W_v(x)  # (batch, seq, d_k)
        return Q, K, V
```

Scaled Dot-Product Attention

The attention calculation:

```
Attention(Q, K, V) = softmax(QK^T / √d_k) × V
```

Step by step:

  1. Compute attention scores:

```python
scores = Q @ K.transpose(-2, -1)  # (batch, seq, seq)
```

Each entry (i, j) measures how much position i should attend to position j.

  2. Scale:

```python
scores = scores / math.sqrt(d_k)
```

Prevents softmax saturation for large d_k.
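A quick deterministic illustration of why the scaling matters (numbers chosen by hand, with d_k = 100): raw dot products grow like √d_k, and an unscaled softmax collapses to near one-hot, starving other positions of gradient.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

d_k = 100
raw_scores = [10.0, 0.0, 0.0]                             # magnitudes grow like sqrt(d_k)
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]  # -> [1.0, 0.0, 0.0]

print(max(softmax(raw_scores)))     # ~0.9999: nearly one-hot (saturated)
print(max(softmax(scaled_scores)))  # ~0.576: still a soft distribution
```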

  3. Apply causal mask (GPT-specific):

```python
causal_mask = torch.triu(torch.ones(seq, seq), diagonal=1).bool()
scores.masked_fill_(causal_mask, float('-inf'))
```

Positions cannot attend to future tokens.

  4. Softmax:

```python
attention_weights = F.softmax(scores, dim=-1)  # (batch, seq, seq)
```

Each row sums to 1—a probability distribution over positions.

  5. Weighted sum:

```python
output = attention_weights @ V  # (batch, seq, d_k)
```

Each position becomes a weighted combination of value vectors.
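Putting the five steps together, causal scaled-dot-product attention for a single head can be written framework-free (lists of vectors stand in for tensors; a real implementation batches all of this with matrix operations):

```python
import math

def causal_attention(Q, K, V):
    # Q, K, V: one d-dimensional vector per position
    n, d = len(Q), len(Q[0])
    outputs = []
    for i in range(n):
        # steps 1-3: scaled scores against positions 0..i only (causal mask)
        scores = [sum(Q[i][t] * K[j][t] for t in range(d)) / math.sqrt(d)
                  for j in range(i + 1)]
        # step 4: softmax over the visible positions
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # step 5: weighted sum of value vectors
        outputs.append([sum(weights[j] * V[j][t] for j in range(i + 1))
                        for t in range(d)])
    return outputs

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(Q, K, V)
# position 0 can only attend to itself, so its output equals V[0]
print(out[0])  # [1.0, 0.0]
```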

Multi-Head Attention

Single attention may be limiting. Multiple "heads" allow different attention patterns:

```python
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        batch, seq_len, d_model = x.shape

        # Project to Q, K, V
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)

        # Reshape to (batch, num_heads, seq_len, d_k)
        Q = Q.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = K.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = V.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        # Attention scores
        scores = (Q @ K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores.masked_fill_(mask, float('-inf'))
        attention = F.softmax(scores, dim=-1)

        # Apply to values
        context = attention @ V  # (batch, heads, seq, d_k)

        # Concatenate heads
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)

        # Final projection
        return self.W_o(context)
```

Typical Configurations:

  • GPT-3: 96 heads, d_k = 128
  • Each head can learn different patterns (syntax, semantics, coreference)

Feed-Forward Networks

Purpose

After attention aggregates information across positions, the feed-forward network (FFN) processes each position independently:

```python
class FeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.linear1(x)
        x = F.gelu(x)  # GPT uses GELU activation
        x = self.dropout(x)
        x = self.linear2(x)
        return x
```

Expansion Factor

The hidden dimension d_ff is typically 4× the model dimension:

  • GPT-3: d_model = 12,288, d_ff = 49,152
  • This expansion provides representational capacity

FFN as Memory

Research suggests FFN layers store factual knowledge:

  • First layer acts as "key" matcher
  • Second layer retrieves associated "value"
  • Analogous to key-value memory

The Complete Transformer Block

Layer Normalization

GPT uses layer normalization to stabilize training:

```python
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        # eps goes inside the square root for numerical stability
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
```

Residual Connections

Skip connections add input to output of each sublayer:

```python
output = layer(x) + x  # Residual connection
```

Benefits:

  • Enables training of very deep networks
  • Provides gradient highways
  • Allows layers to learn residual functions

Pre-Norm vs Post-Norm

Post-Norm (Original Transformer):

```python
x = self.norm(x + self.attention(x))
x = self.norm(x + self.ffn(x))
```

Pre-Norm (GPT-3 and most modern LLMs):

```python
x = x + self.attention(self.norm(x))
x = x + self.ffn(self.norm(x))
```

Pre-norm is more stable for very deep networks.

Complete Block

```python
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn = FeedForward(d_model, d_ff, dropout)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Pre-norm self-attention
        attn_out = self.attention(self.norm1(x), mask)
        x = x + self.dropout(attn_out)
        # Pre-norm feed-forward
        ffn_out = self.ffn(self.norm2(x))
        x = x + self.dropout(ffn_out)
        return x
```

Stacking It All Together

The Full GPT Model

```python
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, num_heads, num_layers,
                 max_seq_len, d_ff, dropout=0.1):
        super().__init__()
        # Embeddings
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.position_embedding = nn.Embedding(max_seq_len, d_model)
        self.dropout = nn.Dropout(dropout)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Final layer norm
        self.final_norm = LayerNorm(d_model)
        # Output projection (often tied to token embedding)
        self.output_projection = nn.Linear(d_model, vocab_size, bias=False)
        self.output_projection.weight = self.token_embedding.weight  # Weight tying

    def forward(self, token_ids):
        batch_size, seq_len = token_ids.shape
        # Embeddings
        positions = torch.arange(seq_len, device=token_ids.device)
        x = self.token_embedding(token_ids) + self.position_embedding(positions)
        x = self.dropout(x)
        # Causal mask
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        mask = mask.to(token_ids.device)
        # Through all transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Final norm and output
        x = self.final_norm(x)
        logits = self.output_projection(x)  # (batch, seq_len, vocab_size)
        return logits
```

Model Sizes

GPT-3 Configuration:

  • Parameters: 175 billion
  • d_model: 12,288
  • num_heads: 96
  • num_layers: 96
  • d_ff: 49,152
  • max_seq_len: 2,048
  • vocab_size: 50,257
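As a sanity check, the configuration above roughly reproduces the 175B headline figure with simple arithmetic (ignoring biases and layer-norm parameters, which are comparatively tiny):

```python
d_model, d_ff = 12288, 49152
num_layers, vocab_size, max_seq_len = 96, 50257, 2048

attn_params = 4 * d_model * d_model   # W_q, W_k, W_v, W_o
ffn_params = 2 * d_model * d_ff       # two linear layers
per_layer = attn_params + ffn_params
embeddings = vocab_size * d_model + max_seq_len * d_model

total = num_layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B")  # ~174.6B, close to the reported 175B
```

Note that roughly two-thirds of each layer's parameters live in the FFN, consistent with the 4× expansion factor discussed earlier.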

GPT-4 (Estimated):

  • Parameters: ~1.8 trillion (rumored)
  • Mixture of Experts architecture
  • Longer context window
  • Multimodal capabilities

Training GPT

Pretraining Objective

Next-token prediction (causal language modeling):

```python
def compute_loss(model, batch):
    token_ids = batch  # (batch_size, seq_len)
    # Input: all tokens except last
    inputs = token_ids[:, :-1]
    # Target: all tokens except first
    targets = token_ids[:, 1:]
    # Forward pass
    logits = model(inputs)  # (batch_size, seq_len - 1, vocab_size)
    # Cross-entropy loss
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    return loss
```

Training Data

GPT-3 Training Data:

  • Common Crawl (filtered): 410B tokens
  • WebText2: 19B tokens
  • Books1 & Books2: 67B tokens
  • Wikipedia: 3B tokens
  • Total: ~500B tokens

Training Details:

  • Single training pass over data
  • Careful deduplication
  • Quality filtering crucial
  • Extensive preprocessing
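To give a flavor of that preprocessing, here is a toy exact-match deduplication pass (the documents are invented; production pipelines use fuzzy methods such as MinHash to also catch near-duplicates):

```python
def deduplicate(docs):
    # drop exact duplicates after light normalization
    seen, kept = set(), []
    for doc in docs:
        key = hash(" ".join(doc.lower().split()))
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = [
    "The Transformer was introduced in 2017.",
    "the  transformer was introduced in 2017.",  # duplicate up to case/whitespace
    "GPT uses a decoder-only architecture.",
]
print(len(deduplicate(docs)))  # 2
```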

Training at Scale

Distributed Training:

  • Data parallelism: Same model on multiple GPUs, different data
  • Model parallelism: Model split across GPUs
  • Pipeline parallelism: Layers on different GPUs
  • Tensor parallelism: Operations split across GPUs

Optimization:

  • Adam optimizer with specific hyperparameters
  • Learning rate warmup then decay
  • Gradient clipping for stability
  • Mixed precision training (fp16/bf16)
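The warmup-then-decay schedule can be sketched as follows (the hyperparameter values are illustrative rather than GPT-3's exact settings; cosine decay is one common choice):

```python
import math

def lr_schedule(step, max_lr=6e-4, min_lr=6e-5, warmup=2000, total=300000):
    if step < warmup:
        # linear warmup from near zero up to max_lr
        return max_lr * (step + 1) / warmup
    # cosine decay from max_lr down to the min_lr floor
    progress = min((step - warmup) / (total - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

print(lr_schedule(0))       # tiny first-step value
print(lr_schedule(1999))    # ≈ max_lr: peak at the end of warmup
print(lr_schedule(300000))  # ≈ min_lr: fully decayed floor
```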

Compute Requirements:

  • GPT-3: ~3,640 petaflop/s-days
  • GPT-4: Estimated 100× more
  • Thousands of GPUs for months

Inference and Generation

Next-Token Prediction

At inference, generate one token at a time:

```python
@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens, temperature=1.0):
    generated = prompt_ids.clone()
    for _ in range(max_new_tokens):
        # Get predictions for last position
        logits = model(generated)[:, -1, :]  # (batch, vocab)
        # Apply temperature
        logits = logits / temperature
        # Sample
        probs = F.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        # Append
        generated = torch.cat([generated, next_token], dim=1)
    return generated
```

Sampling Strategies

Temperature:

  • Lower (0.1-0.5): More deterministic, focused
  • Higher (0.8-1.2): More diverse, creative
  • Temperature → 0: approaches greedy decoding (argmax); implementations special-case 0 to avoid division by zero
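The effect is easy to see numerically (logits invented for illustration): dividing by a low temperature sharpens the softmax toward one-hot, while a high temperature flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, 0.2)
hot = softmax_with_temperature(logits, 2.0)

print(max(cold))  # ~0.993: nearly deterministic
print(max(hot))   # ~0.506: much flatter
```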

Top-k Sampling:

Only consider the k most likely tokens:

```python
top_k_values, top_k_indices = torch.topk(logits, k, dim=-1)
probs = F.softmax(top_k_values, dim=-1)
idx = torch.multinomial(probs, num_samples=1)      # index into the top-k set
next_token = torch.gather(top_k_indices, -1, idx)  # map back to vocabulary ids
```

Top-p (Nucleus) Sampling:

Consider tokens until cumulative probability reaches p:

```python
sorted_probs, sorted_indices = torch.sort(probs, descending=True)
cumulative = torch.cumsum(sorted_probs, dim=-1)
# mask tokens once cumulative probability exceeds p, shifted right
# so the first token crossing the threshold is still kept
mask = cumulative > p
mask[..., 1:] = mask[..., :-1].clone()
mask[..., 0] = False
sorted_probs[mask] = 0.0
sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
idx = torch.multinomial(sorted_probs, num_samples=1)
next_token = torch.gather(sorted_indices, -1, idx)  # map back to vocabulary ids
```

KV Caching

Naive generation recomputes all attention for each new token. KV caching stores previous key-value computations:

```python
class CachedMultiHeadAttention(nn.Module):
    def forward(self, x, past_kv=None):
        Q = self.W_q(x)
        K = self.W_k(x)
        V = self.W_v(x)
        if past_kv is not None:
            past_K, past_V = past_kv
            K = torch.cat([past_K, K], dim=1)
            V = torch.cat([past_V, V], dim=1)
        # Compute attention with full K, V
        # ... attention computation ...
        return output, (K, V)  # Return updated cache
```

This reduces per-token generation cost from O(n²) to O(n).
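A minimal single-head, framework-free sketch shows why caching is safe: attending with the accumulated cache at each step gives exactly the same last-position output as recomputing over the full sequence (the per-token vectors are invented, standing in for the projected Q, K, V):

```python
import math

def attend(q, Ks, Vs):
    # one query vector against a list of cached key/value vectors
    d = len(q)
    scores = [sum(q[t] * k[t] for t in range(d)) / math.sqrt(d) for k in Ks]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    return [sum(weights[j] * Vs[j][t] for j in range(len(Ks))) for t in range(d)]

tokens = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4]]

# incremental generation: append one K and V per step, attend once
cache_K, cache_V, cached_out = [], [], None
for x in tokens:
    cache_K.append(x)
    cache_V.append(x)
    cached_out = attend(x, cache_K, cache_V)

# recomputing from scratch at the last step gives the same result
full_out = attend(tokens[-1], tokens, tokens)
print(all(abs(a - b) < 1e-12 for a, b in zip(cached_out, full_out)))  # True
```

Causal masking is what makes this possible: earlier keys and values never change once computed, so they can be stored rather than recomputed.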

Inside GPT: What the Model Learns

Attention Patterns

Research has identified interpretable attention patterns:

Syntactic Heads:

  • Subject-verb agreement
  • Modifier-noun relationships
  • Clause boundaries

Semantic Heads:

  • Coreference resolution
  • Named entity relationships
  • Topical connections

Position Heads:

  • Previous token attention
  • Fixed offset patterns
  • Beginning/end of sequence

Layer Specialization

Different layers perform different functions:

Early Layers:

  • Basic syntactic processing
  • Local relationships
  • Feature extraction

Middle Layers:

  • Semantic integration
  • Longer-range dependencies
  • Factual retrieval

Late Layers:

  • Task-specific processing
  • Output preparation
  • Next-token refinement

Emergent Capabilities

Large models exhibit capabilities not explicitly trained:

Few-Shot Learning:

Models can learn from examples in the prompt.

Chain of Thought:

Step-by-step reasoning improves accuracy.

In-Context Learning:

New capabilities from context alone.

These emerge from scale—smaller models lack them.

Conclusion

The GPT architecture, while conceptually elegant, achieves remarkable capabilities through the combination of self-attention, feed-forward networks, and massive scale. Each component serves a specific purpose: attention enables flexible information routing, FFN layers provide computational depth and knowledge storage, and residual connections enable training of very deep networks.

Understanding this architecture is increasingly important for AI practitioners. Whether you’re building applications on GPT APIs, fine-tuning models for specific tasks, or developing the next generation of architectures, grasping these fundamentals enables more effective work.

The key insights:

  • Self-attention allows each token to gather information from any other position
  • Causal masking enables autoregressive generation
  • Scale (parameters, data, compute) unlocks emergent capabilities
  • Engineering innovations (KV caching, efficient attention) make deployment practical

GPT has defined an era in AI. While future architectures will certainly improve upon it, the principles it established—attention-based processing, large-scale pretraining, and generative modeling—will likely remain influential for years to come.

*Found this technical deep-dive valuable? Subscribe to SynaiTech Blog for more explorations of AI architectures and technologies. From fundamentals to cutting-edge research, we help practitioners understand and build with modern AI. Join our community of engineers and researchers.*
