Self-supervised learning has emerged as one of the most promising paradigms in artificial intelligence, fundamentally changing how we think about training machine learning models. By learning from unlabeled data, self-supervised methods have achieved remarkable results in natural language processing, computer vision, and beyond. This comprehensive guide explores the principles, techniques, and applications of self-supervised learning.
The Labeling Problem
Traditional supervised learning requires labeled datasets—examples paired with their correct answers. Creating these labels is expensive, time-consuming, and often requires domain expertise:
- Medical imaging: Expert radiologists must annotate each scan
- Natural language: Human annotators label sentiment, entities, or translations
- Autonomous driving: Thousands of hours of video need pixel-level annotations
Meanwhile, unlabeled data is abundant and cheap:
- Billions of images on the internet
- Trillions of words in text corpora
- Endless video footage and audio recordings
Self-supervised learning bridges this gap by extracting learning signals from the data itself.
What Is Self-Supervised Learning?
Self-supervised learning creates supervisory signals from the data’s own structure. The model solves “pretext tasks”—artificial tasks that don’t require human labels—to learn useful representations.
The key insight: solving these pretext tasks requires understanding the data’s underlying structure, which transfers to downstream tasks.
The Self-Supervised Learning Pipeline
- Pretext Task Design: Create a task that requires meaningful understanding
- Pretraining: Train a model on massive unlabeled data using the pretext task
- Transfer Learning: Fine-tune on downstream tasks with limited labels
Comparison with Other Paradigms
| Paradigm | Label Source | Data Required |
|----------|--------------|---------------|
| Supervised | Human annotations | Labeled data |
| Unsupervised | None (clustering, density) | Unlabeled data |
| Self-supervised | Data itself (automatic) | Unlabeled data |
| Semi-supervised | Few labels + unlabeled | Both |
Self-Supervised Learning in NLP
Natural language processing pioneered many self-supervised techniques, leveraging the sequential nature of text.
Language Modeling
The classic pretext task: predict the next word given previous words.
```python
# Causal language modeling example
text = "The cat sat on the"
target = "mat"
# Model learns: P(mat | The cat sat on the)
```
GPT and similar models use this approach:
```python
import torch
import torch.nn as nn

class CausalLanguageModel(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_layers):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers
        )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        # Causal mask: position i may not attend to positions > i
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
        # Embed and process (positional encodings omitted for brevity)
        x = self.embedding(x)  # (batch, seq, d_model)
        x = self.transformer(x, mask=mask)
        return self.lm_head(x)
```
Masked Language Modeling (MLM)
BERT introduced masked language modeling: randomly mask tokens and predict them.
```python
def mask_tokens(inputs, tokenizer, mlm_probability=0.15):
    """Prepare masked tokens for masked language modeling."""
    labels = inputs.clone()
    # Create mask
    probability_matrix = torch.full(labels.shape, mlm_probability)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # Don't compute loss for unmasked tokens
    # 80% of the time, replace with [MASK]
    indices_replaced = torch.bernoulli(
        torch.full(labels.shape, 0.8)
    ).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.mask_token_id
    # 10% of the time, replace with a random token
    indices_random = torch.bernoulli(
        torch.full(labels.shape, 0.5)
    ).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape)
    inputs[indices_random] = random_words[indices_random]
    # Remaining 10% of the time, keep the token unchanged
    return inputs, labels
```
Next Sentence Prediction (NSP)
BERT also trained on predicting whether two sentences are consecutive:
```python
# Positive example (consecutive sentences)
sentence_a = "The dog is happy."
sentence_b = "It wags its tail."
label = 1

# Negative example (random sentences)
sentence_a = "The dog is happy."
sentence_b = "The stock market crashed."
label = 0
```
Sentence Order Prediction
ALBERT replaced NSP with sentence order prediction—determining if two sentences are in the correct order:
```python
# Correct order
sentence_a = "First, preheat the oven."
sentence_b = "Then, place the dish inside."
label = 1

# Swapped order
sentence_a = "Then, place the dish inside."
sentence_b = "First, preheat the oven."
label = 0
```
Span Corruption (T5)
T5 uses span corruption: mask contiguous spans and reconstruct them:
```python
# Original:  "The quick brown fox jumps over the lazy dog"
# Corrupted: "The <extra_id_0> fox jumps <extra_id_1> dog"
# Target:    "<extra_id_0> quick brown <extra_id_1> over the lazy"
```
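The sentinel scheme above can be sketched as a small function. `corrupt_spans` and its fixed-length span selection are illustrative simplifications (T5 itself samples span lengths and a corruption rate), but the sentinel format matches T5's `<extra_id_N>` convention:

```python
import random

def corrupt_spans(tokens, num_spans=2, span_len=2, seed=0):
    """T5-style span corruption (simplified): replace contiguous spans
    with sentinel tokens and build the matching target sequence."""
    rng = random.Random(seed)
    n = len(tokens)
    # Pick candidate span starts (overlapping picks are skipped below)
    starts = sorted(rng.sample(range(n - span_len), num_spans))
    corrupted, target = [], []
    pos, sentinel = 0, 0
    for s in starts:
        if s < pos:
            continue  # skip spans that overlap a previous one
        corrupted.extend(tokens[pos:s])
        corrupted.append(f"<extra_id_{sentinel}>")
        target.append(f"<extra_id_{sentinel}>")
        target.extend(tokens[s:s + span_len])
        pos = s + span_len
        sentinel += 1
    corrupted.extend(tokens[pos:])
    return corrupted, target

tokens = "The quick brown fox jumps over the lazy dog".split()
inp, tgt = corrupt_spans(tokens)
```

Replacing each sentinel in the corrupted input with the tokens that follow it in the target reconstructs the original sentence, which is exactly what the model is trained to do.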
Self-Supervised Learning in Computer Vision
Visual self-supervision has developed unique approaches suited to image data.
Pretext Tasks for Vision
Rotation Prediction: Predict which rotation (0°, 90°, 180°, 270°) was applied:
```python
class RotationPredictor(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(encoder.output_dim, 4)

    @staticmethod
    def apply_rotations(x, rotation_labels):
        # Rotate each image by label * 90 degrees
        return torch.stack([
            torch.rot90(img, k=int(k), dims=(-2, -1))
            for img, k in zip(x, rotation_labels)
        ])

    def forward(self, x, rotation_labels):
        # Rotate images
        rotated = self.apply_rotations(x, rotation_labels)
        # Predict rotation
        features = self.encoder(rotated)
        predictions = self.classifier(features)
        return predictions
```
Jigsaw Puzzle: Divide image into patches, shuffle, predict permutation:
```python
def create_jigsaw_puzzle(image, num_patches=9):
    """Split image into patches and shuffle."""
    patches = split_into_patches(image, int(num_patches ** 0.5))
    # Create a random permutation
    perm = torch.randperm(num_patches)
    shuffled = patches[perm]
    return shuffled, perm
```
Colorization: Convert to grayscale, predict colors:
```python
class ColorizationNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = ResNetEncoder()
        self.decoder = UNetDecoder()

    def forward(self, grayscale_image):
        features = self.encoder(grayscale_image)
        ab_channels = self.decoder(features)  # Predict a, b in Lab color space
        return ab_channels
```
Inpainting: Mask regions, predict missing content:
```python
def create_inpainting_task(image, mask_ratio=0.25):
    """Create masked image for inpainting."""
    mask = create_random_mask(image.shape, mask_ratio)
    masked_image = image * (1 - mask)
    return masked_image, image, mask
```
Contrastive Learning
Contrastive learning has become the dominant paradigm for visual self-supervision. The core idea: learn representations where similar samples are close and dissimilar samples are far apart.
The Contrastive Learning Framework
- Create two augmented views of each image
- Encode both views
- Pull together representations of the same image (positives)
- Push apart representations of different images (negatives)
SimCLR
SimCLR (Simple Framework for Contrastive Learning) established a strong baseline:
```python
class SimCLR(nn.Module):
    def __init__(self, encoder, projection_dim=128, temperature=0.5):
        super().__init__()
        self.encoder = encoder
        self.projector = nn.Sequential(
            nn.Linear(encoder.output_dim, encoder.output_dim),
            nn.ReLU(),
            nn.Linear(encoder.output_dim, projection_dim)
        )
        self.temperature = temperature

    def forward(self, x1, x2):
        # Encode both views
        h1 = self.encoder(x1)
        h2 = self.encoder(x2)
        # Project to contrastive space
        z1 = self.projector(h1)
        z2 = self.projector(h2)
        return z1, z2

    def contrastive_loss(self, z1, z2):
        batch_size = z1.size(0)
        # Normalize
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        # Concatenate all projections
        z = torch.cat([z1, z2], dim=0)
        # Compute similarity matrix
        sim = torch.mm(z, z.t()) / self.temperature
        # Create labels (positive pairs)
        labels = torch.arange(batch_size, device=z.device)
        labels = torch.cat([labels + batch_size, labels])
        # Mask self-similarity
        mask = torch.eye(2 * batch_size, device=z.device).bool()
        sim.masked_fill_(mask, -float('inf'))
        # NT-Xent loss
        loss = F.cross_entropy(sim, labels)
        return loss
```
Data Augmentation in Contrastive Learning
Strong augmentations are crucial. SimCLR uses:
```python
contrastive_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([
        transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)
    ], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])
```
MoCo (Momentum Contrast)
MoCo addresses the need for a large pool of negatives by maintaining a queue of encoded keys and a momentum-updated key encoder:
```python
import copy

class MoCo(nn.Module):
    def __init__(self, encoder, dim=128, K=65536, m=0.999, T=0.07):
        super().__init__()
        self.K = K  # Queue size
        self.m = m  # Momentum coefficient
        self.T = T  # Temperature
        # Query encoder
        self.encoder_q = encoder
        self.projector_q = nn.Sequential(
            nn.Linear(encoder.output_dim, dim)
        )
        # Key encoder (momentum updated)
        self.encoder_k = copy.deepcopy(encoder)
        self.projector_k = copy.deepcopy(self.projector_q)
        # Freeze key encoder
        for param in self.encoder_k.parameters():
            param.requires_grad = False
        for param in self.projector_k.parameters():
            param.requires_grad = False
        # Queue of negative samples
        self.register_buffer("queue", torch.randn(dim, K))
        self.queue = F.normalize(self.queue, dim=0)
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def momentum_update(self):
        """Update key encoder and projector with momentum."""
        for param_q, param_k in zip(
            list(self.encoder_q.parameters()) + list(self.projector_q.parameters()),
            list(self.encoder_k.parameters()) + list(self.projector_k.parameters())
        ):
            param_k.data = param_k.data * self.m + param_q.data * (1 - self.m)

    @torch.no_grad()
    def update_queue(self, keys):
        """Enqueue the newest keys, overwriting the oldest entries
        (assumes K is divisible by the batch size)."""
        batch_size = keys.size(0)
        ptr = int(self.queue_ptr)
        self.queue[:, ptr:ptr + batch_size] = keys.t()
        self.queue_ptr[0] = (ptr + batch_size) % self.K

    def forward(self, x_q, x_k):
        # Query features
        q = self.projector_q(self.encoder_q(x_q))
        q = F.normalize(q, dim=1)
        # Key features (no gradient)
        with torch.no_grad():
            self.momentum_update()
            k = self.projector_k(self.encoder_k(x_k))
            k = F.normalize(k, dim=1)
        # Positive logits: Nx1
        l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)
        # Negative logits: NxK
        l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])
        # Logits: Nx(1+K)
        logits = torch.cat([l_pos, l_neg], dim=1) / self.T
        # Labels: positive is always first
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
        # Update queue
        self.update_queue(k)
        return F.cross_entropy(logits, labels)
```
BYOL (Bootstrap Your Own Latent)
BYOL shows that negative samples aren't strictly necessary:
```python
class BYOL(nn.Module):
    def __init__(self, encoder, projection_dim=256, hidden_dim=4096):
        super().__init__()
        # Online network
        self.online_encoder = encoder
        self.online_projector = nn.Sequential(
            nn.Linear(encoder.output_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, projection_dim)
        )
        self.predictor = nn.Sequential(
            nn.Linear(projection_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, projection_dim)
        )
        # Target network (momentum updated)
        self.target_encoder = copy.deepcopy(encoder)
        self.target_projector = copy.deepcopy(self.online_projector)
        for param in self.target_encoder.parameters():
            param.requires_grad = False
        for param in self.target_projector.parameters():
            param.requires_grad = False

    @torch.no_grad()
    def momentum_update(self, tau=0.99):
        """EMA update of the target network; call after each optimizer step."""
        online_params = list(self.online_encoder.parameters()) + \
            list(self.online_projector.parameters())
        target_params = list(self.target_encoder.parameters()) + \
            list(self.target_projector.parameters())
        for p_online, p_target in zip(online_params, target_params):
            p_target.data = p_target.data * tau + p_online.data * (1 - tau)

    def forward(self, x1, x2):
        # Online predictions
        online_proj_1 = self.online_projector(self.online_encoder(x1))
        online_proj_2 = self.online_projector(self.online_encoder(x2))
        online_pred_1 = self.predictor(online_proj_1)
        online_pred_2 = self.predictor(online_proj_2)
        # Target projections (no gradient)
        with torch.no_grad():
            target_proj_1 = self.target_projector(self.target_encoder(x1))
            target_proj_2 = self.target_projector(self.target_encoder(x2))
        # Symmetrized loss
        loss = self.regression_loss(online_pred_1, target_proj_2.detach())
        loss += self.regression_loss(online_pred_2, target_proj_1.detach())
        return loss.mean()

    def regression_loss(self, x, y):
        x = F.normalize(x, dim=-1)
        y = F.normalize(y, dim=-1)
        return 2 - 2 * (x * y).sum(dim=-1)
```
Masked Image Modeling
Inspired by BERT's success, masked image modeling has emerged as a powerful approach.
Masked Autoencoders (MAE)
MAE masks random patches and reconstructs the pixels:
```python
class MAE(nn.Module):
    def __init__(self, encoder, decoder, mask_ratio=0.75):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.mask_ratio = mask_ratio

    def random_masking(self, x, mask_ratio):
        N, L, D = x.shape  # batch, length, dim
        len_keep = int(L * (1 - mask_ratio))
        # Random shuffle
        noise = torch.rand(N, L, device=x.device)
        ids_shuffle = torch.argsort(noise, dim=1)
        ids_restore = torch.argsort(ids_shuffle, dim=1)
        # Keep first len_keep patches
        ids_keep = ids_shuffle[:, :len_keep]
        x_masked = torch.gather(x, dim=1,
                                index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
        # Generate mask
        mask = torch.ones([N, L], device=x.device)
        mask[:, :len_keep] = 0
        mask = torch.gather(mask, dim=1, index=ids_restore)
        return x_masked, mask, ids_restore

    def forward(self, x):
        # Patchify
        patches = self.patchify(x)
        # Embed patches
        x = self.patch_embed(patches)
        # Random masking
        x_masked, mask, ids_restore = self.random_masking(x, self.mask_ratio)
        # Encode visible patches
        latent = self.encoder(x_masked)
        # Decode full set of patches
        pred = self.decoder(latent, ids_restore)
        # Loss only on masked patches
        loss = self.reconstruction_loss(patches, pred, mask)
        return loss, pred, mask
```
BEiT (BERT Pre-Training of Image Transformers)
BEiT discretizes image patches into tokens using a learned vocabulary:
```python
class BEiT(nn.Module):
    def __init__(self, encoder, tokenizer, vocab_size=8192):
        super().__init__()
        self.encoder = encoder
        self.tokenizer = tokenizer  # dVAE to convert patches to tokens
        self.vocab_size = vocab_size
        self.lm_head = nn.Linear(encoder.output_dim, vocab_size)

    def forward(self, x, mask):
        # Get discrete tokens for all patches
        with torch.no_grad():
            tokens = self.tokenizer.tokenize(x)
        # Mask some patches
        x_masked = self.apply_mask(x, mask)
        # Encode
        hidden = self.encoder(x_masked)
        # Predict tokens for masked patches
        predictions = self.lm_head(hidden)
        # Cross-entropy loss on masked positions
        loss = F.cross_entropy(
            predictions[mask].view(-1, self.vocab_size),
            tokens[mask].view(-1)
        )
        return loss
```
Self-Supervised Learning for Other Modalities
Audio
Wav2Vec 2.0: Contrastive learning on speech
- Mask portions of audio
- Predict masked segments from context
- Quantize audio into discrete units
Audio MAE: Masked autoencoding for spectrograms
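The span-masking step in Wav2Vec 2.0-style pretraining can be sketched without any audio machinery. `mask_time_spans`, its defaults, and the start-probability scheme are illustrative (the real model masks latent feature frames, sampling roughly 6.5% of frames as span starts of length 10):

```python
import random

def mask_time_spans(num_frames, mask_prob=0.065, span=10, seed=0):
    """Sample each frame as a span start with probability mask_prob,
    then mask `span` consecutive frames from every sampled start.
    Returns a boolean list where True marks a masked frame."""
    rng = random.Random(seed)
    mask = [False] * num_frames
    for t in range(num_frames - span + 1):
        if rng.random() < mask_prob:
            for j in range(t, t + span):
                mask[j] = True
    return mask
```

The masked frames are then replaced by a learned embedding, and the model is trained contrastively to identify the true quantized frame among distractors.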
Video
VideoMAE: Mask and reconstruct video frames
TimeSformer: Self-attention across space and time
Contrastive Video Representation Learning: Multiple views of the same video
Multimodal
CLIP: Contrastive learning between images and text
```python
class CLIP(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.image_projection = nn.Linear(image_encoder.output_dim, embed_dim)
        self.text_projection = nn.Linear(text_encoder.output_dim, embed_dim)
        # Learnable logit scale, initialized to log(1/0.07) as in CLIP
        self.temperature = nn.Parameter(torch.log(torch.tensor(1 / 0.07)))

    def forward(self, images, texts):
        # Encode
        image_features = self.image_projection(self.image_encoder(images))
        text_features = self.text_projection(self.text_encoder(texts))
        # Normalize
        image_features = F.normalize(image_features, dim=-1)
        text_features = F.normalize(text_features, dim=-1)
        # Contrastive loss, symmetrized over both directions
        logits = torch.matmul(image_features, text_features.t()) * self.temperature.exp()
        labels = torch.arange(len(images), device=logits.device)
        loss_i2t = F.cross_entropy(logits, labels)
        loss_t2i = F.cross_entropy(logits.t(), labels)
        return (loss_i2t + loss_t2i) / 2
```
Evaluation and Transfer
Linear Probing
Freeze the pretrained encoder, train only a linear classifier:
```python
def linear_probe(pretrained_encoder, train_data, num_classes):
    # Freeze encoder
    for param in pretrained_encoder.parameters():
        param.requires_grad = False
    # Add linear classifier
    classifier = nn.Linear(pretrained_encoder.output_dim, num_classes)
    # Train only the classifier
    optimizer = optim.Adam(classifier.parameters())
    for images, labels in train_data:
        with torch.no_grad():
            features = pretrained_encoder(images)
        logits = classifier(features)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Fine-Tuning
Unfreeze and train the entire network on downstream task:
```python
def fine_tune(pretrained_encoder, train_data, num_classes):
    # Add classifier head
    model = nn.Sequential(
        pretrained_encoder,
        nn.Linear(pretrained_encoder.output_dim, num_classes)
    )
    # Lower learning rate for pretrained layers
    optimizer = optim.Adam([
        {'params': pretrained_encoder.parameters(), 'lr': 1e-5},
        {'params': model[-1].parameters(), 'lr': 1e-3}
    ])
    # Train
    for images, labels in train_data:
        logits = model(images)
        loss = F.cross_entropy(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```
Few-Shot Evaluation
Test with very limited labeled data (1-10 examples per class):
```python
def few_shot_evaluate(encoder, support_set, query_set):
    """Nearest neighbor classification for few-shot learning."""
    with torch.no_grad():
        # Encode support set
        support_features = encoder(support_set['images'])
        support_features = F.normalize(support_features, dim=1)
        # Encode query set
        query_features = encoder(query_set['images'])
        query_features = F.normalize(query_features, dim=1)
        # Nearest neighbor prediction
        similarity = torch.mm(query_features, support_features.t())
        predictions = support_set['labels'][similarity.argmax(dim=1)]
        accuracy = (predictions == query_set['labels']).float().mean()
    return accuracy
```
Best Practices
Training Tips
- Large batch sizes: Contrastive learning benefits from more negatives
- Strong augmentation: Critical for contrastive methods
- Long training: Self-supervised methods often need 100-400 epochs
- Proper momentum: For momentum-based methods, 0.99-0.999 works well
- Learning rate warmup: Gradual warmup stabilizes training
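The warmup tip can be made concrete with a schedule function. Linear warmup into cosine decay is a common choice for self-supervised pretraining; the step counts and base rate below are placeholders:

```python
import math

def lr_at_step(step, base_lr=1e-3, warmup_steps=1000, total_steps=100_000):
    """Linear warmup for warmup_steps, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

In PyTorch this maps onto `torch.optim.lr_scheduler.LambdaLR` by dividing out `base_lr` and passing the result as the multiplier function.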
Architecture Choices
- Projection heads: Use MLP projectors, discard after pretraining
- Batch normalization: Important for avoiding collapse
- Large encoders: Self-supervision scales well with model size
Common Pitfalls
Representation Collapse: All representations become identical
- Solution: Use asymmetric architectures, stop gradients, or negatives
Shortcut Learning: Model learns trivial features
- Solution: Strong augmentations, careful pretext task design
Overfitting to Pretext Task: Representations don’t transfer
- Solution: Evaluate on downstream tasks, adjust pretext complexity
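The collapse pitfall has a cheap diagnostic: track the spread of a batch of embeddings during training. `collapse_score` is a hypothetical helper, shown in plain Python for clarity:

```python
import math

def collapse_score(embeddings):
    """Mean per-dimension standard deviation over a batch of embedding
    vectors (lists of floats). A score near zero means every input maps
    to (nearly) the same point, i.e. the representation has collapsed."""
    n, dim = len(embeddings), len(embeddings[0])
    stds = []
    for d in range(dim):
        col = [e[d] for e in embeddings]
        mu = sum(col) / n
        var = sum((v - mu) ** 2 for v in col) / n
        stds.append(math.sqrt(var))
    return sum(stds) / dim
```

Logging this on the projector outputs makes a collapsing run visible long before any downstream evaluation would catch it.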
Conclusion
Self-supervised learning has transformed how we train neural networks, enabling models to learn from the vast amounts of unlabeled data available. From language modeling in NLP to contrastive learning and masked autoencoders in vision, these techniques have achieved remarkable results.
Key takeaways:
- Learning from data structure: Self-supervision creates labels from data itself
- Pretext tasks matter: Task design significantly affects representation quality
- Contrastive learning: Learn by comparing similar and dissimilar samples
- Masked prediction: Reconstruct masked portions of input
- Transfer is key: Success measured by downstream task performance
- Scale is important: Benefits from large models and extensive pretraining
As data continues to grow faster than our ability to label it, self-supervised learning will only become more important. Understanding these techniques is essential for anyone working in modern AI.