Few-shot learning represents one of the most practical and challenging frontiers in machine learning. While deep learning has achieved remarkable success, it typically requires thousands or millions of labeled examples. Few-shot learning tackles the realistic scenario where only a handful of examples are available for new classes. This comprehensive guide explores the techniques, algorithms, and practical applications of few-shot learning.
The Few-Shot Learning Problem
Definition and Motivation
Few-shot learning aims to learn new concepts from very few examples:
- 1-shot learning: Learn from a single example per class
- 5-shot learning: Learn from five examples per class
- Zero-shot learning: Learn without any examples (using auxiliary information)
This mirrors human cognitive abilities—we can recognize a new animal after seeing just one picture, or understand a new word from a single definition.
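Zero-shot classification is usually implemented by comparing image embeddings against auxiliary class embeddings (attribute vectors or text embeddings). A minimal sketch, where the function name and the source of the embeddings are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def zero_shot_predict(image_features, class_embeddings):
    """Assign each image to the class whose auxiliary embedding
    has the highest cosine similarity with the image embedding."""
    img = F.normalize(image_features, dim=1)
    cls = F.normalize(class_embeddings, dim=1)
    return (img @ cls.t()).argmax(dim=1)
```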
Why Traditional Deep Learning Fails
Deep neural networks struggle with limited data for several reasons:
- High capacity: Modern networks have millions of parameters that overfit easily
- No prior structure: They learn everything from scratch without leveraging prior knowledge
- Gradient-based optimization: Requires many iterations to converge
- Data augmentation limits: Can only stretch limited data so far
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Demonstration of overfitting with limited data.
# (torch, torch.nn as nn, and torch.nn.functional as F are assumed
# imported in all the examples below.)
def train_with_limited_data(model, few_examples):
    optimizer = torch.optim.Adam(model.parameters())
    for epoch in range(1000):
        outputs = model(few_examples['images'])
        loss = F.cross_entropy(outputs, few_examples['labels'])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Training accuracy quickly reaches 100%,
    # but test accuracy remains poor - overfitting!
```
The N-way K-shot Setting
Few-shot learning is typically evaluated in the N-way K-shot setting:
- N-way: N new classes to distinguish
- K-shot: K examples per class in the support set
- Query set: Examples to classify after seeing the support set
```python
# 5-way 1-shot example
n_way = 5     # 5 classes to distinguish
k_shot = 1    # 1 example per class
n_query = 15  # 15 query examples per class to classify

# Total: 5 support examples, 75 query examples
support_set_size = n_way * k_shot   # 5
query_set_size = n_way * n_query    # 75
```
Transfer Learning Approaches
Pretrain and Fine-tune
The simplest approach: pretrain on a large dataset, then fine-tune on the few available examples.
```python
class PretrainFinetune:
    def __init__(self, pretrained_model, num_new_classes):
        self.encoder = pretrained_model.encoder
        # Replace the classifier head
        self.classifier = nn.Linear(
            self.encoder.output_dim,
            num_new_classes
        )
        # Freeze the encoder initially
        for param in self.encoder.parameters():
            param.requires_grad = False

    def finetune(self, support_set, epochs=100):
        images, labels = support_set
        # Unfreeze the last few layers
        for param in self.encoder.layer4.parameters():
            param.requires_grad = True
        optimizer = torch.optim.Adam([
            {'params': self.encoder.layer4.parameters(), 'lr': 1e-5},
            {'params': self.classifier.parameters(), 'lr': 1e-3}
        ])
        for epoch in range(epochs):
            features = self.encoder(images)
            outputs = self.classifier(features)
            loss = F.cross_entropy(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
Feature Extraction and Nearest Neighbor
Use pretrained features directly with a simple classifier:
```python
class FeatureNN:
    def __init__(self, pretrained_encoder):
        self.encoder = pretrained_encoder
        self.encoder.eval()

    def predict(self, support_set, query_images):
        support_images, support_labels = support_set
        with torch.no_grad():
            # Extract features
            support_features = self.encoder(support_images)
            query_features = self.encoder(query_images)
            # Normalize
            support_features = F.normalize(support_features, dim=1)
            query_features = F.normalize(query_features, dim=1)
            # Compute cosine similarity
            similarity = query_features @ support_features.t()
            # Predict each query as its nearest neighbor's label
            nn_indices = similarity.argmax(dim=1)
            predictions = support_labels[nn_indices]
        return predictions
```
Linear Probing
Train only a linear classifier on frozen features:
```python
class LinearProbe:
    def __init__(self, encoder, num_classes):
        self.encoder = encoder
        self.encoder.eval()
        self.classifier = nn.Linear(encoder.output_dim, num_classes)

    def fit(self, support_set, epochs=100):
        images, labels = support_set
        # Extract features once
        with torch.no_grad():
            features = self.encoder(images)
        # Train the linear classifier with L-BFGS
        optimizer = torch.optim.LBFGS(self.classifier.parameters())

        def closure():
            optimizer.zero_grad()
            outputs = self.classifier(features)
            loss = F.cross_entropy(outputs, labels)
            loss.backward()
            return loss

        for _ in range(epochs):
            optimizer.step(closure)

    def predict(self, query_images):
        with torch.no_grad():
            features = self.encoder(query_images)
            return self.classifier(features).argmax(dim=1)
```
Metric Learning for Few-Shot
Prototypical Networks
Create class prototypes from support examples:
```python
class PrototypicalNetwork(nn.Module):
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def compute_prototypes(self, support_features, support_labels):
        n_way = int(support_labels.max()) + 1
        # new_zeros keeps the prototypes on the same device as the features
        prototypes = support_features.new_zeros(n_way, support_features.size(-1))
        for c in range(n_way):
            mask = (support_labels == c)
            prototypes[c] = support_features[mask].mean(dim=0)
        return prototypes

    def forward(self, support_images, support_labels, query_images):
        # Encode
        support_features = self.encoder(support_images)
        query_features = self.encoder(query_images)
        # Compute prototypes (mean of each class's support features)
        prototypes = self.compute_prototypes(support_features, support_labels)
        # Euclidean distances from queries to prototypes
        dists = torch.cdist(query_features, prototypes)
        # Return negative distances as logits
        return -dists

    def predict(self, support_set, query_images):
        support_images, support_labels = support_set
        logits = self.forward(support_images, support_labels, query_images)
        return logits.argmax(dim=1)
```
Matching Networks with Full Context Embeddings
Use attention over support set with context-aware embeddings:
```python
class MatchingNetwork(nn.Module):
    def __init__(self, encoder, hidden_dim):
        super().__init__()
        self.encoder = encoder
        # Bidirectional LSTM for the full context support embedding.
        # Its output has size 2 * hidden_dim, so choose
        # hidden_dim = encoder.output_dim // 2 to keep dimensions aligned.
        self.support_lstm = nn.LSTM(
            encoder.output_dim, hidden_dim,
            bidirectional=True, batch_first=True
        )
        # The query cell reads [query features; attention readout],
        # hence the doubled input size.
        self.query_lstm = nn.LSTMCell(
            encoder.output_dim * 2, encoder.output_dim
        )

    def full_context_embed_support(self, support_features):
        """Use the BiLSTM to create context-aware support embeddings."""
        # support_features: [N*K, D], treated as one sequence
        output, _ = self.support_lstm(support_features.unsqueeze(0))
        return output.squeeze(0)

    def full_context_embed_query(self, query_features, support_context):
        """Attentive embedding of queries conditioned on the support context."""
        h = torch.zeros_like(query_features)
        c = torch.zeros_like(query_features)
        embedded = query_features
        for step in range(3):  # multiple attention steps
            # Attention over the support set
            attn = F.softmax(embedded @ support_context.t(), dim=1)
            read = attn @ support_context
            # LSTM step with a residual connection
            h, c = self.query_lstm(
                torch.cat([query_features, read], dim=1), (h, c)
            )
            embedded = h + query_features
        return embedded

    def forward(self, support_images, support_labels, query_images):
        # Initial embeddings
        support_features = self.encoder(support_images)
        query_features = self.encoder(query_images)
        # Full context embeddings
        support_context = self.full_context_embed_support(support_features)
        query_context = self.full_context_embed_query(
            query_features, support_context
        )
        # Attention-based classification
        attn = F.softmax(query_context @ support_context.t(), dim=1)
        # Weighted sum of one-hot support labels
        support_onehot = F.one_hot(support_labels).float()
        predictions = attn @ support_onehot
        return predictions
```
Induction Networks
Learn class-level representations through induction:
```python
class InductionNetwork(nn.Module):
    def __init__(self, encoder, relation_dim):
        super().__init__()
        self.encoder = encoder
        # Induction module: aggregate per-class features
        self.induction = nn.Sequential(
            nn.Linear(encoder.output_dim, relation_dim),
            nn.ReLU(),
            nn.Linear(relation_dim, encoder.output_dim)
        )
        # Relation module: score (query, class) pairs
        self.relation = nn.Sequential(
            nn.Linear(encoder.output_dim * 2, relation_dim),
            nn.ReLU(),
            nn.Linear(relation_dim, 1),
            nn.Sigmoid()
        )

    def forward(self, support_images, support_labels, query_images):
        support_features = self.encoder(support_images)
        query_features = self.encoder(query_images)
        n_way = int(support_labels.max()) + 1
        n_query = query_images.size(0)
        # Induce class representations
        class_vectors = []
        for c in range(n_way):
            mask = (support_labels == c)
            class_features = support_features[mask]
            # Dynamic routing or attention-based aggregation
            # (mean-then-transform used here for simplicity)
            induced = self.induction(class_features.mean(dim=0))
            class_vectors.append(induced)
        class_vectors = torch.stack(class_vectors)  # [N, D]
        # Compute relation scores
        relations = query_features.new_zeros(n_query, n_way)
        for i, query in enumerate(query_features):
            for c, class_vec in enumerate(class_vectors):
                pair = torch.cat([query, class_vec])
                relations[i, c] = self.relation(pair)
        return relations
```
Data Augmentation for Few-Shot
Traditional Augmentation
Apply heavy augmentation to stretch limited data:
```python
from torchvision import transforms

few_shot_augmentation = transforms.Compose([
    transforms.RandomResizedCrop(84, scale=(0.5, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(
        brightness=0.4, contrast=0.4,
        saturation=0.4, hue=0.1
    ),
    transforms.RandomGrayscale(p=0.1),
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])

def augment_support_set(support_set, augmentation, n_augment=10):
    """Create augmented versions of support examples."""
    images, labels = support_set
    augmented_images = []
    augmented_labels = []
    for img, label in zip(images, labels):
        for _ in range(n_augment):
            aug_img = augmentation(img)
            augmented_images.append(aug_img)
            augmented_labels.append(label)
    return torch.stack(augmented_images), torch.tensor(augmented_labels)
```
Learned Augmentation
Learn task-specific augmentations:
```python
class MetaAugmentation(nn.Module):
    """Learn to generate useful augmented examples in feature space."""
    def __init__(self, feature_dim, augment_dim):
        super().__init__()
        self.augment_dim = augment_dim
        # Transformation network
        self.transform_net = nn.Sequential(
            nn.Linear(feature_dim + augment_dim, feature_dim),
            nn.ReLU(),
            nn.Linear(feature_dim, feature_dim)
        )

    def forward(self, features, n_augment=5):
        batch_size = features.size(0)
        augmented = [features]
        for _ in range(n_augment):
            # Random augmentation code (on the same device as the features)
            z = torch.randn(batch_size, self.augment_dim, device=features.device)
            # Generate augmented features
            input_aug = torch.cat([features, z], dim=1)
            aug_features = self.transform_net(input_aug)
            augmented.append(aug_features)
        return torch.cat(augmented, dim=0)
```
Hallucination Networks
Generate synthetic examples from support set:
```python
class HallucinationNetwork(nn.Module):
    """Generate synthetic examples for each class."""
    def __init__(self, encoder, generator):
        super().__init__()
        self.encoder = encoder
        self.generator = generator

    def hallucinate(self, support_images, support_labels, n_hallucinate=10):
        support_features = self.encoder(support_images)
        n_way = int(support_labels.max()) + 1
        hallucinated_features = []
        hallucinated_labels = []
        for c in range(n_way):
            mask = (support_labels == c)
            class_features = support_features[mask]
            # Generate synthetic features around the class mean
            for _ in range(n_hallucinate):
                noise = torch.randn_like(class_features[0])
                synthetic = self.generator(class_features.mean(0), noise)
                hallucinated_features.append(synthetic)
                hallucinated_labels.append(c)
        return (
            torch.stack(hallucinated_features),
            torch.tensor(hallucinated_labels)
        )
```
Transductive Few-Shot Learning
Use query set statistics during inference:
```python
class TransductiveFewShot(nn.Module):
    """Leverage unlabeled query examples for better predictions."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder

    def forward(self, support_images, support_labels, query_images,
                n_iterations=10):
        # Encode all examples
        support_features = self.encoder(support_images)
        query_features = self.encoder(query_images)
        n_way = int(support_labels.max()) + 1
        # Initial prototypes from the support set
        prototypes = self.compute_prototypes(
            support_features, support_labels, n_way
        )
        # Iteratively refine with the query set (soft labels)
        for iteration in range(n_iterations):
            # Compute soft assignments for queries
            dists = torch.cdist(query_features, prototypes)
            soft_labels = F.softmax(-dists, dim=1)
            # Recompute prototypes including soft-labeled queries
            new_prototypes = []
            for c in range(n_way):
                # Support contribution
                support_mask = (support_labels == c)
                support_contrib = support_features[support_mask].sum(0)
                support_count = support_mask.sum()
                # Query contribution (weighted by soft labels)
                query_weights = soft_labels[:, c]
                query_contrib = (query_features * query_weights.unsqueeze(1)).sum(0)
                query_count = query_weights.sum()
                # Combined prototype
                new_proto = (support_contrib + query_contrib) / (
                    support_count + query_count
                )
                new_prototypes.append(new_proto)
            prototypes = torch.stack(new_prototypes)
        # Final predictions
        dists = torch.cdist(query_features, prototypes)
        return -dists

    def compute_prototypes(self, features, labels, n_way):
        prototypes = []
        for c in range(n_way):
            mask = (labels == c)
            prototypes.append(features[mask].mean(0))
        return torch.stack(prototypes)
```
Label Propagation
Propagate labels through feature similarity graph:
```python
class LabelPropagation:
    def __init__(self, encoder, alpha=0.5, n_iterations=20):
        self.encoder = encoder
        self.alpha = alpha
        self.n_iterations = n_iterations

    def predict(self, support_set, query_images):
        support_images, support_labels = support_set
        n_support = support_images.size(0)
        n_query = query_images.size(0)
        n_way = int(support_labels.max()) + 1
        # Encode all examples
        all_images = torch.cat([support_images, query_images])
        with torch.no_grad():
            all_features = self.encoder(all_images)
        all_features = F.normalize(all_features, dim=1)
        # Build the affinity matrix
        W = all_features @ all_features.t()
        W = F.softmax(W / 0.1, dim=1)  # temperature-scaled softmax
        # Initialize labels (one-hot for support, zeros for query)
        Y = all_features.new_zeros(n_support + n_query, n_way)
        for i, label in enumerate(support_labels):
            Y[i, label] = 1.0
        # Label propagation iterations
        for _ in range(self.n_iterations):
            Y_new = self.alpha * W @ Y + (1 - self.alpha) * Y
            # Clamp support labels back to their one-hot values
            for i, label in enumerate(support_labels):
                Y_new[i] = 0
                Y_new[i, label] = 1.0
            Y = Y_new
        # Extract query predictions
        query_preds = Y[n_support:].argmax(dim=1)
        return query_preds
```
Cross-Domain Few-Shot Learning
Handle domain shift between training and testing:
```python
class CrossDomainFewShot(nn.Module):
    def __init__(self, encoder, domain_adapter=None):
        super().__init__()
        self.encoder = encoder
        # Fall back to simple statistics matching if no adapter is given
        self.domain_adapter = domain_adapter or self.adapt_features

    def adapt_features(self, features, domain_stats):
        """Adapt features to target domain statistics."""
        # Compute current statistics
        mean = features.mean(dim=0)
        std = features.std(dim=0)
        # Normalize
        normalized = (features - mean) / (std + 1e-5)
        # Apply target domain statistics
        adapted = normalized * domain_stats['std'] + domain_stats['mean']
        return adapted

    def compute_prototypes(self, features, labels):
        n_way = int(labels.max()) + 1
        return torch.stack([
            features[labels == c].mean(0) for c in range(n_way)
        ])

    def forward(self, support_images, support_labels, query_images):
        # Encode
        support_features = self.encoder(support_images)
        query_features = self.encoder(query_images)
        # Estimate target domain statistics from support + query
        all_features = torch.cat([support_features, query_features])
        domain_stats = {
            'mean': all_features.mean(dim=0),
            'std': all_features.std(dim=0)
        }
        # Domain adaptation
        adapted_support = self.domain_adapter(support_features, domain_stats)
        adapted_query = self.domain_adapter(query_features, domain_stats)
        # Prototypical classification
        prototypes = self.compute_prototypes(adapted_support, support_labels)
        dists = torch.cdist(adapted_query, prototypes)
        return -dists
```
Practical Implementation
Training Pipeline
```python
import numpy as np

def train_few_shot_model(model, train_dataset, val_dataset, config):
    optimizer = torch.optim.Adam(model.parameters(), lr=config['lr'])
    scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer, step_size=20, gamma=0.5
    )
    best_accuracy = 0
    for epoch in range(config['epochs']):
        model.train()
        train_loss = 0
        for episode in range(config['episodes_per_epoch']):
            # Sample an N-way K-shot task
            task = sample_task(
                train_dataset,
                config['n_way'],
                config['k_shot'],
                config['n_query']
            )
            # Forward pass
            logits = model(
                task.support_images,
                task.support_labels,
                task.query_images
            )
            loss = F.cross_entropy(logits, task.query_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            train_loss += loss.item()
        scheduler.step()
        # Validation
        val_accuracy = evaluate(
            model, val_dataset,
            config['n_way'], config['k_shot'],
            num_episodes=600
        )
        if val_accuracy > best_accuracy:
            best_accuracy = val_accuracy
            torch.save(model.state_dict(), 'best_model.pt')
        print(f"Epoch {epoch}: Loss={train_loss/config['episodes_per_epoch']:.4f}, "
              f"Val Acc={val_accuracy:.2f}%")

def evaluate(model, dataset, n_way, k_shot, num_episodes=1000):
    model.eval()
    accuracies = []
    with torch.no_grad():
        for _ in range(num_episodes):
            task = sample_task(dataset, n_way, k_shot, n_query=15)
            predictions = model.predict(
                (task.support_images, task.support_labels),
                task.query_images
            )
            accuracy = (predictions == task.query_labels).float().mean()
            accuracies.append(accuracy.item())
    return np.mean(accuracies) * 100
```
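The pipeline above assumes a `sample_task` helper that draws random episodes. A minimal sketch, assuming the dataset is a mapping from class labels to lists of image tensors (the `Task` container and that dataset layout are our assumptions, not part of the original pipeline):

```python
import random
from dataclasses import dataclass

import torch

@dataclass
class Task:
    support_images: torch.Tensor
    support_labels: torch.Tensor
    query_images: torch.Tensor
    query_labels: torch.Tensor

def sample_task(dataset, n_way, k_shot, n_query):
    """Sample one N-way K-shot episode from a {class: [image tensors]} dict."""
    classes = random.sample(list(dataset.keys()), n_way)
    support_x, support_y, query_x, query_y = [], [], [], []
    # Relabel the sampled classes 0..n_way-1 within the episode
    for new_label, c in enumerate(classes):
        images = random.sample(dataset[c], k_shot + n_query)
        support_x += images[:k_shot]
        support_y += [new_label] * k_shot
        query_x += images[k_shot:]
        query_y += [new_label] * n_query
    return Task(
        torch.stack(support_x), torch.tensor(support_y),
        torch.stack(query_x), torch.tensor(query_y),
    )
```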
Ensemble Methods
Combine multiple few-shot models:
```python
class FewShotEnsemble:
    def __init__(self, models):
        self.models = models

    def predict(self, support_set, query_images):
        all_logits = []
        for model in self.models:
            model.eval()
            with torch.no_grad():
                logits = model(
                    support_set[0], support_set[1], query_images
                )
            all_logits.append(F.softmax(logits, dim=1))
        # Average the per-model probabilities
        ensemble_probs = torch.stack(all_logits).mean(dim=0)
        return ensemble_probs.argmax(dim=1)
```
Applications
Medical Imaging
Diagnose rare diseases with few examples:
```python
# Rare disease classification
# Support: 5 scans showing the rare condition
# Query: New patient scans to classify
```
Robotics
Quick adaptation to new objects or tasks:
```python
# Object manipulation
# Support: 3 demonstrations of grasping a new object
# Query: Grasp the object in new orientations
```
Personalized AI
Adapt to user preferences with minimal examples:
```python
# Content recommendation
# Support: User's 5 explicitly rated items
# Query: Predict preferences for unseen items
```
Quality Control
Detect new types of defects:
```python
# Manufacturing defect detection
# Support: 3-5 examples of a new defect type
# Query: Identify defective products
```
Wildlife Monitoring
Identify rare species:
```python
# Species identification
# Support: Few images of an endangered species
# Query: Classify camera trap images
```
Benchmarks and Results
Common Benchmarks
| Dataset | Classes | Images | Typical Results (5-way 5-shot) |
|---------|---------|--------|-------------------------------|
| Omniglot | 1,623 | 32,460 | ~99% |
| miniImageNet | 100 | 60,000 | ~70-80% |
| tieredImageNet | 608 | 779,165 | ~72-82% |
| CIFAR-FS | 100 | 60,000 | ~75-85% |
| CUB-200 | 200 | 11,788 | ~80-88% |
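Benchmark numbers like these are means over hundreds of random test episodes and are conventionally reported with a 95% confidence interval. A small helper for that convention (plain Python, normal approximation; the function name is ours):

```python
import math

def mean_and_ci95(accuracies):
    """Mean episode accuracy and 95% confidence interval (normal approx.)."""
    n = len(accuracies)
    mean = sum(accuracies) / n
    var = sum((a - mean) ** 2 for a in accuracies) / (n - 1)
    ci95 = 1.96 * math.sqrt(var / n)
    return mean, ci95

# Typical usage after, e.g., 600 evaluation episodes:
# mean, ci = mean_and_ci95(episode_accuracies)
# print(f"{mean * 100:.2f} ± {ci * 100:.2f}%")
```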
State-of-the-Art Methods
```python
# Approximate accuracy (%) on miniImageNet (5-way 5-shot)
results = {
    'Prototypical Networks': 68.2,
    'Matching Networks': 65.7,
    'MAML': 63.1,
    'Relation Networks': 67.1,
    'MetaOptNet': 78.6,
    'DeepEMD': 75.6,
    'FEAT': 78.5,
    'SimpleShot (transfer)': 70.1,
}
```
Conclusion
Few-shot learning bridges the gap between data-hungry deep learning and human-like rapid learning. By leveraging meta-learning, metric learning, and careful data augmentation, few-shot methods enable AI systems to generalize from minimal examples.
Key takeaways:
- Problem setting: N-way K-shot classification with support and query sets
- Transfer learning: Pretrained features provide strong baselines
- Metric learning: Learn embedding spaces for comparison-based classification
- Transductive methods: Leverage query set statistics for improvement
- Data augmentation: Critical for stretching limited examples
- Cross-domain: Additional challenges when training and test domains differ
Few-shot learning is essential for deploying AI in domains where labeled data is scarce, expensive, or constantly evolving. As AI systems are expected to handle more diverse and dynamic scenarios, few-shot learning capabilities become increasingly important.