Zero-shot learning represents one of the most ambitious goals in artificial intelligence: enabling machines to recognize and classify objects or concepts they have never seen during training. By leveraging auxiliary information like semantic descriptions or attributes, zero-shot learning systems can generalize to entirely new categories without any labeled examples. This comprehensive guide explores the principles, methods, and applications of zero-shot learning.
## The Zero-Shot Learning Problem

### Defining Zero-Shot Learning
In traditional supervised learning, we train on classes A, B, C and test on the same classes. Zero-shot learning fundamentally changes this:
- Training: Learn from classes A, B, C (seen classes)
- Testing: Recognize classes X, Y, Z (unseen classes)
The key question: How can we recognize something we’ve never seen?
### The Bridge: Auxiliary Information
Zero-shot learning relies on auxiliary information that connects seen and unseen classes:
- Attributes: Describable properties (has stripes, four legs, can fly)
- Text descriptions: Natural language definitions or Wikipedia articles
- Word embeddings: Semantic vectors from language models
- Knowledge graphs: Relationships between concepts
- Class hierarchies: Taxonomic structure
```python
# Example: attribute-based zero-shot learning
class_attributes = {
    'zebra':   [1, 0, 1, 1, 0, 1, 0, 1],  # black, white, stripes, four_legs, ...
    'tiger':   [0, 1, 1, 1, 1, 0, 0, 1],  # orange, striped, carnivore, ...
    'penguin': [1, 1, 0, 0, 0, 0, 1, 0],  # black, white, flightless, ...
}
# We never see a penguin during training, but we know its attributes.
# At test time: match image features to attribute descriptions.
```
### Settings and Evaluation
Conventional Zero-Shot Learning (ZSL): Test only on unseen classes
Generalized Zero-Shot Learning (GZSL): Test on both seen and unseen classes
- More realistic but harder
- Models tend to be biased toward seen classes
```python
# Imports used throughout this guide
import torch
import torch.nn as nn
import torch.nn.functional as F

# Evaluation metrics
def evaluate_zsl(predictions, labels, unseen_classes):
    """Accuracy on unseen classes only."""
    mask = torch.isin(labels, unseen_classes)
    correct = (predictions[mask] == labels[mask]).sum()
    return correct / mask.sum()

def evaluate_gzsl(predictions, labels, seen_classes, unseen_classes):
    """Harmonic mean of seen and unseen accuracy."""
    seen_mask = torch.isin(labels, seen_classes)
    unseen_mask = torch.isin(labels, unseen_classes)
    seen_acc = (predictions[seen_mask] == labels[seen_mask]).float().mean()
    unseen_acc = (predictions[unseen_mask] == labels[unseen_mask]).float().mean()
    harmonic = 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc + 1e-8)
    return seen_acc, unseen_acc, harmonic
```
## Attribute-Based Zero-Shot Learning

### Direct Attribute Prediction (DAP)
Predict attributes from images, then match to class:
```python
class DAP(nn.Module):
    def __init__(self, encoder, num_attributes):
        super().__init__()
        self.encoder = encoder
        self.attribute_predictor = nn.Sequential(
            nn.Linear(encoder.output_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_attributes),
            nn.Sigmoid()
        )

    def forward(self, images):
        features = self.encoder(images)
        predicted_attributes = self.attribute_predictor(features)
        return predicted_attributes

    def predict_class(self, images, class_attributes):
        """Predict class by matching predicted attributes to class signatures."""
        pred_attrs = self.forward(images)  # [batch, num_attributes]
        # Compute compatibility with each class
        # class_attributes: [num_classes, num_attributes]
        compatibility = torch.mm(pred_attrs, class_attributes.t())
        return compatibility.argmax(dim=1)
```
### Indirect Attribute Prediction (IAP)
Predict seen class probabilities, then transfer via attributes:
```python
class IAP(nn.Module):
    def __init__(self, encoder, seen_classes, unseen_classes, attributes):
        super().__init__()
        self.encoder = encoder
        self.seen_classifier = nn.Linear(encoder.output_dim, len(seen_classes))
        # Attribute matrices
        self.seen_attrs = attributes[seen_classes]      # [num_seen, num_attrs]
        self.unseen_attrs = attributes[unseen_classes]  # [num_unseen, num_attrs]

    def forward(self, images):
        features = self.encoder(images)
        seen_logits = self.seen_classifier(features)
        seen_probs = F.softmax(seen_logits, dim=1)
        # Transfer to unseen via attributes:
        # P(unseen) ∝ sum over seen: P(seen) * similarity(seen, unseen)
        similarity = torch.mm(self.seen_attrs, self.unseen_attrs.t())
        similarity = F.normalize(similarity, dim=0)
        unseen_probs = torch.mm(seen_probs, similarity)
        return unseen_probs
```
## Embedding-Based Zero-Shot Learning

### Learning Visual-Semantic Embeddings
Map images and class representations to a shared space:
```python
class VisualSemanticEmbedding(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Projection layers into the shared space
        self.image_proj = nn.Linear(image_encoder.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.output_dim, embed_dim)

    def encode_image(self, image):
        features = self.image_encoder(image)
        return F.normalize(self.image_proj(features), dim=-1)

    def encode_text(self, text):
        features = self.text_encoder(text)
        return F.normalize(self.text_proj(features), dim=-1)

    def forward(self, images, class_descriptions):
        image_embeds = self.encode_image(images)
        class_embeds = self.encode_text(class_descriptions)
        # Compatibility scores
        scores = torch.mm(image_embeds, class_embeds.t())
        return scores
```
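Training such an embedding is straightforward once the score matrix exists. As a rough sketch (assuming each batch of images comes with descriptions covering all seen classes, and `labels` indexes rows of `class_descriptions`; `vse_training_step` is a hypothetical helper, not part of any library):

```python
import torch
import torch.nn.functional as F

def vse_training_step(model, images, class_descriptions, labels, optimizer):
    """One optimization step: push each image toward its own class embedding.

    Assumes model(images, class_descriptions) returns a [batch, num_classes]
    score matrix, as in the sketch above.
    """
    scores = model(images, class_descriptions)  # [batch, num_classes]
    loss = F.cross_entropy(scores, labels)      # true class should score highest
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Cross-entropy over the score matrix treats every other class in the batch as a negative, which is the same idea that contrastive objectives like CLIP's scale up.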
### Word Vector Based ZSL
Use pretrained word embeddings for class names:
```python
import gensim  # provides pretrained word vectors (e.g. word2vec)

class WordVectorZSL(nn.Module):
    def __init__(self, encoder, word_vectors, class_names):
        super().__init__()
        self.encoder = encoder
        # Look up word vectors for all class names
        # (word_vectors: e.g. a gensim KeyedVectors instance)
        self.class_vectors = torch.stack([
            torch.tensor(word_vectors[name])
            for name in class_names
        ])
        # Project visual features into the word vector space
        self.projection = nn.Linear(
            encoder.output_dim,
            self.class_vectors.size(-1)
        )

    def forward(self, images):
        features = self.encoder(images)
        projected = self.projection(features)
        # Cosine similarity to class word vectors
        projected = F.normalize(projected, dim=-1)
        class_vecs = F.normalize(self.class_vectors, dim=-1)
        scores = torch.mm(projected, class_vecs.t())
        return scores
```
### Structured Joint Embedding (SJE)
Bilinear compatibility function:
```python
class SJE(nn.Module):
    def __init__(self, visual_dim, semantic_dim, embed_dim):
        super().__init__()
        # Bilinear compatibility: x^T W s
        self.W = nn.Parameter(torch.randn(visual_dim, semantic_dim))
        nn.init.xavier_uniform_(self.W)

    def compatibility(self, visual_features, semantic_vectors):
        """Compute compatibility scores."""
        # visual_features: [batch, visual_dim]
        # semantic_vectors: [num_classes, semantic_dim]
        # Bilinear: x^T W s
        scores = torch.mm(
            torch.mm(visual_features, self.W),
            semantic_vectors.t()
        )
        return scores

    def forward(self, visual_features, class_semantics, labels):
        scores = self.compatibility(visual_features, class_semantics)
        # Ranking loss: the correct class should score higher
        positive_scores = scores[range(len(labels)), labels]
        # Max-margin loss
        loss = 0
        for i, (score, label) in enumerate(zip(scores, labels)):
            # Hinge loss for each negative class
            margins = score - positive_scores[i] + 1  # margin = 1
            margins[label] = 0  # Don't penalize the positive class
            loss += F.relu(margins).sum()
        return loss / len(labels)
```
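The per-example loop above can be vectorized. The following sketch computes the same max-margin ranking loss directly on the score matrix (`sje_ranking_loss` is a name introduced here for illustration):

```python
import torch
import torch.nn.functional as F

def sje_ranking_loss(scores, labels, margin=1.0):
    """Vectorized max-margin ranking loss over a [batch, num_classes] score matrix."""
    batch_size = scores.size(0)
    # Score of each example's true class, as a column
    positive = scores[torch.arange(batch_size), labels].unsqueeze(1)  # [batch, 1]
    # Hinge margin against every class
    margins = F.relu(scores - positive + margin)
    # Zero out the true class so it is not penalized
    margins[torch.arange(batch_size), labels] = 0.0
    return margins.sum() / batch_size
```

Vectorizing avoids a Python loop per example, which matters once batches are large.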
### ALE (Attribute Label Embedding)
Learn to embed both images and labels:
```python
class ALE(nn.Module):
    def __init__(self, visual_dim, attribute_dim, embed_dim):
        super().__init__()
        self.visual_embed = nn.Linear(visual_dim, embed_dim)
        self.attribute_embed = nn.Linear(attribute_dim, embed_dim)

    def forward(self, visual_features, class_attributes, labels):
        # Embed visual features
        V = self.visual_embed(visual_features)      # [batch, embed]
        # Embed class attributes
        A = self.attribute_embed(class_attributes)  # [num_classes, embed]
        # Compatibility scores
        scores = torch.mm(V, A.t())
        # Classification loss
        return F.cross_entropy(scores, labels)

    def predict(self, visual_features, class_attributes):
        V = self.visual_embed(visual_features)
        A = self.attribute_embed(class_attributes)
        scores = torch.mm(V, A.t())
        return scores.argmax(dim=1)
```
## Generative Approaches

### Feature Generation
Generate visual features for unseen classes:
```python
class FeatureGenerator(nn.Module):
    def __init__(self, semantic_dim, noise_dim, feature_dim):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(semantic_dim + noise_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, feature_dim),
            nn.ReLU()
        )
        self.noise_dim = noise_dim

    def forward(self, semantic_vectors, batch_size=1):
        # Accept a single semantic vector (repeat it) or a batch of them
        if semantic_vectors.dim() == 1:
            semantic_vectors = semantic_vectors.unsqueeze(0).repeat(batch_size, 1)
        # Sample one noise vector per generated feature
        noise = torch.randn(semantic_vectors.size(0), self.noise_dim,
                            device=semantic_vectors.device)
        input_vec = torch.cat([semantic_vectors, noise], dim=1)
        return self.generator(input_vec)


class ConditionalGAN(nn.Module):
    def __init__(self, semantic_dim, noise_dim, feature_dim):
        super().__init__()
        self.generator = FeatureGenerator(semantic_dim, noise_dim, feature_dim)
        self.discriminator = nn.Sequential(
            nn.Linear(feature_dim + semantic_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def train_step(self, real_features, semantic_vectors, labels):
        batch_size = real_features.size(0)
        # Generate fake features conditioned on the same semantics
        fake_features = self.generator(semantic_vectors, batch_size)
        # Discriminator loss: real vs. generated, both conditioned on semantics
        real_input = torch.cat([real_features, semantic_vectors], dim=1)
        fake_input = torch.cat([fake_features.detach(), semantic_vectors], dim=1)
        real_preds = self.discriminator(real_input)
        fake_preds = self.discriminator(fake_input)
        d_loss = -torch.mean(torch.log(real_preds + 1e-8) +
                             torch.log(1 - fake_preds + 1e-8))
        # Generator loss: fool the discriminator
        fake_preds_g = self.discriminator(
            torch.cat([fake_features, semantic_vectors], dim=1)
        )
        g_loss = -torch.mean(torch.log(fake_preds_g + 1e-8))
        return d_loss, g_loss
```
### f-CLSWGAN
f-CLSWGAN pairs a feature-generating WGAN with an auxiliary classification loss that keeps synthesized features discriminative:
```python
class CLSWGAN(nn.Module):
    def __init__(self, attribute_dim, noise_dim, feature_dim, num_classes):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(attribute_dim + noise_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, feature_dim),
            nn.ReLU()
        )
        # Critic (WGAN-GP uses a critic, not a discriminator)
        self.critic = nn.Sequential(
            nn.Linear(feature_dim + attribute_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, 1)
        )
        # Classification head for semantic consistency
        self.classifier = nn.Linear(feature_dim, num_classes)
        self.noise_dim = noise_dim
        self.feature_dim = feature_dim

    def generate(self, attributes, num_samples=100):
        """Generate features for given class attributes."""
        noise = torch.randn(num_samples, self.noise_dim)
        attrs = attributes.unsqueeze(0).repeat(num_samples, 1)
        input_vec = torch.cat([attrs, noise], dim=1)
        return self.generator(input_vec)

    def gradient_penalty(self, real_features, fake_features, attributes):
        """Compute the gradient penalty for WGAN-GP."""
        alpha = torch.rand(real_features.size(0), 1)
        interpolates = alpha * real_features + (1 - alpha) * fake_features
        interpolates.requires_grad_(True)
        critic_interpolates = self.critic(
            torch.cat([interpolates, attributes], dim=1)
        )
        gradients = torch.autograd.grad(
            outputs=critic_interpolates,
            inputs=interpolates,
            grad_outputs=torch.ones_like(critic_interpolates),
            create_graph=True
        )[0]
        return ((gradients.norm(2, dim=1) - 1) ** 2).mean()
```
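Once the generator is trained on seen classes, the payoff of this approach is the final step: synthesize features for the unseen classes and train an ordinary softmax classifier on them. A minimal sketch of that step, assuming a feature generator such as `CLSWGAN.generate` above (the function and its parameters are illustrative, not a fixed API):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_classifier_on_synthetic(generate_features, unseen_attrs, unseen_class_ids,
                                  feature_dim, num_per_class=100, epochs=200, lr=0.1):
    """Train a softmax classifier for unseen classes on generated features.

    generate_features(attr, n) is assumed to return an [n, feature_dim] tensor;
    unseen_attrs is [num_unseen, attr_dim]; unseen_class_ids maps rows to labels.
    """
    # Build a synthetic training set, one batch of features per unseen class
    feats, labels = [], []
    for attr, cls in zip(unseen_attrs, unseen_class_ids):
        feats.append(generate_features(attr, num_per_class))
        labels.append(torch.full((num_per_class,), int(cls)))
    X = torch.cat(feats).detach()
    y = torch.cat(labels)
    # Remap original class ids to 0..num_unseen-1 for the classifier head
    remap = {int(c): i for i, c in enumerate(unseen_class_ids)}
    y = torch.tensor([remap[int(c)] for c in y])
    clf = nn.Linear(feature_dim, len(unseen_class_ids))
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(clf(X), y)
        loss.backward()
        opt.step()
    return clf
```

At test time, real image features are fed to this classifier, so the zero-shot problem reduces to standard supervised classification.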
### VAE-Based Generation
Variational autoencoder for feature generation:
```python
class CVAE(nn.Module):
    """Conditional VAE for zero-shot learning."""
    def __init__(self, feature_dim, attribute_dim, latent_dim):
        super().__init__()
        self.latent_dim = latent_dim
        # Encoder: features + attributes -> latent
        self.encoder = nn.Sequential(
            nn.Linear(feature_dim + attribute_dim, 1024),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(1024, latent_dim)
        self.fc_var = nn.Linear(1024, latent_dim)
        # Decoder: latent + attributes -> features
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + attribute_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, feature_dim)
        )

    def encode(self, features, attributes):
        x = torch.cat([features, attributes], dim=1)
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z, attributes):
        x = torch.cat([z, attributes], dim=1)
        return self.decoder(x)

    def forward(self, features, attributes):
        mu, log_var = self.encode(features, attributes)
        z = self.reparameterize(mu, log_var)
        recon = self.decode(z, attributes)
        return recon, mu, log_var

    def generate(self, attributes, num_samples=100):
        """Generate features for an unseen class."""
        z = torch.randn(num_samples, self.latent_dim)
        attrs = attributes.unsqueeze(0).repeat(num_samples, 1)
        return self.decode(z, attrs)
```
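The forward pass above returns everything needed for the standard CVAE training objective: a reconstruction term plus a KL term pulling the latent posterior toward a unit Gaussian. A sketch of that loss (the `beta` weight is a common but optional extension):

```python
import torch
import torch.nn.functional as F

def cvae_loss(recon, features, mu, log_var, beta=1.0):
    """ELBO-style loss: reconstruction + beta-weighted KL divergence."""
    recon_loss = F.mse_loss(recon, features, reduction='sum') / features.size(0)
    # KL(q(z|x,a) || N(0, I)) in closed form, averaged over the batch
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp()) / features.size(0)
    return recon_loss + beta * kl
```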
## CLIP and Large-Scale Zero-Shot

### CLIP for Zero-Shot Classification
```python
import clip

class CLIPZeroShot:
    def __init__(self, model_name='ViT-B/32'):
        self.model, self.preprocess = clip.load(model_name)
        self.model.eval()

    def classify(self, images, class_names, prompt_template="a photo of a {}"):
        """Zero-shot classification using CLIP."""
        # Create text prompts
        prompts = [prompt_template.format(name) for name in class_names]
        text_tokens = clip.tokenize(prompts)
        with torch.no_grad():
            # Encode images and text
            image_features = self.model.encode_image(images)
            text_features = self.model.encode_text(text_tokens)
            # Normalize
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)
            # Compute similarity
            similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
        return similarity.argmax(dim=1), similarity

    def classify_with_prompts(self, images, class_names):
        """Use multiple prompts for better results."""
        prompt_templates = [
            "a photo of a {}",
            "a picture of a {}",
            "a {} in a photo",
            "an image showing a {}",
            "a photograph of a {}"
        ]
        all_probs = []
        for template in prompt_templates:
            _, probs = self.classify(images, class_names, template)
            all_probs.append(probs)
        # Average probabilities
        avg_probs = torch.stack(all_probs).mean(dim=0)
        return avg_probs.argmax(dim=1), avg_probs
```
### Prompt Engineering for Zero-Shot
```python
class PromptEnsemble:
    """Ensemble multiple prompts for better zero-shot performance."""
    def __init__(self, clip_model):
        self.model = clip_model
        # Domain-specific prompt templates
        self.templates = {
            'general': [
                "a photo of a {}",
                "a picture of a {}",
                "an image of a {}",
            ],
            'animals': [
                "a photo of a {}, a type of animal",
                "a wildlife photo of a {}",
                "a {} in its natural habitat",
            ],
            'food': [
                "a photo of {}, a type of food",
                "a dish of {}",
                "a plate of {}",
            ],
            'places': [
                "a photo of a {}, a type of place",
                "a scenic view of {}",
                "a photograph taken in {}",
            ]
        }

    def get_text_features(self, class_names, domain='general'):
        """Get averaged text features for classes."""
        templates = self.templates[domain]
        all_features = []
        for template in templates:
            prompts = [template.format(name) for name in class_names]
            tokens = clip.tokenize(prompts)
            with torch.no_grad():
                features = self.model.encode_text(tokens)
                features = features / features.norm(dim=-1, keepdim=True)
            all_features.append(features)
        # Average and renormalize
        avg_features = torch.stack(all_features).mean(dim=0)
        avg_features = avg_features / avg_features.norm(dim=-1, keepdim=True)
        return avg_features
```
## Handling the Bias Problem in GZSL

### Calibrated Stacking
Reduce bias toward seen classes:
```python
class CalibratedGZSL:
    def __init__(self, model, calibration_factor=0.5):
        self.model = model
        self.calibration_factor = calibration_factor

    def predict(self, images, seen_classes, unseen_classes):
        """Calibrated prediction for GZSL."""
        with torch.no_grad():
            scores = self.model(images)
            # Apply calibration: reduce seen-class scores
            scores[:, seen_classes] -= self.calibration_factor
        return scores.argmax(dim=1)

    def find_best_calibration(self, val_images, val_labels,
                              seen_classes, unseen_classes):
        """Find the optimal calibration factor on a validation set."""
        best_harmonic = 0
        best_factor = 0
        for factor in torch.arange(0, 1.5, 0.1):
            self.calibration_factor = factor.item()
            predictions = self.predict(val_images, seen_classes, unseen_classes)
            _, _, harmonic = evaluate_gzsl(
                predictions, val_labels, seen_classes, unseen_classes
            )
            if harmonic > best_harmonic:
                best_harmonic = harmonic
                best_factor = self.calibration_factor
        self.calibration_factor = best_factor
        return best_factor
```
### Semantic Consistency Loss
Ensure generated features are semantically correct:
```python
def semantic_consistency_loss(generated_features, attributes, classifier):
    """Ensure generated features can be correctly classified."""
    # Predict attributes from generated features
    predicted_attrs = classifier(generated_features)
    # MSE loss between predicted and true attributes
    return F.mse_loss(predicted_attrs, attributes)

def cycle_consistency_loss(generator, feature_encoder,
                           real_features, attributes):
    """Cycle consistency: generate features, then encode them back."""
    # Generate features from attributes
    fake_features = generator(attributes)
    # Encode generated features back to attributes
    reconstructed_attrs = feature_encoder(fake_features)
    # Should match the original attributes
    return F.mse_loss(reconstructed_attrs, attributes)
```
## Domain Adaptation for Zero-Shot

### Cross-Modal Alignment
```python
class CrossModalAlignment(nn.Module):
    def __init__(self, visual_dim, semantic_dim, hidden_dim):
        super().__init__()
        # Visual -> shared space
        self.visual_encoder = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # Semantic -> shared space
        self.semantic_encoder = nn.Sequential(
            nn.Linear(semantic_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # Domain discriminator (for adversarial alignment; training loop not shown)
        self.discriminator = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, visual_features, semantic_vectors):
        v_embed = self.visual_encoder(visual_features)
        s_embed = self.semantic_encoder(semantic_vectors)
        return v_embed, s_embed

    def alignment_loss(self, visual_features, semantic_vectors, labels):
        v_embed, s_embed = self.forward(visual_features, semantic_vectors)
        # Contrastive alignment: each visual embedding should match its class semantic
        similarity = torch.mm(F.normalize(v_embed, dim=1),
                              F.normalize(s_embed, dim=1).t())
        return F.cross_entropy(similarity, labels)
```
## Applications

### Open-Vocabulary Object Detection
```python
class OpenVocabDetector:
    """Detect objects from any class using text descriptions."""
    def __init__(self, detector, clip_model):
        self.detector = detector
        self.clip_model = clip_model

    def detect(self, image, class_names):
        # Get region proposals (assumed already cropped and preprocessed for CLIP)
        proposals = self.detector.get_proposals(image)
        # Extract visual features for each proposal
        visual_features = self.clip_model.encode_image(proposals)
        # Text features for class names (prompts must be tokenized first)
        prompts = clip.tokenize([f"a photo of a {name}" for name in class_names])
        text_features = self.clip_model.encode_text(prompts)
        # Match proposals to classes
        similarity = visual_features @ text_features.t()
        predictions = similarity.argmax(dim=1)
        return proposals, predictions, similarity
```
### Zero-Shot Image Captioning
```python
class ZeroShotCaptioner:
    """Generate captions for novel visual concepts."""
    def __init__(self, vision_encoder, language_model):
        self.vision_encoder = vision_encoder
        self.language_model = language_model

    def generate_caption(self, image, novel_concepts):
        # Encode the image
        visual_features = self.vision_encoder(image)
        # Embed descriptions of any novel concepts
        # (get_concept_embeddings is an assumed helper, e.g. a text encoder)
        concept_embeddings = self.get_concept_embeddings(novel_concepts)
        # Condition the language model on visual and concept context
        caption = self.language_model.generate(
            visual_context=visual_features,
            concept_context=concept_embeddings
        )
        return caption
```
### Zero-Shot Action Recognition
```python
class ZeroShotActionRecognition:
    def __init__(self, video_encoder, text_encoder):
        self.video_encoder = video_encoder
        self.text_encoder = text_encoder

    def classify_action(self, video_frames, action_descriptions):
        # Encode the video
        video_features = self.video_encoder(video_frames)
        # Encode action descriptions, e.g. "a person playing basketball"
        text_features = self.text_encoder(action_descriptions)
        # Similarity matching
        similarity = F.cosine_similarity(
            video_features.unsqueeze(1),
            text_features.unsqueeze(0),
            dim=-1
        )
        return similarity.argmax(dim=1)
```
## Benchmarks and Datasets

### Common Benchmarks
| Dataset | Classes | Attributes | Images | Split (seen/unseen) |
|---------|---------|------------|--------|---------------------|
| CUB-200 | 200 | 312 | 11,788 | 150/50 |
| SUN | 717 | 102 | 14,340 | 645/72 |
| AWA1 | 50 | 85 | 30,475 | 40/10 |
| AWA2 | 50 | 85 | 37,322 | 40/10 |
| aPY | 32 | 64 | 15,339 | 20/12 |
### Evaluation Protocol
```python
def evaluate_zsl_gzsl(model, test_data, seen_classes, unseen_classes):
    """Complete evaluation for ZSL and GZSL."""
    results = {}
    # ZSL: test only on unseen classes
    unseen_mask = torch.isin(test_data['labels'], unseen_classes)
    unseen_images = test_data['images'][unseen_mask]
    unseen_labels = test_data['labels'][unseen_mask]
    # Restrict predictions to unseen classes only
    unseen_preds = model.predict(unseen_images, unseen_classes)
    results['zsl_accuracy'] = (unseen_preds == unseen_labels).float().mean()
    # GZSL: test on all classes
    all_preds = model.predict(test_data['images'],
                              torch.cat([seen_classes, unseen_classes]))
    seen_mask = torch.isin(test_data['labels'], seen_classes)
    unseen_mask = torch.isin(test_data['labels'], unseen_classes)
    results['seen_accuracy'] = (all_preds[seen_mask] ==
                                test_data['labels'][seen_mask]).float().mean()
    results['unseen_accuracy'] = (all_preds[unseen_mask] ==
                                  test_data['labels'][unseen_mask]).float().mean()
    results['harmonic_mean'] = (2 * results['seen_accuracy'] * results['unseen_accuracy'] /
                                (results['seen_accuracy'] + results['unseen_accuracy'] + 1e-8))
    return results
```
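One caveat: the widely used benchmark protocol for the datasets above (the "proposed splits" of Xian et al.) reports average *per-class* accuracy rather than the per-sample accuracy computed here, so that rare classes count equally. A sketch of that variant:

```python
import torch

def per_class_accuracy(predictions, labels, target_classes):
    """Average of per-class accuracies over target_classes."""
    accs = []
    for c in target_classes:
        mask = labels == c
        if mask.any():  # skip classes absent from this test set
            accs.append((predictions[mask] == c).float().mean())
    return torch.stack(accs).mean()
```

Substituting this for the per-sample means above reproduces the standard benchmark numbers more faithfully.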
## Conclusion
Zero-shot learning represents a fundamental step toward more generalizable AI systems. By leveraging auxiliary information—attributes, text descriptions, or semantic embeddings—models can recognize concepts never seen during training.
Key takeaways:
- The bridge: Auxiliary information connects seen and unseen classes
- Embedding approaches: Map visual and semantic information to shared space
- Generative methods: Synthesize features for unseen classes
- Large-scale models: CLIP enables powerful zero-shot capabilities
- GZSL challenge: Handling bias toward seen classes
- Broad applications: Object detection, captioning, action recognition
As vision-language models like CLIP continue to advance, zero-shot learning becomes increasingly practical. The ability to recognize novel concepts without retraining opens doors to more flexible, adaptable AI systems that can handle the open-ended nature of the real world.