Zero-shot learning represents one of the most ambitious goals in artificial intelligence: enabling machines to recognize and classify objects or concepts they have never seen during training. By leveraging auxiliary information like semantic descriptions or attributes, zero-shot learning systems can generalize to entirely new categories without any labeled examples. This comprehensive guide explores the principles, methods, and applications of zero-shot learning.

The Zero-Shot Learning Problem

Defining Zero-Shot Learning

In traditional supervised learning, we train on classes A, B, C and test on the same classes. Zero-shot learning fundamentally changes this:

  • Training: Learn from classes A, B, C (seen classes)
  • Testing: Recognize classes X, Y, Z (unseen classes)

The key question: How can we recognize something we’ve never seen?

The Bridge: Auxiliary Information

Zero-shot learning relies on auxiliary information that connects seen and unseen classes:

  1. Attributes: Describable properties (has stripes, four legs, can fly)
  2. Text descriptions: Natural language definitions or Wikipedia articles
  3. Word embeddings: Semantic vectors from language models
  4. Knowledge graphs: Relationships between concepts
  5. Class hierarchies: Taxonomic structure

```python
# Example: attribute-based zero-shot learning.
# Each class is described by a binary attribute vector.
class_attributes = {
    'zebra':   [1, 0, 1, 1, 0, 1, 0, 1],  # black, white, stripes, four_legs, ...
    'tiger':   [0, 1, 1, 1, 1, 0, 0, 1],  # orange, striped, carnivore, ...
    'penguin': [1, 1, 0, 0, 0, 0, 1, 0],  # black, white, flightless, ...
}

# We never saw a penguin during training, but we know its attributes.
# At test time: match image features to attribute descriptions.
```

Settings and Evaluation

Conventional Zero-Shot Learning (ZSL): Test only on unseen classes

Generalized Zero-Shot Learning (GZSL): Test on both seen and unseen classes

  • More realistic but harder
  • Models tend to be biased toward seen classes

```python
# Imports used throughout this article
import torch
import torch.nn as nn
import torch.nn.functional as F

def evaluate_zsl(predictions, labels, unseen_classes):
    """Accuracy on unseen classes only."""
    mask = torch.isin(labels, unseen_classes)
    correct = (predictions[mask] == labels[mask]).sum()
    return correct / mask.sum()

def evaluate_gzsl(predictions, labels, seen_classes, unseen_classes):
    """Harmonic mean of seen and unseen accuracy."""
    seen_mask = torch.isin(labels, seen_classes)
    unseen_mask = torch.isin(labels, unseen_classes)
    seen_acc = (predictions[seen_mask] == labels[seen_mask]).float().mean()
    unseen_acc = (predictions[unseen_mask] == labels[unseen_mask]).float().mean()
    harmonic = 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc + 1e-8)
    return seen_acc, unseen_acc, harmonic
```

Attribute-Based Zero-Shot Learning

Direct Attribute Prediction (DAP)

Predict attributes from images, then match to class:

```python
class DAP(nn.Module):
    def __init__(self, encoder, num_attributes):
        super().__init__()
        self.encoder = encoder
        self.attribute_predictor = nn.Sequential(
            nn.Linear(encoder.output_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_attributes),
            nn.Sigmoid()
        )
        self.class_attributes = None  # Set during setup

    def forward(self, images):
        features = self.encoder(images)
        return self.attribute_predictor(features)

    def predict_class(self, images, class_attributes):
        """Predict class by matching predicted attributes to class signatures."""
        pred_attrs = self.forward(images)  # [batch, num_attributes]
        # Compute compatibility with each class
        # class_attributes: [num_classes, num_attributes]
        compatibility = torch.mm(pred_attrs, class_attributes.t())
        return compatibility.argmax(dim=1)
```

Indirect Attribute Prediction (IAP)

Predict seen class probabilities, then transfer via attributes:

```python
class IAP(nn.Module):
    def __init__(self, encoder, seen_classes, unseen_classes, attributes):
        super().__init__()
        self.encoder = encoder
        self.seen_classifier = nn.Linear(encoder.output_dim, len(seen_classes))
        # Attribute matrices
        self.seen_attrs = attributes[seen_classes]      # [num_seen, num_attrs]
        self.unseen_attrs = attributes[unseen_classes]  # [num_unseen, num_attrs]

    def forward(self, images):
        features = self.encoder(images)
        seen_logits = self.seen_classifier(features)
        seen_probs = F.softmax(seen_logits, dim=1)
        # Transfer to unseen via attributes:
        # P(unseen) ∝ sum over seen: P(seen) * similarity(seen, unseen)
        similarity = torch.mm(self.seen_attrs, self.unseen_attrs.t())
        similarity = F.normalize(similarity, dim=0)
        unseen_probs = torch.mm(seen_probs, similarity)
        return unseen_probs
```

Embedding-Based Zero-Shot Learning

Learning Visual-Semantic Embeddings

Map images and class representations to a shared space:

```python
class VisualSemanticEmbedding(nn.Module):
    def __init__(self, image_encoder, text_encoder, embed_dim):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Projection layers into the shared embedding space
        self.image_proj = nn.Linear(image_encoder.output_dim, embed_dim)
        self.text_proj = nn.Linear(text_encoder.output_dim, embed_dim)

    def encode_image(self, image):
        features = self.image_encoder(image)
        return F.normalize(self.image_proj(features), dim=-1)

    def encode_text(self, text):
        features = self.text_encoder(text)
        return F.normalize(self.text_proj(features), dim=-1)

    def forward(self, images, class_descriptions):
        image_embeds = self.encode_image(images)
        class_embeds = self.encode_text(class_descriptions)
        # Compatibility scores via cosine similarity
        scores = torch.mm(image_embeds, class_embeds.t())
        return scores
```

Word Vector Based ZSL

Use pretrained word embeddings for class names:

```python
import gensim  # e.g. load pretrained vectors with gensim.models.KeyedVectors

class WordVectorZSL(nn.Module):
    def __init__(self, encoder, word_vectors, class_names):
        super().__init__()
        self.encoder = encoder
        # Look up word vectors for all class names
        self.class_vectors = torch.stack([
            torch.tensor(word_vectors[name])
            for name in class_names
        ])
        # Project visual features into the word vector space
        self.projection = nn.Linear(
            encoder.output_dim,
            self.class_vectors.size(-1)
        )

    def forward(self, images):
        features = self.encoder(images)
        projected = self.projection(features)
        # Cosine similarity to class word vectors
        projected = F.normalize(projected, dim=-1)
        class_vecs = F.normalize(self.class_vectors, dim=-1)
        scores = torch.mm(projected, class_vecs.t())
        return scores
```

Structured Joint Embedding (SJE)

Bilinear compatibility function:

```python
class SJE(nn.Module):
    def __init__(self, visual_dim, semantic_dim):
        super().__init__()
        # Bilinear compatibility: x^T W s
        self.W = nn.Parameter(torch.empty(visual_dim, semantic_dim))
        nn.init.xavier_uniform_(self.W)

    def compatibility(self, visual_features, semantic_vectors):
        """Compute compatibility scores.

        visual_features: [batch, visual_dim]
        semantic_vectors: [num_classes, semantic_dim]
        """
        # Bilinear form x^T W s for every (image, class) pair
        return torch.mm(torch.mm(visual_features, self.W),
                        semantic_vectors.t())

    def forward(self, visual_features, class_semantics, labels):
        scores = self.compatibility(visual_features, class_semantics)
        # Ranking loss: the correct class should score higher than all others
        positive_scores = scores[range(len(labels)), labels]
        loss = 0.0
        for i, (score, label) in enumerate(zip(scores, labels)):
            # Hinge loss against each negative class (margin = 1)
            margins = score - positive_scores[i] + 1
            margins[label] = 0  # Don't penalize the positive class
            loss += F.relu(margins).sum()
        return loss / len(labels)
```

ALE (Attribute Label Embedding)

Learn to embed both images and labels:

```python
class ALE(nn.Module):
    def __init__(self, visual_dim, attribute_dim, embed_dim):
        super().__init__()
        self.visual_embed = nn.Linear(visual_dim, embed_dim)
        self.attribute_embed = nn.Linear(attribute_dim, embed_dim)

    def forward(self, visual_features, class_attributes, labels):
        V = self.visual_embed(visual_features)      # [batch, embed]
        A = self.attribute_embed(class_attributes)  # [num_classes, embed]
        # Compatibility scores between every image and every class
        scores = torch.mm(V, A.t())
        # Classification loss
        return F.cross_entropy(scores, labels)

    def predict(self, visual_features, class_attributes):
        V = self.visual_embed(visual_features)
        A = self.attribute_embed(class_attributes)
        scores = torch.mm(V, A.t())
        return scores.argmax(dim=1)
```

Generative Approaches

Feature Generation

Generate visual features for unseen classes:

```python
class FeatureGenerator(nn.Module):
    def __init__(self, semantic_dim, noise_dim, feature_dim):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(semantic_dim + noise_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, feature_dim),
            nn.ReLU()
        )
        self.noise_dim = noise_dim

    def forward(self, semantic_vectors, batch_size=1):
        # Accept a single semantic vector or a batch of them
        if semantic_vectors.dim() == 1:
            semantic_vectors = semantic_vectors.unsqueeze(0).repeat(batch_size, 1)
        # Sample one noise vector per generated feature
        noise = torch.randn(semantic_vectors.size(0), self.noise_dim)
        input_vec = torch.cat([semantic_vectors, noise], dim=1)
        return self.generator(input_vec)

class ConditionalGAN(nn.Module):
    def __init__(self, semantic_dim, noise_dim, feature_dim):
        super().__init__()
        self.generator = FeatureGenerator(semantic_dim, noise_dim, feature_dim)
        self.discriminator = nn.Sequential(
            nn.Linear(feature_dim + semantic_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def train_step(self, real_features, semantic_vectors):
        # Generate fake features conditioned on class semantics
        fake_features = self.generator(semantic_vectors)
        # Discriminator loss: distinguish real from generated features
        real_input = torch.cat([real_features, semantic_vectors], dim=1)
        fake_input = torch.cat([fake_features.detach(), semantic_vectors], dim=1)
        real_preds = self.discriminator(real_input)
        fake_preds = self.discriminator(fake_input)
        d_loss = -torch.mean(torch.log(real_preds + 1e-8) +
                             torch.log(1 - fake_preds + 1e-8))
        # Generator loss: fool the discriminator
        fake_preds_g = self.discriminator(
            torch.cat([fake_features, semantic_vectors], dim=1)
        )
        g_loss = -torch.mean(torch.log(fake_preds_g + 1e-8))
        return d_loss, g_loss
```

f-CLSWGAN

Feature-generating CLSWGAN for zero-shot learning:

```python
class CLSWGAN(nn.Module):
    def __init__(self, attribute_dim, noise_dim, feature_dim, num_classes):
        super().__init__()
        self.generator = nn.Sequential(
            nn.Linear(attribute_dim + noise_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, feature_dim),
            nn.ReLU()
        )
        # Critic (WGAN-GP uses a critic, not a discriminator)
        self.critic = nn.Sequential(
            nn.Linear(feature_dim + attribute_dim, 4096),
            nn.LeakyReLU(0.2),
            nn.Linear(4096, 1)
        )
        # Classifier for the semantic-consistency (classification) loss
        self.classifier = nn.Linear(feature_dim, num_classes)
        self.noise_dim = noise_dim
        self.feature_dim = feature_dim

    def generate(self, attributes, num_samples=100):
        """Generate features for the given class attributes."""
        noise = torch.randn(num_samples, self.noise_dim)
        attrs = attributes.unsqueeze(0).repeat(num_samples, 1)
        input_vec = torch.cat([attrs, noise], dim=1)
        return self.generator(input_vec)

    def gradient_penalty(self, real_features, fake_features, attributes):
        """Compute the gradient penalty for WGAN-GP."""
        # Random interpolation between real and fake features
        alpha = torch.rand(real_features.size(0), 1)
        interpolates = alpha * real_features + (1 - alpha) * fake_features
        interpolates.requires_grad_(True)
        critic_interpolates = self.critic(
            torch.cat([interpolates, attributes], dim=1)
        )
        gradients = torch.autograd.grad(
            outputs=critic_interpolates,
            inputs=interpolates,
            grad_outputs=torch.ones_like(critic_interpolates),
            create_graph=True
        )[0]
        # Penalize critic gradients whose norm deviates from 1
        return ((gradients.norm(2, dim=1) - 1) ** 2).mean()
```
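The training objective is omitted above. As a hedged sketch (the function name and signature are my own, not from the f-CLSWGAN code), the WGAN-GP losses can be assembled from critic scores and the gradient penalty like this:

```python
import torch

def wgan_gp_losses(critic_real, critic_fake, grad_penalty, lambda_gp=10.0):
    """WGAN-GP objectives given critic scores (higher = judged more real).

    The critic minimizes fake - real scores plus the gradient penalty;
    the generator minimizes the negated critic score on its fakes.
    """
    d_loss = critic_fake.mean() - critic_real.mean() + lambda_gp * grad_penalty
    g_loss = -critic_fake.mean()
    return d_loss, g_loss
```

In f-CLSWGAN, a cross-entropy term from `self.classifier` applied to the generated features is added to the generator loss so that synthesized features stay discriminative.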

VAE-Based Generation

Variational autoencoder for feature generation:

```python
class CVAE(nn.Module):
    """Conditional VAE for zero-shot feature generation."""

    def __init__(self, feature_dim, attribute_dim, latent_dim):
        super().__init__()
        self.latent_dim = latent_dim
        # Encoder: features + attributes -> latent
        self.encoder = nn.Sequential(
            nn.Linear(feature_dim + attribute_dim, 1024),
            nn.ReLU()
        )
        self.fc_mu = nn.Linear(1024, latent_dim)
        self.fc_var = nn.Linear(1024, latent_dim)
        # Decoder: latent + attributes -> features
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + attribute_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, feature_dim)
        )

    def encode(self, features, attributes):
        x = torch.cat([features, attributes], dim=1)
        h = self.encoder(x)
        return self.fc_mu(h), self.fc_var(h)

    def reparameterize(self, mu, log_var):
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z, attributes):
        x = torch.cat([z, attributes], dim=1)
        return self.decoder(x)

    def forward(self, features, attributes):
        mu, log_var = self.encode(features, attributes)
        z = self.reparameterize(mu, log_var)
        recon = self.decode(z, attributes)
        return recon, mu, log_var

    def generate(self, attributes, num_samples=100):
        """Generate features for an unseen class by sampling the prior."""
        z = torch.randn(num_samples, self.latent_dim)
        attrs = attributes.unsqueeze(0).repeat(num_samples, 1)
        return self.decode(z, attrs)
```
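The CVAE above defines the networks but not the training objective. It is trained with the standard conditional-VAE loss: reconstruction error plus the KL divergence of the approximate posterior from a standard normal prior. A minimal sketch (the function name and `beta` weight are my own):

```python
import torch
import torch.nn.functional as F

def cvae_loss(recon, features, mu, log_var, beta=1.0):
    # Reconstruction term: generated features should match the real ones
    recon_loss = F.mse_loss(recon, features, reduction='sum')
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon_loss + beta * kl
```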

CLIP and Large-Scale Zero-Shot

CLIP for Zero-Shot Classification

```python
import clip

class CLIPZeroShot:
    def __init__(self, model_name='ViT-B/32'):
        self.model, self.preprocess = clip.load(model_name)
        self.model.eval()

    def classify(self, images, class_names, prompt_template="a photo of a {}"):
        """Zero-shot classification using CLIP."""
        # Create one text prompt per class
        prompts = [prompt_template.format(name) for name in class_names]
        text_tokens = clip.tokenize(prompts)
        with torch.no_grad():
            # Encode images and text into the shared embedding space
            image_features = self.model.encode_image(images)
            text_features = self.model.encode_text(text_tokens)
            # Normalize to unit length
            image_features = image_features / image_features.norm(dim=-1, keepdim=True)
            text_features = text_features / text_features.norm(dim=-1, keepdim=True)
            # Scaled cosine similarity, softmaxed into probabilities
            similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
        return similarity.argmax(dim=1), similarity

    def classify_with_prompts(self, images, class_names):
        """Ensemble multiple prompt templates for better results."""
        prompt_templates = [
            "a photo of a {}",
            "a picture of a {}",
            "a {} in a photo",
            "an image showing a {}",
            "a photograph of a {}"
        ]
        all_probs = []
        for template in prompt_templates:
            _, probs = self.classify(images, class_names, template)
            all_probs.append(probs)
        # Average probabilities across templates
        avg_probs = torch.stack(all_probs).mean(dim=0)
        return avg_probs.argmax(dim=1), avg_probs
```

Prompt Engineering for Zero-Shot

```python
class PromptEnsemble:
    """Ensemble multiple prompts for better zero-shot performance."""

    def __init__(self, clip_model):
        self.model = clip_model
        # Domain-specific prompt templates
        self.templates = {
            'general': [
                "a photo of a {}",
                "a picture of a {}",
                "an image of a {}",
            ],
            'animals': [
                "a photo of a {}, a type of animal",
                "a wildlife photo of a {}",
                "a {} in its natural habitat",
            ],
            'food': [
                "a photo of {}, a type of food",
                "a dish of {}",
                "a plate of {}",
            ],
            'places': [
                "a photo of a {}, a type of place",
                "a scenic view of {}",
                "a photograph taken in {}",
            ]
        }

    def get_text_features(self, class_names, domain='general'):
        """Get averaged text features for the given classes."""
        templates = self.templates[domain]
        all_features = []
        for template in templates:
            prompts = [template.format(name) for name in class_names]
            tokens = clip.tokenize(prompts)
            with torch.no_grad():
                features = self.model.encode_text(tokens)
                features = features / features.norm(dim=-1, keepdim=True)
            all_features.append(features)
        # Average across templates and renormalize
        avg_features = torch.stack(all_features).mean(dim=0)
        avg_features = avg_features / avg_features.norm(dim=-1, keepdim=True)
        return avg_features
```

Handling the Bias Problem in GZSL

Calibrated Stacking

Reduce bias toward seen classes:

```python
class CalibratedGZSL:
    def __init__(self, model, calibration_factor=0.5):
        self.model = model
        self.calibration_factor = calibration_factor

    def predict(self, images, seen_classes, unseen_classes):
        """Calibrated prediction for GZSL."""
        with torch.no_grad():
            scores = self.model(images)
            # Calibration: reduce seen-class scores by a fixed amount
            scores[:, seen_classes] -= self.calibration_factor
        return scores.argmax(dim=1)

    def find_best_calibration(self, val_images, val_labels,
                              seen_classes, unseen_classes):
        """Grid-search the calibration factor on a validation set."""
        best_harmonic = 0
        best_factor = 0
        for factor in torch.arange(0, 1.5, 0.1):
            self.calibration_factor = factor
            predictions = self.predict(val_images, seen_classes, unseen_classes)
            _, _, harmonic = evaluate_gzsl(
                predictions, val_labels, seen_classes, unseen_classes
            )
            if harmonic > best_harmonic:
                best_harmonic = harmonic
                best_factor = factor
        self.calibration_factor = best_factor
        return best_factor
```

Semantic Consistency Loss

Ensure generated features are semantically correct:

```python
def semantic_consistency_loss(generated_features, attributes, classifier):
    """Generated features should predict the correct attributes."""
    predicted_attrs = classifier(generated_features)
    # MSE between predicted and true attributes
    return F.mse_loss(predicted_attrs, attributes)

def cycle_consistency_loss(generator, feature_encoder,
                           real_features, attributes):
    """Cycle consistency: attributes -> features -> attributes."""
    # Generate features from attributes
    fake_features = generator(attributes)
    # Encode generated features back to attribute space
    reconstructed_attrs = feature_encoder(fake_features)
    # The reconstruction should match the original attributes
    return F.mse_loss(reconstructed_attrs, attributes)
```

Domain Adaptation for Zero-Shot

Cross-Modal Alignment

```python
class CrossModalAlignment(nn.Module):
    def __init__(self, visual_dim, semantic_dim, hidden_dim):
        super().__init__()
        # Visual -> shared space
        self.visual_encoder = nn.Sequential(
            nn.Linear(visual_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # Semantic -> shared space
        self.semantic_encoder = nn.Sequential(
            nn.Linear(semantic_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim)
        )
        # Domain discriminator (for adversarial alignment)
        self.discriminator = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, visual_features, semantic_vectors):
        v_embed = self.visual_encoder(visual_features)
        s_embed = self.semantic_encoder(semantic_vectors)
        return v_embed, s_embed

    def alignment_loss(self, visual_features, semantic_vectors, labels):
        v_embed, s_embed = self.forward(visual_features, semantic_vectors)
        # Contrastive alignment: similarity between all (visual, semantic) pairs
        similarity = torch.mm(F.normalize(v_embed, dim=1),
                              F.normalize(s_embed, dim=1).t())
        # Cross-entropy: each visual embedding should match its class semantics
        return F.cross_entropy(similarity, labels)
```

Applications

Open-Vocabulary Object Detection

```python
class OpenVocabDetector:
    """Detect objects from any class using text descriptions."""

    def __init__(self, detector, clip_model):
        self.detector = detector
        self.clip_model = clip_model

    def detect(self, image, class_names):
        # Get region proposals from the base detector
        proposals = self.detector.get_proposals(image)
        # Extract visual features for each proposal
        visual_features = self.clip_model.encode_image(proposals)
        # Text features for the class names
        prompts = [f"a photo of a {name}" for name in class_names]
        text_features = self.clip_model.encode_text(prompts)
        # Match proposals to classes by similarity
        similarity = visual_features @ text_features.t()
        predictions = similarity.argmax(dim=1)
        return proposals, predictions, similarity
```

Zero-Shot Image Captioning

```python
class ZeroShotCaptioner:
    """Generate captions for novel visual concepts."""

    def __init__(self, vision_encoder, language_model):
        self.vision_encoder = vision_encoder
        self.language_model = language_model

    def generate_caption(self, image, novel_concepts):
        # Encode the image
        visual_features = self.vision_encoder(image)
        # Condition the language model on visual features
        # and descriptions of any novel concepts
        concept_embeddings = self.get_concept_embeddings(novel_concepts)
        caption = self.language_model.generate(
            visual_context=visual_features,
            concept_context=concept_embeddings
        )
        return caption
```

Zero-Shot Action Recognition

```python
class ZeroShotActionRecognition:
    def __init__(self, video_encoder, text_encoder):
        self.video_encoder = video_encoder
        self.text_encoder = text_encoder

    def classify_action(self, video_frames, action_descriptions):
        # Encode the video
        video_features = self.video_encoder(video_frames)
        # Encode action descriptions, e.g. "a person playing basketball"
        text_features = self.text_encoder(action_descriptions)
        # Cosine similarity between every video and every description
        similarity = F.cosine_similarity(
            video_features.unsqueeze(1),
            text_features.unsqueeze(0),
            dim=-1
        )
        return similarity.argmax(dim=1)
```

Benchmarks and Datasets

Common Benchmarks

| Dataset | Classes | Attributes | Images | Split (seen/unseen) |
|---------|---------|------------|--------|---------------------|
| CUB-200 | 200 | 312 | 11,788 | 150/50 |
| SUN | 717 | 102 | 14,340 | 645/72 |
| AWA1 | 50 | 85 | 30,475 | 40/10 |
| AWA2 | 50 | 85 | 37,322 | 40/10 |
| aPY | 32 | 64 | 15,339 | 20/12 |
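A caveat when comparing numbers on these benchmarks: results are typically reported as per-class averaged top-1 accuracy rather than overall accuracy, so that rare classes count as much as frequent ones. A minimal sketch (the helper name is my own):

```python
import torch

def per_class_accuracy(predictions, labels):
    """Mean of per-class accuracies: each class contributes equally,
    regardless of how many test images it has."""
    accs = [(predictions[labels == c] == c).float().mean()
            for c in labels.unique()]
    return torch.stack(accs).mean()
```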

Evaluation Protocol

```python
def evaluate_zsl_gzsl(model, test_data, seen_classes, unseen_classes):
    """Complete evaluation for ZSL and GZSL."""
    results = {}
    # ZSL: test only on unseen classes, restricting predictions
    # to the unseen label space
    unseen_mask = torch.isin(test_data['labels'], unseen_classes)
    unseen_images = test_data['images'][unseen_mask]
    unseen_labels = test_data['labels'][unseen_mask]
    unseen_preds = model.predict(unseen_images, unseen_classes)
    results['zsl_accuracy'] = (unseen_preds == unseen_labels).float().mean()
    # GZSL: test on all classes
    all_preds = model.predict(test_data['images'],
                              torch.cat([seen_classes, unseen_classes]))
    seen_mask = torch.isin(test_data['labels'], seen_classes)
    unseen_mask = torch.isin(test_data['labels'], unseen_classes)
    results['seen_accuracy'] = (all_preds[seen_mask] ==
                                test_data['labels'][seen_mask]).float().mean()
    results['unseen_accuracy'] = (all_preds[unseen_mask] ==
                                  test_data['labels'][unseen_mask]).float().mean()
    results['harmonic_mean'] = (2 * results['seen_accuracy'] * results['unseen_accuracy'] /
                                (results['seen_accuracy'] + results['unseen_accuracy'] + 1e-8))
    return results
```

Conclusion

Zero-shot learning represents a fundamental step toward more generalizable AI systems. By leveraging auxiliary information—attributes, text descriptions, or semantic embeddings—models can recognize concepts never seen during training.

Key takeaways:

  1. The bridge: Auxiliary information connects seen and unseen classes
  2. Embedding approaches: Map visual and semantic information to shared space
  3. Generative methods: Synthesize features for unseen classes
  4. Large-scale models: CLIP enables powerful zero-shot capabilities
  5. GZSL challenge: Handling bias toward seen classes
  6. Broad applications: Object detection, captioning, action recognition

As vision-language models like CLIP continue to advance, zero-shot learning becomes increasingly practical. The ability to recognize novel concepts without retraining opens doors to more flexible, adaptable AI systems that can handle the open-ended nature of the real world.
