Neural Architecture Search (NAS) represents a paradigm shift in machine learning: instead of manually designing neural network architectures, we let algorithms search for strong designs automatically. NAS has produced state-of-the-art models across computer vision, natural language processing, and beyond. This guide explores the principles, methods, and practical applications of automated architecture design.
The Promise of Neural Architecture Search
Why Automate Architecture Design?
Manual architecture design has significant limitations:
- Expertise required: Designing good architectures requires deep expertise
- Time-consuming: Months of trial and error for novel domains
- Human bias: We only explore familiar design patterns
- Suboptimality: Human intuition may miss optimal configurations
NAS addresses these by:
- Systematically exploring large design spaces
- Finding non-intuitive but effective architectures
- Adapting architectures to specific hardware constraints
- Reducing the need for expert knowledge
NAS Success Stories
- NASNet: An early NAS-discovered architecture that surpassed leading hand-designed models on ImageNet
- EfficientNet: Baseline network found via NAS, then enlarged with compound scaling
- MobileNetV3: NAS-optimized for mobile deployment
- AmoebaNet: Found with regularized evolution, reaching state-of-the-art ImageNet accuracy at the time
- Evolved Transformer: NAS-designed transformer variant
The NAS Framework
Three Key Components
```python
class NASFramework:
    """
    NAS consists of three main components:
    - Search Space: what architectures are possible
    - Search Strategy: how to explore the space
    - Evaluation Strategy: how to assess architectures
    """
    def __init__(self, search_space, search_strategy, evaluator):
        self.search_space = search_space
        self.search_strategy = search_strategy
        self.evaluator = evaluator

    def search(self, budget):
        best_architecture = None
        best_score = float('-inf')
        while not self.budget_exhausted(budget):
            # Sample architecture from search space
            architecture = self.search_strategy.sample(self.search_space)
            # Evaluate architecture
            score = self.evaluator.evaluate(architecture)
            # Update search strategy
            self.search_strategy.update(architecture, score)
            if score > best_score:
                best_score = score
                best_architecture = architecture
        return best_architecture, best_score
```
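The loop above can be instantiated with the simplest possible strategy: uniform random sampling with a no-op update, which is also a standard NAS baseline. A minimal sketch, where the `(depth, width)` tuples and the toy scoring function are made up for illustration:

```python
import random

def random_search(sample_fn, evaluate_fn, budget):
    """Simplest instance of the NAS loop: the search strategy is
    uniform random sampling and the update step is a no-op."""
    best_arch, best_score = None, float('-inf')
    for _ in range(budget):
        arch = sample_fn()
        score = evaluate_fn(arch)
        if score > best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score

# Toy example: "architectures" are (depth, width) pairs and the
# "evaluator" is a made-up score that peaks at depth 4, width 64.
random.seed(0)
sample = lambda: (random.randint(1, 8), random.choice([16, 32, 64, 128]))
score = lambda a: -abs(a[0] - 4) - abs(a[1] - 64) / 16
arch, best = random_search(sample, score, budget=200)
```

Random search is surprisingly competitive in NAS benchmarks, which makes it a useful sanity check before investing in more elaborate strategies.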
Search Spaces
Cell-Based Search Space
Define a small cell and stack it to form the network:
```python
import random

import torch.nn as nn


class CellSearchSpace:
    """
    Search for a repeatable cell structure.
    The full network is built by stacking cells.
    """
    def __init__(self, num_nodes=4):
        self.num_nodes = num_nodes  # Intermediate nodes in the cell
        self.operations = [
            'identity',
            'conv_3x3',
            'conv_5x5',
            'sep_conv_3x3',
            'sep_conv_5x5',
            'max_pool_3x3',
            'avg_pool_3x3',
            'dilated_conv_3x3',
            'skip_connect',
            'none',  # No connection
        ]

    def sample_cell(self):
        """Sample a random cell architecture."""
        cell = []
        for node_idx in range(2, self.num_nodes + 2):
            # Each node receives input from two earlier nodes
            input1 = random.randint(0, node_idx - 1)
            input2 = random.randint(0, node_idx - 1)
            op1 = random.choice(self.operations)
            op2 = random.choice(self.operations)
            cell.append((input1, op1, input2, op2))
        return cell

    def cell_to_network(self, normal_cell, reduction_cell, num_cells=8):
        """Build the full network by stacking cells."""
        layers = []
        in_channels = 3
        for i in range(num_cells):
            if i in (num_cells // 3, 2 * num_cells // 3):
                # Reduction cell: halve spatial size, double channels
                cell = Cell(reduction_cell, in_channels, in_channels * 2, stride=2)
                in_channels *= 2
            else:
                # Normal cell: preserve shape
                cell = Cell(normal_cell, in_channels, in_channels, stride=1)
            layers.append(cell)
        return nn.Sequential(*layers)


class Cell(nn.Module):
    def __init__(self, cell_spec, in_channels, out_channels, stride):
        super().__init__()
        self.cell_spec = cell_spec
        self.ops = nn.ModuleList()
        for input1, op1, input2, op2 in cell_spec:
            self.ops.append(self._make_op(op1, in_channels, out_channels, stride))
            self.ops.append(self._make_op(op2, in_channels, out_channels, stride))

    def _make_op(self, op_name, in_ch, out_ch, stride):
        if op_name == 'conv_3x3':
            return nn.Conv2d(in_ch, out_ch, 3, stride, 1)
        elif op_name == 'sep_conv_3x3':
            return SeparableConv(in_ch, out_ch, 3, stride, 1)
        elif op_name == 'max_pool_3x3':
            return nn.MaxPool2d(3, stride, 1)
        elif op_name == 'identity':
            return nn.Identity() if stride == 1 else nn.Conv2d(in_ch, out_ch, 1, stride)
        # ... other operations
```
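The cell encoding itself is independent of any deep-learning framework. A torch-free sketch of the same sampling logic, with a shortened, illustrative `OPS` list, makes the tuple format and its acyclicity constraint explicit:

```python
import random

OPS = ['identity', 'conv_3x3', 'sep_conv_3x3', 'max_pool_3x3', 'none']

def sample_cell(num_nodes=4, seed=None):
    """Each intermediate node is a tuple (input1, op1, input2, op2),
    where inputs index any earlier node (0 and 1 are the cell inputs)."""
    rng = random.Random(seed)
    cell = []
    for node_idx in range(2, num_nodes + 2):
        cell.append((
            rng.randrange(node_idx), rng.choice(OPS),
            rng.randrange(node_idx), rng.choice(OPS),
        ))
    return cell

def is_valid(cell):
    """Every edge must come from a strictly earlier node (no cycles)."""
    return all(
        i1 < node_idx and i2 < node_idx
        for node_idx, (i1, _, i2, _) in enumerate(cell, start=2)
    )
```

Because `sample_cell` only ever picks inputs below the current node index, every sampled cell is a DAG by construction.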
Network-Level Search Space
Search for the entire network structure:
```python
import random


class NetworkSearchSpace:
    """Search over network topology, not just cells."""
    def __init__(self):
        self.layer_types = ['conv', 'bottleneck', 'mbconv', 'transformer']
        self.depth_range = (1, 10)
        self.width_palette = [16, 32, 64, 128, 256, 512]

    def sample_network(self):
        """Sample a complete network architecture."""
        architecture = []
        num_stages = random.randint(3, 6)
        for stage in range(num_stages):
            # Sample layer type
            layer_type = random.choice(self.layer_types)
            # Sample depth (number of layers in the stage)
            depth = random.randint(*self.depth_range)
            # Sample width (channels) from a discrete palette
            width = random.choice(self.width_palette)
            # Sample remaining hyperparameters
            config = {
                'layer_type': layer_type,
                'depth': depth,
                'width': width,
                'kernel_size': random.choice([3, 5, 7]),
                'expansion': random.choice([1, 2, 4, 6]),
                'stride': 2 if stage > 0 else 1
            }
            architecture.append(config)
        return architecture
```
Hardware-Aware Search Space
Include hardware constraints:
```python
class HardwareAwareSearchSpace:
    def __init__(self, target_latency_ms, target_device='gpu'):
        self.target_latency = target_latency_ms
        self.device = target_device
        # Precomputed latency lookup table
        self.latency_lut = self._build_latency_lut()

    def _build_latency_lut(self):
        """Build a latency lookup table for candidate operations."""
        lut = {}
        for op in ['conv3x3', 'conv5x5', 'dwconv3x3', 'dwconv5x5']:
            for in_ch in [16, 32, 64, 128, 256, 512]:
                for out_ch in [16, 32, 64, 128, 256, 512]:
                    for size in [224, 112, 56, 28, 14, 7]:
                        # Measure on device (or estimate analytically)
                        latency = self._measure_latency(op, in_ch, out_ch, size)
                        lut[(op, in_ch, out_ch, size)] = latency
        return lut

    def estimate_latency(self, architecture):
        """Estimate the total latency of an architecture."""
        total_latency = 0
        for layer_config in architecture:
            key = (
                layer_config['op'],
                layer_config['in_channels'],
                layer_config['out_channels'],
                layer_config['size']
            )
            total_latency += self.latency_lut.get(key, 0)
        return total_latency

    def is_valid(self, architecture):
        """Check whether an architecture meets the latency constraint."""
        return self.estimate_latency(architecture) <= self.target_latency
```
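The lookup-table idea is easy to see in isolation. A minimal sketch with hand-picked, hypothetical latency values (milliseconds assumed; real tables are measured on the target device):

```python
def estimate_latency(architecture, lut, default=0.0):
    """Sum per-layer latencies from a table keyed by
    (op, in_channels, out_channels, spatial_size)."""
    return sum(
        lut.get((l['op'], l['in_channels'], l['out_channels'], l['size']), default)
        for l in architecture
    )

# Hypothetical measured latencies in milliseconds
lut = {
    ('conv3x3', 16, 32, 112): 0.40,
    ('dwconv3x3', 32, 32, 56): 0.15,
}
arch = [
    {'op': 'conv3x3', 'in_channels': 16, 'out_channels': 32, 'size': 112},
    {'op': 'dwconv3x3', 'in_channels': 32, 'out_channels': 32, 'size': 56},
]
total = estimate_latency(arch, lut)
```

Summing per-layer table entries is an approximation: it ignores operator fusion and memory effects, but it is orders of magnitude cheaper than benchmarking every candidate network end to end.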
Search Strategies
Reinforcement Learning
Use an RNN controller to generate architectures:
```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class RLController(nn.Module):
    """LSTM controller that generates architecture descriptions."""
    def __init__(self, search_space, hidden_size=100):
        super().__init__()
        self.search_space = search_space
        self.hidden_size = hidden_size
        self.lstm = nn.LSTMCell(hidden_size, hidden_size)
        # Decoders for the different architecture choices. In practice
        # the input logits are masked so a node can only select earlier
        # nodes; that masking is omitted here for brevity.
        self.op_decoder = nn.Linear(hidden_size, len(search_space.operations))
        self.input_decoder = nn.Linear(hidden_size, search_space.num_nodes)

    def forward(self):
        """Sample an architecture, one decision per LSTM step."""
        batch_size = 1
        # Initialize hidden state
        h = torch.zeros(batch_size, self.hidden_size)
        c = torch.zeros(batch_size, self.hidden_size)
        architecture = []
        log_probs = []
        entropies = []
        for node in range(self.search_space.num_nodes):
            for i in range(2):  # Two inputs per node
                # Generate input selection (previous output fed back in)
                h, c = self.lstm(h, (h, c))
                input_dist = Categorical(logits=self.input_decoder(h))
                input_idx = input_dist.sample()
                log_probs.append(input_dist.log_prob(input_idx))
                entropies.append(input_dist.entropy())
                # Generate operation selection
                h, c = self.lstm(h, (h, c))
                op_dist = Categorical(logits=self.op_decoder(h))
                op_idx = op_dist.sample()
                log_probs.append(op_dist.log_prob(op_idx))
                entropies.append(op_dist.entropy())
                architecture.append((input_idx.item(), op_idx.item()))
        return architecture, torch.cat(log_probs), torch.cat(entropies)


class NASRLTrainer:
    def __init__(self, controller, evaluator):
        self.controller = controller
        self.evaluator = evaluator
        self.optimizer = torch.optim.Adam(controller.parameters(), lr=3.5e-4)
        self.baseline = None

    def train_step(self, num_samples=5):
        """One REINFORCE update over a batch of sampled architectures."""
        architectures = []
        log_probs_list = []
        rewards = []
        # Sample architectures
        for _ in range(num_samples):
            arch, log_probs, entropies = self.controller()
            architectures.append(arch)
            log_probs_list.append(log_probs)
            # Evaluate (the expensive part!)
            accuracy = self.evaluator.evaluate(arch)
            rewards.append(accuracy)
        rewards = torch.tensor(rewards)
        # Exponential moving-average baseline reduces gradient variance
        if self.baseline is None:
            self.baseline = rewards.mean()
        else:
            self.baseline = 0.95 * self.baseline + 0.05 * rewards.mean()
        # Policy gradient loss
        loss = 0
        for log_probs, reward in zip(log_probs_list, rewards):
            advantage = reward - self.baseline
            loss -= (log_probs * advantage).sum()
        loss /= num_samples
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return rewards.max().item()
```
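The moving-average baseline is what keeps the REINFORCE gradients stable: each architecture's advantage is its reward minus a running estimate of the mean reward. A stdlib-only sketch of just that bookkeeping, with made-up accuracies standing in for rewards:

```python
def update_baseline(baseline, reward, decay=0.95):
    """Exponential moving average used to reduce REINFORCE variance."""
    return reward if baseline is None else decay * baseline + (1 - decay) * reward

baseline = None
advantages = []
for reward in [0.60, 0.62, 0.61, 0.70]:
    baseline = update_baseline(baseline, reward)
    advantages.append(reward - baseline)
```

Architectures that beat the running average get a positive advantage (their choices are reinforced); those below it get a negative one, without needing a learned value function.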
Evolutionary Algorithms
Evolve architectures through mutation and selection:
```python
import copy
import random

import numpy as np


class EvolutionaryNAS:
    def __init__(self, search_space, population_size=50, tournament_size=5):
        self.search_space = search_space
        self.population_size = population_size
        self.tournament_size = tournament_size
        # Initialize population with random architectures
        self.population = [
            self.search_space.sample_random()
            for _ in range(population_size)
        ]
        # Fitness of 0 marks an architecture as not yet evaluated
        self.fitness = [0] * population_size

    def mutate(self, architecture):
        """Apply one random mutation to an architecture."""
        mutated = copy.deepcopy(architecture)
        # Only the two implemented mutation types are sampled here;
        # 'add_layer' / 'remove_layer' mutations would extend this.
        mutation = random.choice(['change_op', 'change_input'])
        if mutation == 'change_op':
            # Change one operation
            idx = random.randint(0, len(mutated) - 1)
            mutated[idx]['op'] = random.choice(self.search_space.operations)
        elif mutation == 'change_input':
            # Rewire one connection to an earlier node
            idx = random.randint(0, len(mutated) - 1)
            mutated[idx]['input'] = random.randint(0, idx)
        return mutated

    def tournament_select(self):
        """Select a parent via tournament selection."""
        candidates = random.sample(
            range(self.population_size),
            self.tournament_size
        )
        best_idx = max(candidates, key=lambda i: self.fitness[i])
        return self.population[best_idx]

    def evolve_generation(self, evaluator):
        """Run one generation of evolution."""
        # Evaluate any not-yet-evaluated members
        for i, arch in enumerate(self.population):
            if self.fitness[i] == 0:
                self.fitness[i] = evaluator.evaluate(arch)
        # Create the next generation
        new_population = []
        new_fitness = []
        # Keep the best member unchanged (elitism)
        best_idx = int(np.argmax(self.fitness))
        new_population.append(self.population[best_idx])
        new_fitness.append(self.fitness[best_idx])
        # Fill the rest with mutated tournament winners
        while len(new_population) < self.population_size:
            parent = self.tournament_select()
            child = self.mutate(parent)
            new_population.append(child)
            new_fitness.append(0)  # Not yet evaluated
        self.population = new_population
        self.fitness = new_fitness
        return max(self.fitness)
```
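Tournament selection is worth seeing on its own: selection pressure grows with tournament size, from uniform random parenting at k=1 to greedy selection when k equals the population size. A standalone sketch:

```python
import random

def tournament_select(population, fitness, k, rng=random):
    """Pick k random candidates, return the fittest of them."""
    candidates = rng.sample(range(len(population)), k)
    best = max(candidates, key=lambda i: fitness[i])
    return population[best]
```

Regularized evolution (as used for AmoebaNet) modifies this scheme by removing the *oldest* member each generation instead of the least fit, which favors architectures that stay good when retrained.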
Differentiable NAS (DARTS)
Make architecture search differentiable:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DARTSCell(nn.Module):
    """Differentiable Architecture Search cell."""
    def __init__(self, num_nodes, channels, operations):
        super().__init__()
        self.num_nodes = num_nodes
        self.operations = operations
        # Architecture parameters (learned jointly with the weights):
        # a ModuleList of ParameterLists, one alpha vector per edge
        self.alphas = nn.ModuleList()
        # Operation modules, one ModuleList of candidates per edge
        self.ops = nn.ModuleList()
        for node in range(num_nodes):
            node_alphas = nn.ParameterList()
            node_ops = nn.ModuleList()
            for prev_node in range(node + 2):  # +2 for the two cell inputs
                # Small init keeps the softmax near uniform at the start
                alpha = nn.Parameter(1e-3 * torch.randn(len(operations)))
                node_alphas.append(alpha)
                edge_ops = nn.ModuleList([
                    self._make_op(op, channels)
                    for op in operations
                ])
                node_ops.append(edge_ops)
            self.alphas.append(node_alphas)
            self.ops.append(node_ops)

    def forward(self, s0, s1):
        """
        Forward with mixed operations.
        Architecture weights are applied via softmax.
        """
        states = [s0, s1]
        for node_idx in range(self.num_nodes):
            node_input = 0
            for prev_idx, (alpha, edge_ops) in enumerate(
                zip(self.alphas[node_idx], self.ops[node_idx])
            ):
                # Softmax over candidate operations on this edge
                weights = F.softmax(alpha, dim=0)
                # Mixed operation: weighted sum of all candidates
                mixed = sum(
                    w * op(states[prev_idx])
                    for w, op in zip(weights, edge_ops)
                )
                node_input = node_input + mixed
            states.append(node_input)
        # Concatenate all intermediate nodes
        return torch.cat(states[2:], dim=1)


class DARTSTrainer:
    def __init__(self, model, train_loader, val_loader):
        self.model = model
        self.train_loader = train_loader
        self.val_loader = val_loader
        # Separate optimizers for weights and architecture parameters
        self.weight_optimizer = torch.optim.SGD(
            self._weight_params(), lr=0.025, momentum=0.9
        )
        self.arch_optimizer = torch.optim.Adam(
            self._arch_params(), lr=3e-4
        )

    def _weight_params(self):
        """Yield the network weight parameters."""
        for name, param in self.model.named_parameters():
            if 'alpha' not in name:
                yield param

    def _arch_params(self):
        """Yield the architecture parameters."""
        for name, param in self.model.named_parameters():
            if 'alpha' in name:
                yield param

    def train_step(self, train_batch, val_batch):
        """One step of the bi-level optimization."""
        # Update architecture parameters on validation data
        self.arch_optimizer.zero_grad()
        val_loss = self._compute_loss(*val_batch)
        val_loss.backward()
        self.arch_optimizer.step()
        # Update network weights on training data
        self.weight_optimizer.zero_grad()
        train_loss = self._compute_loss(*train_batch)
        train_loss.backward()
        self.weight_optimizer.step()
        return train_loss.item(), val_loss.item()

    def derive_architecture(self):
        """Extract a discrete architecture from the continuous relaxation."""
        architecture = []
        for node_alphas in self.model.alphas:
            node_ops = []
            for alpha in node_alphas:
                # Keep the single strongest operation per edge
                best_op_idx = alpha.argmax().item()
                node_ops.append(best_op_idx)
            architecture.append(node_ops)
        return architecture
```
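The heart of DARTS is the continuous relaxation: each edge outputs a softmax-weighted sum of all candidate operations, and deriving the discrete architecture keeps the argmax. A framework-free numeric sketch, with scalar "op outputs" standing in for feature maps:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def mixed_op(alpha, op_outputs):
    """Continuous relaxation: the edge output is the softmax-weighted
    sum of every candidate op's output."""
    w = softmax(alpha)
    return sum(wi * oi for wi, oi in zip(w, op_outputs))

alpha = [2.0, 0.0, -2.0]   # learned architecture logits for one edge
outputs = [1.0, 5.0, 9.0]  # what each candidate op produced
mixed = mixed_op(alpha, outputs)
# Discretization step: keep only the strongest op on the edge
chosen = max(range(len(alpha)), key=alpha.__getitem__)
```

Because the mixed output is differentiable in `alpha`, gradient descent can push weight toward good operations; the final argmax step is where the known discretization gap of DARTS comes from.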
Efficient Evaluation Strategies
Weight Sharing (One-Shot NAS)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SuperNet(nn.Module):
    """
    One-shot supernet containing all candidate operations.
    All architectures share weights.
    """
    def __init__(self, search_space, channels):
        super().__init__()
        self.search_space = search_space
        # Create every candidate operation once; sampled paths reuse them
        self.ops = nn.ModuleDict()
        for op_name in search_space.operations:
            self.ops[op_name] = self._make_op(op_name, channels)

    def forward(self, x, architecture):
        """Forward pass along one specific path through the supernet."""
        for layer_config in architecture:
            op_name = layer_config['op']
            x = self.ops[op_name](x)
        return x

    def sample_and_forward(self, x):
        """Sample a random architecture and forward through it."""
        arch = self.search_space.sample_random()
        return self.forward(x, arch), arch


class OneShotTrainer:
    def __init__(self, supernet, train_loader):
        self.supernet = supernet
        self.train_loader = train_loader
        self.optimizer = torch.optim.SGD(
            supernet.parameters(), lr=0.1
        )

    def train_epoch(self):
        """Train the supernet by sampling a different path per batch."""
        for images, labels in self.train_loader:
            # Sample a random architecture
            arch = self.supernet.search_space.sample_random()
            # Forward with the sampled architecture
            outputs = self.supernet(images, arch)
            loss = F.cross_entropy(outputs, labels)
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

    def search(self, num_samples=1000, val_loader=None):
        """Search for the best architecture using the trained supernet."""
        self.supernet.eval()
        best_arch = None
        best_acc = 0
        for _ in range(num_samples):
            arch = self.supernet.search_space.sample_random()
            # Evaluate this path on the validation set (shared weights)
            acc = self.evaluate(arch, val_loader)
            if acc > best_acc:
                best_acc = acc
                best_arch = arch
        return best_arch, best_acc
```
Predictor-Based Evaluation
```python
import torch
import torch.nn as nn


class PerformancePredictor(nn.Module):
    """Predict architecture performance without training it."""
    def __init__(self, encoding_dim, operations, hidden_dim=256):
        super().__init__()
        # The op vocabulary is needed to one-hot encode architectures
        self.operations = operations
        self.encoder = nn.Sequential(
            nn.Linear(encoding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        self.predictor = nn.Linear(hidden_dim, 1)

    def forward(self, arch_encoding):
        features = self.encoder(arch_encoding)
        return self.predictor(features)

    def encode_architecture(self, architecture):
        """Convert an architecture to a flat vector representation."""
        encoding = []
        for layer in architecture:
            # One-hot encode the operation type
            op_onehot = [0] * len(self.operations)
            op_onehot[self.operations.index(layer['op'])] = 1
            encoding.extend(op_onehot)
            # Add normalized scalar features
            encoding.append(layer['depth'] / 10)
            encoding.append(layer['width'] / 512)
        return torch.tensor(encoding, dtype=torch.float)


class PredictorGuidedNAS:
    def __init__(self, search_space, predictor):
        self.search_space = search_space
        self.predictor = predictor
        # Archive of (architecture, accuracy) pairs actually evaluated
        self.archive = []

    def search(self, budget, samples_per_round=100):
        """Search using the predictor to guide exploration."""
        while len(self.archive) < budget:
            # Generate candidates
            candidates = [
                self.search_space.sample_random()
                for _ in range(samples_per_round)
            ]
            # Predict performance for every candidate (cheap)
            encodings = [
                self.predictor.encode_architecture(c)
                for c in candidates
            ]
            with torch.no_grad():
                predictions = self.predictor(torch.stack(encodings))
            # Fully evaluate only the most promising candidates
            top_indices = predictions.squeeze(-1).topk(10).indices
            for idx in top_indices.tolist():
                arch = candidates[idx]
                accuracy = self.evaluate_architecture(arch)
                self.archive.append((arch, accuracy))
            # Retrain the predictor on the growing archive
            self.update_predictor()
        best = max(self.archive, key=lambda x: x[1])
        return best
```
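The flat encoding the predictor consumes is simple to verify in isolation. A stdlib sketch of the per-layer encoding, where the op vocabulary and normalization constants mirror the class above:

```python
OPERATIONS = ['conv', 'bottleneck', 'mbconv', 'transformer']

def encode_layer(layer, operations=OPERATIONS):
    """One-hot op type plus normalized scalar features, matching the
    predictor's encode_architecture scheme."""
    onehot = [0.0] * len(operations)
    onehot[operations.index(layer['op'])] = 1.0
    return onehot + [layer['depth'] / 10, layer['width'] / 512]

# 4 one-hot slots + 2 scalars = 6 features per layer
vec = encode_layer({'op': 'mbconv', 'depth': 4, 'width': 128})
```

Keeping scalar features roughly in [0, 1] matters here: without normalization, raw channel counts would dominate the one-hot features in the MLP's first layer.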
Zero-Cost Proxies
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ZeroCostProxy:
    """Estimate architecture quality without training."""

    @staticmethod
    def synflow(model, images):
        """SynFlow: gradient flow at initialization."""
        model.zero_grad()
        # Temporarily replace weights with their absolute values so the
        # "flow" cannot cancel; remember the signs to restore later
        signs = {}
        for name, param in model.named_parameters():
            signs[name] = param.sign()
            param.data = param.abs()
        # Forward and backward
        output = model(images)
        output.sum().backward()
        # Score: sum of weight * gradient over all parameters
        score = 0
        for name, param in model.named_parameters():
            score += (param.grad * param).sum()
        # Restore the original signs
        for name, param in model.named_parameters():
            param.data = signs[name] * param.data
        return score.item()

    @staticmethod
    def grad_norm(model, images, labels):
        """Gradient norm at initialization."""
        model.zero_grad()
        output = model(images)
        loss = F.cross_entropy(output, labels)
        loss.backward()
        grad_norm = 0
        for param in model.parameters():
            if param.grad is not None:
                grad_norm += param.grad.norm()
        return grad_norm.item()

    @staticmethod
    def naswot(model, images):
        """NASWOT: Neural Architecture Search Without Training."""
        # Score based on how much ReLU activation patterns overlap
        # across the inputs in the batch
        activations = []

        def hook(module, input, output):
            activations.append((output > 0).float())

        handles = []
        for module in model.modules():
            if isinstance(module, nn.ReLU):
                handles.append(module.register_forward_hook(hook))
        with torch.no_grad():
            model(images)
        for handle in handles:
            handle.remove()
        # Kernel matrix of the binary activation codes
        K = torch.zeros(len(images), len(images))
        for act in activations:
            act_flat = act.view(len(images), -1)
            K += act_flat @ act_flat.t()
        # Score is the log-determinant of the kernel
        return torch.logdet(K).item()
```
Multi-Objective NAS
Pareto-Optimal Search
```python
class MultiObjectiveNAS:
    """Search for Pareto-optimal architectures."""
    def __init__(self, search_space, objectives):
        self.search_space = search_space
        # e.g. ['accuracy', 'latency', 'params']; by convention the
        # first objective is maximized and the rest are minimized
        self.objectives = objectives
        self.pareto_front = []

    def dominates(self, scores1, scores2):
        """True if scores1 Pareto-dominates scores2: no worse on every
        objective and strictly better on at least one."""
        better_in_one = False
        for obj, (s1, s2) in enumerate(zip(scores1, scores2)):
            if obj == 0:  # Accuracy: higher is better
                if s1 < s2:
                    return False
                if s1 > s2:
                    better_in_one = True
            else:  # Latency, params, etc.: lower is better
                if s1 > s2:
                    return False
                if s1 < s2:
                    better_in_one = True
        return better_in_one

    def update_pareto_front(self, architecture, scores):
        """Update the Pareto front with a newly evaluated architecture."""
        # Discard if dominated by an existing member
        for existing_arch, existing_scores in self.pareto_front:
            if self.dominates(existing_scores, scores):
                return
        # Remove members dominated by the new architecture
        self.pareto_front = [
            (a, s) for a, s in self.pareto_front
            if not self.dominates(scores, s)
        ]
        self.pareto_front.append((architecture, scores))

    def search(self, budget):
        """Evolutionary multi-objective search."""
        population = [
            self.search_space.sample_random()
            for _ in range(50)
        ]
        for generation in range(budget):
            # Evaluate the population on all objectives
            for arch in population:
                scores = [
                    self.evaluate_objective(arch, obj)
                    for obj in self.objectives
                ]
                self.update_pareto_front(arch, scores)
            # Create the next generation
            population = self.evolve(population)
        return self.pareto_front
```
`
Practical Applications
EfficientNet-Style Search
```python
def efficientnet_search():
    """Search for optimal compound-scaling coefficients."""

    def objective(coefficients, target_flops):
        depth, width, resolution = coefficients
        # Build model with these coefficients
        model = EfficientNetScaled(depth, width, resolution)
        # Compute actual FLOPs
        flops = compute_flops(model)
        # Evaluate accuracy (expensive)
        accuracy = train_and_evaluate(model)
        # Penalize going over the FLOPs target
        penalty = max(0, flops - target_flops) * 0.1
        return -accuracy + penalty

    # Grid search over the compound coefficient phi. The bases alpha,
    # beta, gamma are themselves fixed by a small grid search under
    # the constraint alpha * beta**2 * gamma**2 ~= 2.
    best_coefficients = None
    best_score = float('inf')
    for phi in [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]:
        alpha = 1.2   # depth scaling base
        beta = 1.1    # width scaling base
        gamma = 1.15  # resolution scaling base
        depth = alpha ** phi
        width = beta ** phi
        resolution = gamma ** phi
        score = objective([depth, width, resolution], target_flops=500e6)
        if score < best_score:
            best_score = score
            best_coefficients = [depth, width, resolution]
    return best_coefficients
```
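The compound-scaling arithmetic can be checked directly: FLOPs grow roughly as depth x width^2 x resolution^2, so with the bases above (1.2 * 1.1^2 * 1.15^2 is about 1.92, close to 2) the cost roughly doubles for each unit of phi. A quick sketch:

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """EfficientNet-style compound scaling: one coefficient phi scales
    depth, width and resolution together. FLOPs grow approximately as
    depth * width**2 * resolution**2, i.e. roughly 2**phi here."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    flops_multiplier = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops_multiplier

d, w, r, f = compound_scale(phi=1.0)  # f is close to 2x base FLOPs
```

This is why a single grid search over phi suffices once the bases are fixed: each step along phi trades a predictable FLOPs budget for balanced growth in all three dimensions.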
Conclusion
Neural Architecture Search has transformed how we design neural networks, moving from intuition-driven manual design to principled automated optimization. While computationally expensive, advances in weight sharing, predictors, and zero-cost proxies have made NAS increasingly practical.
Key takeaways:
- Search spaces define possibilities: Cell-based vs network-level trade different coverage for efficiency
- Search strategies vary: RL, evolution, and differentiable methods each have strengths
- Evaluation is the bottleneck: Weight sharing and predictors dramatically reduce cost
- Multi-objective optimization: Real deployments need to balance accuracy, speed, and size
- Hardware-aware search: Include target hardware constraints from the start
- Zero-cost proxies: Enable rapid architecture ranking without training
As compute resources grow and search methods improve, NAS will continue to push the boundaries of what’s possible in neural network design, potentially discovering architectures that no human would have conceived.