
Fine-tuning, distillation, prompt engineering, and multi-model architecture — the technical playbook for building SLM-powered agents that actually work in production.
Why this article exists
There is no shortage of articles arguing that small language models are the future of agentic AI. The NVIDIA Research position paper from mid-2025 made a compelling case: SLMs are powerful enough, operationally more suitable, and 10–30× more economical than frontier LLMs for the repetitive, specialized tasks that agents actually perform. The research community agrees. The economics agree. The benchmarks agree.
What is missing is the how. Most coverage stops at the thesis — SLMs are good for agents — and never gets into the engineering work required to make them good for your agents. Fine-tuning a 7B model on a consumer GPU is one thing. Getting that model to reliably produce valid JSON tool calls at 50 requests per second inside a multi-model orchestration pipeline is another.
This article covers the technical depth that practitioners need. It is organized around three areas: specializing SLMs through fine-tuning and distillation, engineering prompts for models that do not tolerate ambiguity, and designing the multi-model architectures that put it all together.
1. Specializing SLMs: fine-tuning and distillation
The goal is to take a general-purpose small model — say, a Llama 3.1 8B or Qwen2.5 7B — and turn it into a narrow specialist that outperforms a frontier model on your specific agentic task. There are two complementary paths: fine-tuning on task-specific data, and distilling knowledge from a larger teacher model. In practice, the best results come from combining both.
1.1 Parameter-efficient fine-tuning: LoRA and QLoRA in depth
Full fine-tuning of a 7B model requires 100–120GB of VRAM. LoRA (Low-Rank Adaptation) reduces this by freezing the base model and inserting small trainable matrices into its linear layers. You train 0.5–2% of the parameters. The adapter files weigh megabytes, not gigabytes, and can be swapped at inference time without reloading the base model.
QLoRA takes this further: it quantizes the frozen base to 4-bit NormalFloat (NF4) before applying LoRA, cutting memory by another 75%. Paged optimizers spill state to CPU during spikes. The net effect: a 7B fine-tune fits on an 8GB GPU. A 13B model fits on a 24GB RTX 4090.
The quality tradeoff is well-characterized at this point. LoRA recovers 90–95% of full fine-tuning performance. QLoRA hits 80–90%. For agentic tasks — which are narrow, well-defined, and reinforced by constrained decoding — this gap is typically negligible.
The setup
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
from datasets import load_dataset
import torch

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"

# --- Quantization config ---
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",  # if supported
)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token

# --- LoRA config ---
lora_config = LoraConfig(
    r=16,                         # rank: 16 for most tasks, 32-64 for heavy domain shift
    lora_alpha=32,                # scaling = alpha/r
    target_modules="all-linear",  # 2025+ consensus: all linear layers, not just attn
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # prints the count directly; typically ~0.5-1.5% of total params
Hyperparameters that matter
The community has converged on stable defaults through extensive experimentation in 2025:
Rank (r). Start at 16. This handles most instruction-following, tool-use, and classification tasks. Move to 32–64 only when your task involves a significant domain shift — teaching a model an entirely new vocabulary or reasoning pattern it barely encountered in pretraining. Research from late 2025 (arXiv:2512.15634) confirmed that intermediate ranks offer the best capacity-stability balance, marking a shift from earlier “r=8 is enough” guidance.
Target modules. The older practice of targeting only q_proj and v_proj (query and value projections in attention) has been superseded. Applying LoRA to all linear layers — target_modules="all-linear" — captures more behavioral surface area. The parameter overhead is modest (from ~0.3% to ~1.5% of total), and the quality improvement is consistent.
Learning rate. 2e-4 with cosine decay and 3–5% warmup steps. This is robust across model families. Lower (1e-4) if you see instability; rarely need to go higher.
Optimizer. paged_adamw_32bit for QLoRA setups. The paging prevents OOM crashes when optimizer state exceeds GPU memory during gradient accumulation.
Batch size and gradient accumulation. Effective batch size of 16–32 works well. On a single GPU with limited memory, use per_device_train_batch_size=2 with gradient_accumulation_steps=8 to reach an effective 16.
training_args = TrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.001,
    optim="paged_adamw_32bit",
    bf16=True,
    logging_steps=10,
    save_strategy="steps",
    save_steps=100,
    max_grad_norm=0.3,
)
Building the training dataset
This is where most projects succeed or fail, and it has nothing to do with model architecture. The quality of your fine-tuning data is the single most consequential variable.
For agentic fine-tuning, your dataset should contain structured input-output pairs that mirror the exact distribution of tasks your agent will encounter. There are three main sources:
Production traces from your existing LLM agent. If you are already running an agent on a frontier API, you have a goldmine of training data. Instrument every model call to log the prompt, the output, and whether the output was accepted (used by the downstream system) or rejected (triggered a retry or fallback). Filter for accepted outputs only. This gives you a dataset of “what the big model does well” — which is exactly what you want the small model to learn.
Manually curated examples. For new tasks without production data, human-written examples are the highest-quality starting point. The bar for “enough” is lower than most people think: 1,000–5,000 high-quality examples is the minimum viable dataset for most agentic tasks. The key is coverage of edge cases and output format diversity, not sheer volume.
Synthetic data from a teacher model. This bridges the gap when you need more volume or diversity than manual curation can provide. Have a frontier model generate variations of your inputs — paraphrased queries, adversarial edge cases, domain-shifted examples — and then generate the corresponding outputs. We will cover this in depth in the distillation section.
The formatting matters. Structure your examples in the chat template that your base model expects:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a tool-calling agent. When the user describes an action, output a JSON tool call. Schema: {\"tool\": \"string\", \"params\": {\"key\": \"value\"}}"
    },
    {
      "role": "user",
      "content": "Send an email to [email protected] about the Q3 report"
    },
    {
      "role": "assistant",
      "content": "{\"tool\": \"send_email\", \"params\": {\"to\": \"[email protected]\", \"subject\": \"Q3 Report\", \"body\": \"Hi Maria, please find the Q3 report attached.\"}}"
    }
  ]
}

Common failure modes in training data:
- Inconsistent formatting. If 80% of your examples use {"tool": "..."} and 20% use {"function": "..."}, the model will randomly switch between them. Normalize ruthlessly.
- Unbalanced class distribution. If your agent handles 10 task types but 60% of your examples are classification tasks, the model will be biased toward classification and underperform on the others. Upsample rare categories or use stratified sampling.
- Including failed traces. Only train on outputs that were actually correct and useful. Including retried or fallback outputs teaches the model to fail in the same way the big model failed.
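The checks above can be partially automated before training. A minimal sketch, assuming an illustrative record shape with a `task_type` field and an `accepted` flag from your trace logging (both names are hypothetical):

```python
import json
from collections import Counter

def clean_dataset(records):
    """Filter and normalize raw agent traces before fine-tuning.
    Assumed record shape (illustrative):
    {"messages": [...], "task_type": str, "accepted": bool}
    """
    kept = []
    for rec in records:
        # Only train on accepted outputs, never on retried/fallback traces
        if not rec.get("accepted"):
            continue
        out = rec["messages"][-1]["content"]
        try:
            call = json.loads(out)
        except json.JSONDecodeError:
            continue  # drop unparseable outputs
        if not isinstance(call, dict):
            continue
        # Normalize ruthlessly: one key convention everywhere
        if "function" in call and "tool" not in call:
            call["tool"] = call.pop("function")
        rec["messages"][-1]["content"] = json.dumps(call)
        kept.append(rec)
    return kept

def class_counts(records):
    """Per-task-type counts, so rare categories can be upsampled."""
    return Counter(r["task_type"] for r in records)
```

Run `class_counts` before and after cleaning: a skewed distribution here is exactly the unbalanced-class problem described above.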
Merging and deployment
After training, you have two options for inference:
Keep the adapter separate. Load the quantized base model + LoRA adapter at serve time. This lets you swap adapters per task (more on this in Section 3). Frameworks like vLLM and TGI support LoRA adapter loading natively.
Merge into the base model. For maximum inference speed and simplicity:
from peft import PeftModel

# Load the base in bf16 (not 4-bit): adapters must merge into unquantized weights
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16
)
lora_model = PeftModel.from_pretrained(base_model, "./checkpoints/best")

# Merge LoRA weights into base model
merged = lora_model.merge_and_unload()
merged.save_pretrained("./merged-model")
tokenizer.save_pretrained("./merged-model")

# Now quantize for deployment (AWQ or GPTQ)
# and serve with vLLM, llama.cpp, or TensorRT-LLM
Merging eliminates the (already negligible) adapter overhead at inference time. The merged model can then be further quantized for production deployment.
1.2 Distillation: teaching the student to think like the teacher
Fine-tuning teaches the model what to output. Distillation teaches it why — by transferring not just correct answers, but the probability distributions and reasoning patterns behind them.
The spectrum of distillation approaches
Hard-label distillation (the simplest form) is what we described in the fine-tuning section: generate outputs from a teacher model, use them as training targets. The student sees only the final token sequence. This is essentially supervised fine-tuning on teacher-generated data, and for many agentic tasks, it is sufficient.
Soft-label distillation transfers the full probability distribution over the vocabulary at each token position. When the teacher assigns 55% to “approve,” 30% to “escalate,” and 15% to “defer,” the student learns the relative similarity between these actions — information that hard labels destroy. The training loss combines standard cross-entropy with a KL divergence term:
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    """
    Combined distillation + supervised loss.
    student_logits, teacher_logits: [batch, seq_len, vocab]
    labels: [batch, seq_len]
    temperature: higher = softer distributions, more knowledge transfer
    alpha: balance between distillation and supervised loss
    """
    vocab_size = student_logits.size(-1)
    # Soft targets from teacher
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence on soft distributions, scaled by T^2 to keep gradients comparable
    distill_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (temperature ** 2)
    # Standard cross-entropy on hard labels (flatten to [batch*seq, vocab])
    supervised_loss = F.cross_entropy(
        student_logits.view(-1, vocab_size), labels.view(-1)
    )
    return alpha * distill_loss + (1 - alpha) * supervised_loss
Soft-label distillation requires access to the teacher’s logits, which means you need to run the teacher locally or use an API that exposes token probabilities. This limits your teacher choice to open-weight models or APIs with logprobs support.
Reasoning chain distillation is the approach that DeepSeek-R1 made famous in January 2025. Instead of distilling only the final answer, you distill the full chain-of-thought reasoning trace. The teacher generates step-by-step reasoning for each example, and the student is trained to reproduce the entire chain — including the intermediate planning and self-correction steps.
This is particularly relevant for agentic tasks that involve multi-step tool use. If your agent needs to (1) parse a request, (2) decide which tool to call, (3) construct the parameters, and (4) validate the output, distilling the full reasoning chain gives the student a much stronger foundation than distilling only the final tool call.
# Reasoning chain distillation: training data format
reasoning_example = {
    "messages": [
        {"role": "system", "content": "Think step by step before calling a tool."},
        {"role": "user", "content": "Cancel order #12345 and notify the customer"},
        {"role": "assistant", "content": """
<think>
The user wants two actions: cancel an order and send a notification.
I need to:
1. Call cancel_order with order_id "12345"
2. After confirmation, call send_notification to the customer
Let me start with the cancellation.
</think>
{"tool": "cancel_order", "params": {"order_id": "12345"}}"""}
    ]
}
The synthetic data pipeline
The practical bottleneck in distillation is not the training algorithm — it is the data. You need a large, diverse, high-quality dataset of teacher outputs on your task distribution. Synthetic data generation is how you get there.
The pipeline:
1. Start with seed examples. 200–500 real examples from production or manual curation.
2. Augment inputs with the teacher. Ask the frontier model to generate paraphrased, edge-case, and adversarial variations of your seed inputs.
AUGMENTATION_PROMPT = """Given this example input for a {task_type} task:

<example>{seed_input}</example>

Generate 10 diverse variations that cover:
- Different phrasings of the same intent
- Edge cases (ambiguous inputs, missing information, conflicting instructions)
- Domain-shifted examples (different products, different customer segments)

Output ONLY the variations, one per line."""

3. Generate teacher outputs. Run the frontier model on all augmented inputs with your production prompt template. This is a batch job — cost it accordingly, but remember it is a one-time expense.
4. Filter aggressively. Not every teacher output is usable. Validate that outputs parse correctly (valid JSON, correct schema), are semantically correct (spot-check a sample), and represent the quality level you want the student to learn. Discard 10–30% — this is normal and healthy.
5. Balance the dataset. Ensure reasonable coverage across task types, input lengths, and edge cases. Upsample underrepresented categories.
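Steps 4 and 5 can be sketched in a few lines. The required schema keys and the category grouping below are illustrative, not a fixed format:

```python
import json
import random

REQUIRED_KEYS = {"tool", "params"}  # illustrative schema

def filter_teacher_outputs(pairs):
    """Step 4: keep only teacher outputs that parse and match the schema.
    Discarding 10-30% of raw generations is normal."""
    kept = []
    for inp, out in pairs:
        try:
            obj = json.loads(out)
        except json.JSONDecodeError:
            continue
        if not isinstance(obj, dict) or not REQUIRED_KEYS <= obj.keys():
            continue
        kept.append((inp, out))
    return kept

def upsample(examples_by_category, seed=0):
    """Step 5: duplicate examples from rare categories up to the size
    of the largest category."""
    rng = random.Random(seed)
    target = max(len(v) for v in examples_by_category.values())
    balanced = []
    for examples in examples_by_category.values():
        balanced.extend(examples)
        if len(examples) < target:
            balanced.extend(rng.choices(examples, k=target - len(examples)))
    return balanced
```

Semantic spot-checking still requires a human (or a second model pass); the code only catches structural failures.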
The resulting dataset — typically 5,000–50,000 examples for most agentic tasks — becomes your distillation training set. Apply QLoRA fine-tuning as described in Section 1.1, using this dataset.
Combining fine-tuning and distillation
The strongest results come from a two-phase approach:
Phase 1: distillation. Train the SLM on teacher-generated data to build a broad foundation for the task. This gives the model the “shape” of correct behavior.
Phase 2: fine-tuning on real data. Further train on your curated production traces or manually labeled examples. This sharpens the model on the actual distribution it will see in production and corrects any artifacts from the teacher.
This two-phase pattern mirrors how practitioners at scale (Red Hat, AT&T, and others presenting at GTC 2025) described their production workflows: distill first for coverage, then fine-tune for precision.
2. Prompt engineering for Small Language Models
Prompting a 7B model is a fundamentally different discipline than prompting GPT-4 or Claude. Smaller models have narrower effective context windows, less implicit world knowledge, weaker instruction-following on novel formats, and lower tolerance for ambiguity. Every prompt design decision that you can get away with on a frontier model — vague instructions, long preambles, implicit format expectations — will degrade SLM performance.
This section covers the specific techniques that make SLM prompting reliable in agentic systems.
2.1 The context budget
A frontier model with 128K tokens of context can absorb a messy prompt, a verbose system message, a dozen few-shot examples, and a long input document — all at once, with quality barely degrading. An SLM running on an A10G with a practical effective context of 4K–8K tokens (longer is technically supported, but quality degrades and latency increases quadratically with attention length) demands that every token earn its place.
Think of it as a budget. For a 4096-token context window with a 200-token output:
| Component | Token budget |
| --- | --- |
| System instruction | 100–200 |
| Task schema | 50–100 |
| Few-shot examples | 200–500 |
| Input document | 2,500–3,000 |
| Output headroom | 200–500 |
If your input documents are longer than ~3,000 tokens, you need a chunking strategy or a pre-processing step (summarization, extraction) before the SLM sees them. This is not optional — it is an architectural constraint that shapes your agent design.
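A minimal chunking sketch, using a crude whitespace token estimate (swap in the model's actual tokenizer for production; the budget numbers mirror the table above):

```python
def chunk_for_budget(text, max_tokens=3000, overlap_tokens=200):
    """Split a long input into overlapping chunks that respect the
    SLM's input budget. Whitespace splitting is a rough stand-in for
    real tokenization."""
    words = text.split()
    if len(words) <= max_tokens:
        return [text]
    chunks, start = [], 0
    step = max_tokens - overlap_tokens  # overlap preserves context across boundaries
    while start < len(words):
        chunks.append(" ".join(words[start:start + max_tokens]))
        start += step
    return chunks
```

Each chunk then gets its own SLM call, with a downstream step (merge, vote, or summarize) recombining the results.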
2.2 Prompt structure patterns
Pattern 1: schema-first prompting
The most reliable pattern for tool-calling and structured output tasks. Define the output schema before the task description, so the model “knows where it is going” from the first token of generation.
<schema>
{"tool": "string", "params": {"to": "string", "subject": "string", "body": "string"}}
</schema>
<rules>
- Output ONLY a single JSON object matching the schema above
- All string values must be non-empty
- Do not include any text before or after the JSON
</rules>
<input>
Send an email to the engineering team about tomorrow's standup being moved to 3pm
</input>
Why this works better than natural language instructions: the schema acts as a structural prior. The model begins generation already “aligned” with the expected output shape. Combined with constrained decoding (JSON mode in vLLM, XGrammar, or Outlines), this approach achieves near-100% format compliance.
Pattern 2: negative examples
SLMs benefit disproportionately from being told what not to do. Frontier models can infer boundaries from positive examples alone. Smaller models need explicit guardrails.
<correct>
{"tool": "search_orders", "params": {"customer_id": "C-4521", "status": "pending"}}
</correct>
<incorrect>
I'll help you search for pending orders for customer C-4521. Here's the tool call:
{"tool": "search_orders", "params": {"customer_id": "C-4521", "status": "pending"}}
</incorrect>
<why>
Never include conversational text. Output ONLY the JSON object.
</why>
The <incorrect> + <why> pattern reduces the most common SLM failure mode in agentic contexts: wrapping valid structured output in conversational filler that breaks JSON parsing.
Pattern 3: decision boundary prompting
For classification and routing tasks, explicitly define the decision boundaries rather than relying on the model to infer them from examples.
<task>Classify the customer intent.</task>
<categories>
BILLING: Questions about charges, invoices, payment methods, refunds, pricing
TECHNICAL: Bugs, errors, crashes, performance issues, how-to for product features
ACCOUNT: Login issues, password reset, profile changes, subscription management
SHIPPING: Delivery status, tracking, address changes, shipping options
OTHER: Anything that does not clearly fit the above categories
</categories>
<rules>
- If the message mentions BOTH billing and technical issues, classify as BILLING
- If uncertain between two categories, prefer the one listed first
- Respond with ONLY the category name, nothing else
</rules>
<input>{message}</input>
The explicit tie-breaking rules (“prefer the one listed first”) eliminate a class of non-determinism that plagues SLM classification. Without them, borderline cases oscillate between categories across runs.
2.3 Few-shot engineering
The conventional wisdom — “more examples = better” — inverts for SLMs. Here is what actually works:
1–3 examples, not 5–10. Each example consumes 100–300 tokens. Three examples at 200 tokens each consume 600 tokens — 15% of a 4K context. The marginal quality improvement from a fourth example rarely justifies the context cost.
Choose examples that define boundaries, not centroids. Your few-shot examples should cover the edges of each category, not the obvious center. An example of a billing question that looks like it could be technical is more valuable than a straightforward “How much does it cost?”
Use structured formatting consistently. Every example should follow the exact same format — same XML tags, same field order, same whitespace conventions. SLMs are sensitive to format variation in ways that frontier models are not. A single inconsistently formatted example can degrade performance on all subsequent inputs.
<example>
<input>My app keeps freezing when I try to pay</input>
<output>TECHNICAL</output>
<note>Even though payment is mentioned, the core issue is app behavior</note>
</example>
<example>
<input>You charged me twice last month</input>
<output>BILLING</output>
</example>
<example>
<input>Can I change my delivery to a different address before it ships?</input>
<output>SHIPPING</output>
</example>
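Consistency is easiest to guarantee by rendering every shot from data through a single template, rather than hand-writing each one. A sketch (the template structure matches the examples above):

```python
EXAMPLE_TEMPLATE = """<example>
<input>{input}</input>
<output>{output}</output>
</example>"""

def render_few_shot(examples):
    """Render every shot through one template so tags, field order,
    and whitespace are identical across examples."""
    return "\n".join(
        EXAMPLE_TEMPLATE.format(input=e["input"], output=e["output"])
        for e in examples
    )
```

Storing the shots as data also makes it trivial to swap boundary examples in and out and measure the effect.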
2.4 Prompt templates as software artifacts
In production agent systems, prompts are not strings — they are versioned, tested, and deployed like any other code artifact. This is not a metaphor. It is a literal engineering practice.
# src/prompts/intent_classifier.py
from typing import Literal

IntentCategory = Literal["BILLING", "TECHNICAL", "ACCOUNT", "SHIPPING", "OTHER"]

SYSTEM_TEMPLATE = """<task>Classify the customer intent.</task>
<categories>
BILLING: charges, invoices, payment methods, refunds, pricing
TECHNICAL: bugs, errors, crashes, performance, how-to
ACCOUNT: login, password, profile, subscription
SHIPPING: delivery, tracking, address, shipping options
OTHER: does not fit above
</categories>
<rules>
- Overlapping billing+technical: classify as BILLING
- Uncertain: prefer earlier category
- Output ONLY the category name
</rules>"""

def build_prompt(message: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_TEMPLATE},
        {"role": "user", "content": f"<input>{message}</input>"},
    ]

def parse_response(raw: str) -> IntentCategory:
    """Parse and validate model output."""
    cleaned = raw.strip().upper()
    valid = {"BILLING", "TECHNICAL", "ACCOUNT", "SHIPPING", "OTHER"}
    if cleaned not in valid:
        raise ValueError(f"Invalid category: {cleaned}")
    return cleaned

# tests/test_intent_classifier.py
import pytest
from src.prompts.intent_classifier import build_prompt, parse_response
from src.inference import run_slm  # your inference wrapper

CASES = [
    ("I was charged twice for my subscription", "BILLING"),
    ("The app crashes when I upload a photo", "TECHNICAL"),
    ("How do I reset my password?", "ACCOUNT"),
    ("Where is my package?", "SHIPPING"),
    ("My payment fails every time I open the checkout page", "TECHNICAL"),  # boundary case
]

@pytest.mark.parametrize("message,expected", CASES)
def test_intent_classification(message, expected):
    prompt = build_prompt(message)
    raw = run_slm(prompt)
    result = parse_response(raw)
    assert result == expected

def test_parse_rejects_invalid():
    with pytest.raises(ValueError):
        parse_response("I think this is a billing issue")
This pattern gives you three things that ad-hoc prompting does not: regression testing before deployment, clear separation of prompt logic from application logic, and a versioned history of what changed and when. When your SLM starts producing unexpected outputs after a model update or prompt tweak, you know exactly where to look.
2.5 SLM-specific failure modes
Understanding how small models fail differently from large ones lets you design prompts that prevent the most common issues:
Conversational bleed. The model wraps structured output in natural language (“Sure! Here’s the JSON: {…}”). Fix: explicit negative examples + constrained decoding.
Schema hallucination. The model invents fields not in the schema, or uses wrong field names. Fix: schema-first prompting + strict JSON schema validation.
Instruction decay over long inputs. Instructions at the beginning of the prompt lose influence as the input grows. Fix: repeat critical instructions both before and after the input, or use a post-input reminder tag: <reminder>Output ONLY JSON</reminder>.
Few-shot overfitting. The model copies surface patterns from examples (e.g., always using the same email address from the example). Fix: use placeholder tokens in examples (<CUSTOMER_EMAIL>) rather than realistic values.
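Two of these fixes combine naturally into one prompt-builder. A sketch, where the placeholder mapping and the example values are purely illustrative:

```python
PLACEHOLDERS = {
    "[email protected]": "<CUSTOMER_EMAIL>",  # illustrative mapping: real value -> token
    "C-4521": "<CUSTOMER_ID>",
}

def build_agent_prompt(system, few_shot, user_input):
    """Replace realistic values in few-shot examples with placeholder
    tokens, and repeat the critical instruction after the input."""
    for real, token in PLACEHOLDERS.items():
        few_shot = few_shot.replace(real, token)
    return (
        f"{system}\n{few_shot}\n"
        f"<input>{user_input}</input>\n"
        "<reminder>Output ONLY JSON</reminder>"
    )
```

The post-input reminder costs a handful of tokens and counters instruction decay; the placeholder substitution keeps the model from copying example values verbatim into real outputs.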
3. Multi-model agent architecture
The most effective agentic systems in 2026 are not monolithic. They are heterogeneous: multiple models, each chosen for its operational profile, orchestrated by a routing layer that directs each subtask to the most cost-effective model capable of handling it.
This is not a theoretical pattern. It is the architecture that production agent systems converge on once they outgrow the “one model does everything” phase.
3.1 The architectural pattern
┌──────────────────────┐
│   Incoming Request   │
└──────────┬───────────┘
│
┌──────────▼───────────┐
│ Router │
│ (rules / classifier │
│ / confidence gate) │
└──┬────┬────┬────┬────┘
│ │ │ │
┌────────▼┐ ┌▼────▼┐ ┌▼────────┐ ┌──────────────┐
│ SLM-A │ │SLM-B │ │ SLM-C │ │ Frontier LLM │
│ classify│ │parse │ │validate │ │ reason/plan │
│ <1B │ │ 3B │ │ 7B │ │ 200B+ │
│ ~5ms │ │~15ms │ │ ~30ms │ │ ~500ms+ │
└────┬────┘ └──┬───┘ └────┬────┘ └──────┬───────┘
│ │ │ │
└─────────┴──────────┴─────────────┘
│
┌──────────▼───────────┐
│  Response Assembly   │
└──────────────────────┘
The principle is straightforward: match model capability to task complexity. SLMs handle the predictable, high-volume work. Frontier models handle the rare, complex reasoning. The router decides.
3.2 The LLM-to-SLM migration algorithm
The NVIDIA paper proposes a systematic conversion path. Here is how it translates to practice:
Step 1: Instrument. Add logging to every LLM call in your existing agent: the full prompt, the output, latency, token count, and whether the output was accepted or triggered a retry. Run this for 2–4 weeks to build a representative dataset.
Step 2: Cluster. Group the logged interactions by task similarity. You can use simple heuristics (group by system prompt or tool being called) or embedding-based clustering. You will typically find that 70–85% of interactions fall into 5–10 well-defined task clusters.
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

# Embed all logged prompts
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(logged_prompts)

# Cluster into task groups
kmeans = KMeans(n_clusters=8, random_state=42)
labels = kmeans.fit_predict(embeddings)

# Analyze each cluster
for cluster_id in range(8):
    cluster_prompts = [p for p, l in zip(logged_prompts, labels) if l == cluster_id]
    print(f"\nCluster {cluster_id}: {len(cluster_prompts)} interactions")
    print(f"Sample: {cluster_prompts[0][:200]}")
    # Assess: Is this repetitive? Narrow? Schema-bounded?
Step 3: Evaluate replaceability. For each cluster, ask: Is the task repetitive and predictable? Does it have a well-defined output schema? Does it require broad world knowledge? Is the output variability low? High-volume clusters with narrow scope and structured outputs are your migration candidates.
Step 4: Specialize. For each candidate cluster, fine-tune (and optionally distill) an SLM as described in Section 1. Use the logged successful outputs from that cluster as training data.
Step 5: Route. Build the routing layer that directs incoming requests to the appropriate specialist.
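The Step 3 questions can be turned into a crude scoring heuristic. A sketch, where the per-cluster statistic names and the thresholds are all illustrative assumptions, not measured values:

```python
def replaceability_score(cluster):
    """Score one task cluster against the Step 3 criteria.
    `cluster` is a dict of per-cluster statistics (field names and
    thresholds illustrative); higher = stronger SLM migration candidate."""
    score = 0
    if cluster["volume_share"] >= 0.05:       # high volume: worth the fine-tuning effort
        score += 1
    if cluster["has_output_schema"]:          # well-defined output schema
        score += 1
    if cluster["distinct_tools"] <= 2:        # narrow, repetitive scope
        score += 1
    if cluster["avg_output_tokens"] <= 300:   # low output variability
        score += 1
    if not cluster["needs_world_knowledge"]:  # no broad world knowledge required
        score += 1
    return score  # 0-5; clusters scoring 4+ are the obvious candidates
```

Scoring every cluster turns Step 3 from a judgment call into a ranked migration backlog.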
3.3 Router design patterns
The router is the architectural linchpin. Get it wrong and you either waste money (sending easy tasks to the frontier model) or damage quality (sending hard tasks to the SLM). Three patterns, in order of increasing sophistication:
Pattern A: rule-based routing
If your agent has a well-defined state machine — which many production agents do — simple rules suffice:
def route(task: AgentTask) -> str:
    """Route based on task type and metadata."""
    # Explicit task types
    if task.type in ("classify_intent", "extract_entities", "validate_format"):
        return "slm_7b"
    # Input length heuristic
    if task.input_tokens > 4000:
        return "frontier"  # SLM context limits
    # Default to SLM for known patterns, frontier for unknown
    if task.type in KNOWN_TASK_TYPES:
        return "slm_7b"
    return "frontier"
Pros: zero latency overhead, fully deterministic, easy to debug. Cons: requires manual maintenance, cannot handle novel task types.
Pattern B: classifier-based routing
A tiny model (even a logistic regression on TF-IDF features, or a sub-100M parameter encoder) predicts the best specialist for each input:
from sklearn.linear_model import LogisticRegression
import pickle

class TaskRouter:
    def __init__(self, model_path: str, vectorizer_path: str):
        with open(model_path, "rb") as f:
            self.classifier = pickle.load(f)
        # The TF-IDF vectorizer fitted at training time must ship with the classifier
        with open(vectorizer_path, "rb") as f:
            self.vectorizer = pickle.load(f)
        self.model_map = {
            0: "slm_classifier",
            1: "slm_extractor",
            2: "slm_validator",
            3: "frontier",
        }

    def route(self, input_text: str) -> str:
        features = self.vectorizer.transform([input_text])
        prediction = self.classifier.predict(features)[0]
        return self.model_map[prediction]
Training the router classifier on your Step 2 cluster labels takes minutes and adds <5ms latency per request. This handles novel inputs better than rules because the classifier generalizes from the training distribution.
Pattern C: confidence-based fallback
This is the pattern that production systems converge on. Route everything to the SLM first. If the output confidence is below a threshold, escalate to the frontier model.
import json
import math

async def route_with_fallback(
    task: AgentTask,
    slm: ModelClient,
    frontier: ModelClient,
    confidence_threshold: float = 0.85,
) -> ModelResponse:
    """
    SLM-first with confidence-gated fallback.
    """
    # Step 1: Try the SLM
    slm_response = await slm.generate(
        task.prompt,
        return_logprobs=True,
        max_tokens=task.max_output_tokens,
    )

    # Step 2: Compute confidence from token log-probabilities
    avg_logprob = sum(slm_response.token_logprobs) / len(slm_response.token_logprobs)
    confidence = math.exp(avg_logprob)  # convert to probability scale

    # Step 3: Accept or escalate
    if confidence >= confidence_threshold:
        return ModelResponse(
            text=slm_response.text,
            model="slm",
            confidence=confidence,
            latency_ms=slm_response.latency_ms,
        )

    # Step 4: Fallback to frontier
    frontier_response = await frontier.generate(
        task.prompt,
        max_tokens=task.max_output_tokens,
    )
    return ModelResponse(
        text=frontier_response.text,
        model="frontier",
        confidence=None,
        latency_ms=slm_response.latency_ms + frontier_response.latency_ms,
    )
The confidence threshold needs calibration. Set it too high and you escalate too often, negating the cost savings. Set it too low and you serve low-quality SLM outputs. Start at 0.85, run A/B tests against full-frontier routing, and adjust until your quality metrics (accuracy, schema compliance, user satisfaction) are within acceptable bounds.
Important caveat: raw softmax probabilities from language models are notoriously poorly calibrated. A model can be confident and wrong. For production systems, consider training a separate calibration layer — a small classifier that takes the SLM’s output, confidence scores, and task features, and predicts whether the output is actually correct. This adds another ~5ms but dramatically improves fallback precision.
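Threshold calibration itself can be a simple sweep over a labeled validation set. A sketch, assuming you have logged (confidence, was-the-output-correct) pairs from shadow traffic:

```python
def pick_threshold(validation, target_accuracy=0.95):
    """Sweep candidate thresholds over a labeled validation set of
    (confidence, was_correct) pairs. Return the lowest threshold whose
    accepted outputs meet the accuracy target, plus the implied
    escalation rate."""
    for threshold in sorted({c for c, _ in validation}):
        accepted = [ok for c, ok in validation if c >= threshold]
        if not accepted:
            break
        if sum(accepted) / len(accepted) >= target_accuracy:
            escalation_rate = 1 - len(accepted) / len(validation)
            return threshold, escalation_rate
    return 1.0, 1.0  # nothing meets the target: escalate everything
```

The lowest passing threshold minimizes escalations (and cost) at the chosen quality bar; rerun the sweep whenever the model or the traffic distribution changes.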
3.4 Adapter swapping: a fleet on a single GPU
LoRA enables a powerful pattern for multi-model architectures: maintain a single base model in GPU memory and swap adapters per task.
A Llama 3.1 8B base with five LoRA adapters (classifier, extractor, validator, formatter, summarizer) consumes barely more memory than a single model. Each adapter weighs 10–100MB versus 16GB for the base. At inference time, swapping adapters is a fast pointer operation, not a model reload.
# Conceptual example — framework-specific APIs vary
from vllm import LLM
from vllm.lora.request import LoRARequest

# Load base model once
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_lora=True,
    max_lora_rank=64,
    quantization="awq",
)

# Register adapters (vLLM identifies each by a name, an integer id, and a path)
ADAPTERS = {
    "classifier": (1, "./adapters/intent-classifier"),
    "extractor": (2, "./adapters/entity-extractor"),
    "validator": (3, "./adapters/output-validator"),
}

# Route and generate with appropriate adapter
async def generate_with_adapter(task: AgentTask) -> str:
    adapter_name = router.route(task)
    adapter = ADAPTERS.get(adapter_name)
    response = llm.generate(
        task.prompt,
        lora_request=LoRARequest(adapter_name, *adapter) if adapter else None,
    )
    return response[0].outputs[0].text
This is the most resource-efficient implementation of the heterogeneous architecture. One GPU, one base model, multiple specialists. It works today with vLLM and several other serving frameworks.
3.5 The orchestration layer
The final piece is the orchestration logic that chains subtasks together. This is typically imperative code — not another model call — because the control flow is deterministic:
async def handle_customer_request(message: str) -> str:
    """Full agent pipeline with multi-model orchestration."""
    # Step 1: Classify intent (SLM-A, <1B, ~5ms)
    intent = await classify_intent(message)

    # Step 2: Extract entities (SLM-B, 3B, ~15ms)
    entities = await extract_entities(message, intent)

    # Step 3: Route based on intent
    if intent in ("BILLING", "TECHNICAL", "ACCOUNT", "SHIPPING"):
        # Step 4a: Generate response with domain SLM (7B, ~30ms)
        response = await generate_domain_response(intent, entities)

        # Step 5: Validate output format (SLM-C, 7B, ~20ms)
        is_valid = await validate_response(response, intent)
        if is_valid:
            return response  # Total: ~70ms, ~$0.015

    # Step 4b: Fallback to frontier for complex/unknown cases
    response = await frontier_generate(message, intent, entities)
    return response  # Total: ~600ms, ~$0.10+
The total latency for the SLM path is ~70ms. The frontier fallback adds ~500ms. In the e-commerce agent example from the NVIDIA paper’s framing, 85% of requests take the fast path. The remaining 15% get frontier-quality reasoning. The system-level cost reduction is 75–85% compared to routing everything through the frontier API.
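The system-level arithmetic is worth making explicit. A sketch using the illustrative per-request figures above; the exact savings depend on the fast-path fraction, the per-path costs, and whether the failed SLM attempt is counted in the fallback path:

```python
def blended_cost(p_fast, slm_cost, frontier_cost):
    """Per-request cost when a fraction p_fast of traffic stays on the
    SLM path and the rest pays for the SLM attempt plus the frontier
    fallback."""
    return p_fast * slm_cost + (1 - p_fast) * (slm_cost + frontier_cost)

# With the illustrative figures above (~$0.015 SLM path, ~$0.10 frontier):
cost = blended_cost(0.85, 0.015, 0.10)  # ~$0.030 per request
baseline = 0.10                         # everything through the frontier
savings = 1 - cost / baseline           # roughly 70% with these particular numbers
```

Plugging in your own measured costs and escalation rate gives the business case for the migration before you commit to it.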
What comes next
This article covered the technical core: how to specialize SLMs through fine-tuning and distillation, how to engineer prompts that work within their constraints, and how to architect multi-model systems that combine the best of both worlds.
The underlying thesis remains: the small model is not a compromise. For the repetitive, schema-bounded, latency-sensitive tasks that constitute 70–85% of agentic workloads, it is the better engineering choice. The work is in proving that for your specific system — and this article gives you the tools to do it.
Daniel Braz is CTO at BRQ Product & Experience Studios in São Paulo, with 25+ years in technology. He writes about agentic software engineering, specification-driven development, and the systems-level challenges of deploying AI in production. His current research focuses on context degradation in agentic systems and the role of human judgment in AI-assisted development workflows.
Training Small Language Models (SLMs) for agentic systems: a practitioner’s guide was originally published in Level Up Coding on Medium, where people are continuing the conversation by highlighting and responding to this story.