
Inference-Time Is All You Need

By Mrunal Jadhav · Published March 26, 2026 · 11 min read · Source: Level Up Coding
Cozy kitchen counter with AI concepts as cooking items — a RAG recipe box, Structured Output baking mold, Chain-of-Thought measuring spoons, and Tool Use utensil holder — arranged around a recipe card reading “Inference-Time Is All You Need.”
Source: Image generated using Google’s Nano Banana 2

Stop training, start engineering: the modern toolkit for production LLM systems.

The era of casual prompt hacking is over. The conversation has shifted from “How do I trick the model into giving me what I want?” to “How do I engineer a system that guarantees the right output, every time?”

For too long, fine-tuning has been presented as the ultimate answer to LLM customization. More often than not, it’s a costly, time-consuming last resort that fails to solve the underlying problem. Modern LLM APIs and open-source frameworks now provide a toolkit of precision instruments — structured outputs, function calling, context engineering, and advanced reasoning — that deliver superior results with a fraction of the overhead.

Before you spend a dollar on training infrastructure, you need to master the tools that work at inference time.

From Prompt Hacking to Precision Engineering

Early interactions with LLMs felt like guesswork. We wrote elaborate prompts, stuffed them with examples, and hoped the output was usable. When that failed, the knee-jerk reaction was often to compile a dataset and fine-tune. This path is resource-intensive, demands deep technical expertise, and is often the wrong tool for the job.

Fine-tuning is a powerful technique for teaching a model a new behaviour, a specific style, or a narrow skill. But it is unreliable for injecting factual knowledge, is a maintenance nightmare for evolving use cases, and can even compromise the model’s built-in safety features. Fine-tuning is the right call for narrow, measurable, and stable tasks, but the wrong one for adding knowledge or handling broad domains. The modern LLM toolkit offers better, more direct solutions for the most common challenges.

Taming Model Output With Structured API Calls

One of the most common and frustrating reasons teams turn to fine-tuning is to force reliable, structured output. Trying to get a model to consistently generate valid JSON through prompting alone is a losing battle of regex and error handling. This is now a solved problem.

Modern LLM providers have integrated structured output modes directly into their APIs. Instead of asking for JSON, you now declare that you expect JSON.

This isn’t a prompt hint; it is a guarantee. The model is constrained at the token-generation level to produce syntactically valid JSON that conforms to your requested schema. The days of parsing broken strings are over.

Here’s what this looks like with OpenAI’s json_schema mode. Setting strict=True enables constrained decoding, so the output is guaranteed to match your schema at the token level:

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Extract the user's name, email, and company from: "
                       "'John Doe from Acme Corp can be reached at [email protected].'",
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "contact_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                    "company": {"type": "string"},
                },
                "required": ["name", "email", "company"],
                "additionalProperties": False,
            },
        },
    },
)

print(response.choices[0].message.content)
# {"name": "John Doe", "email": "[email protected]", "company": "Acme Corp"}
```
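Because the payload is guaranteed to parse and to contain exactly the declared fields, downstream code can load it straight into a typed object with no defensive error handling. A minimal sketch (the ContactInfo dataclass and the raw string are illustrative stand-ins for the response above):

```python
import json
from dataclasses import dataclass

@dataclass
class ContactInfo:
    name: str
    email: str
    company: str

# The guaranteed-valid payload from the call above:
raw = '{"name": "John Doe", "email": "[email protected]", "company": "Acme Corp"}'
contact = ContactInfo(**json.loads(raw))  # no try/except for malformed JSON needed
print(contact.email)  # [email protected]
```

Newer versions of the openai SDK also offer a parse helper that accepts a Pydantic model directly, collapsing the schema definition and the parsing into one step.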

While fine-tuning can take JSON formatting accuracy from less than 5% to over 99%, these API features achieve the same reliability with zero training data. This single feature obviates one of the most common reasons for fine-tuning.

Empowering LLMs with Function Calling and Tool Use

The next level of LLM engineering is letting models call your code. Function calling, or “tool use,” allows an LLM to interact with the outside world. Instead of just generating text, the model can invoke code, query a database, or call an external API.

This capability was first introduced by OpenAI in mid-2023, and other providers have since shipped their own robust implementations — Anthropic’s tool use became generally available in May 2024, and Google has integrated similar capabilities into its Gemini models. This has transformed the LLM from a pure language processor into the reasoning engine of a larger system.

You define a set of tools (your functions) and provide their schemas to the model. When a user prompt requires an action, instead of trying to answer directly, the model generates a JSON object containing the name of the function to call and the arguments to pass. The core mechanic is the same across providers — OpenAI returns a tool_calls array on the response, Anthropic returns a tool_use content block — but the ergonomics differ enough that switching providers means rewriting your tool-handling code.

Here’s how you might define a tool that lets an LLM query your database:


```python
tools = [
    {
        "type": "function",
        "function": {
            "name": "query_metrics_db",
            "description": "Run a read-only SQL query against the metrics database",
            "parameters": {
                "type": "object",
                "properties": {
                    "sql": {
                        "type": "string",
                        "description": "A read-only SQL SELECT query",
                    },
                    "limit": {
                        "type": "integer",
                        "description": "Max rows to return (default 100)",
                    },
                },
                "required": ["sql"],
            },
        },
    }
]

# User asks: "What was our p95 latency for the /api/predict endpoint last week?"
# Model responds with:
# tool_calls=[{"function": {"name": "query_metrics_db",
#     "arguments": '{"sql": "SELECT date, p95_latency_ms FROM endpoint_metrics
#                   WHERE endpoint = \'/api/predict\'
#                   AND date >= CURRENT_DATE - INTERVAL 7 DAY", "limit": 7}'}}]

# Your code executes this against a read-only replica, returns the rows,
# and the model summarizes the results in natural language.
```

This is how you build agents. The LLM handles understanding user intent and parsing unstructured text into structured API calls. Your code handles the business logic. For specialized tasks, Google has even released dedicated open models like FunctionGemma, trained specifically for this capability.
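The other half of the loop is the dispatch code on your side: execute the call the model requested, then hand the result back as a role="tool" message. A minimal sketch, with a stubbed query_metrics_db and a hypothetical TOOL_REGISTRY mapping (a real handler would also enforce the read-only constraint before running the SQL):

```python
import json

# Stub implementation backing the query_metrics_db tool defined above.
# In production this would run against a read-only replica.
def query_metrics_db(sql: str, limit: int = 100):
    return [{"date": "2026-03-20", "p95_latency_ms": 182}][:limit]

TOOL_REGISTRY = {"query_metrics_db": query_metrics_db}

def dispatch_tool_call(tool_call: dict) -> dict:
    """Execute one model-issued tool call and package the result as the
    role="tool" message that goes back into the conversation."""
    fn = TOOL_REGISTRY[tool_call["function"]["name"]]
    args = json.loads(tool_call["function"]["arguments"])
    result = fn(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result),
    }

# Shaped like one entry of response.choices[0].message.tool_calls:
call = {
    "id": "call_1",
    "function": {
        "name": "query_metrics_db",
        "arguments": '{"sql": "SELECT 1", "limit": 1}',
    },
}
print(dispatch_tool_call(call)["content"])
# [{"date": "2026-03-20", "p95_latency_ms": 182}]
```

Appending this message to the conversation and calling the model again produces the natural-language summary described above.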

RAG Over Retraining

A common misconception about fine-tuning is that it is for teaching the model new knowledge. It is not. Fine-tuning adjusts the model’s weights to learn new behaviours — it is an inefficient and unreliable way to store facts. The correct way to give a model access to dynamic, proprietary, or domain-specific information is through Retrieval-Augmented Generation (RAG).

RAG is a simple concept: when a user asks a question, you first retrieve relevant documents from a knowledge base (often a vector database such as Pinecone, Chroma, Weaviate, or Milvus) and inject that context directly into the prompt. The LLM uses this provided information to formulate its answer. This approach has several massive advantages:

  1. Factual Accuracy: The model answers based on the provided data, dramatically reducing hallucinations.
  2. Up-to-Date Knowledge: You can update your knowledge base in real-time without ever retraining your model.
  3. Source Attribution: You know exactly which documents were used to generate an answer, allowing for citations and verifications.

A minimal RAG pipeline in practice looks something like this:

```python
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()
collection = db.get_collection("internal_docs")

def ask(question: str) -> str:
    # 1. Retrieve: find the most relevant chunks
    results = collection.query(query_texts=[question], n_results=5)
    context = "\n---\n".join(results["documents"][0])

    # 2. Generate: let the model answer grounded in retrieved context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
```
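The pipeline above assumes internal_docs is already populated. The ingestion side is a chunking pass plus a collection.add call; a minimal fixed-size chunker sketch (the 500/50 sizes are illustrative, not tuned):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50):
    """Fixed-size character chunking with overlap, so sentences straddling a
    chunk boundary still appear whole in at least one chunk."""
    chunks, start = [], 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks

doc = "p95 latency for /api/predict is tracked in endpoint_metrics. " * 40
chunks = chunk_text(doc)

# Ingestion into the same Chroma collection queried above:
# collection = db.get_or_create_collection("internal_docs")
# collection.add(ids=[f"doc-{i}" for i in range(len(chunks))], documents=chunks)
```

Production systems usually chunk on semantic boundaries (paragraphs, headings) rather than raw character counts, but the shape of the pipeline is the same.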

The performance gains are not trivial. An advanced RAG framework tested on proprietary enterprise data showed a 15% increase in Precision@5 and a 13% gain in Recall@5 compared to a baseline model. On qualitative metrics, its “Faithfulness” score jumped from 3.0 to 4.6 out of 5. RAG makes your private data accessible to the LLM without baking it into the weights.

Thinking Out Loud: Advanced Reasoning Techniques for Complex Tasks

For problems that require multi-step reasoning, a simple prompt often isn’t enough. Advanced techniques like Chain-of-Thought (CoT) prompting guide the model to break down a problem and “think” through the steps before giving a final answer.

Instead of asking for the solution directly, you ask the model to explain its reasoning process first. This simple shift can dramatically improve performance on complex logical, mathematical and symbolic reasoning tasks. In practice, implementing CoT can be as simple as modifying your system prompt.

```python
# Without CoT — model jumps straight to an answer
messages = [
    {"role": "user", "content": "Should we shard the users table? We have 50M rows, 200 writes/sec, read-heavy workload."}
]

# With CoT — model reasons through the problem first
messages = [
    {"role": "system", "content": "Think through the problem step by step before giving your recommendation. Consider trade-offs explicitly."},
    {"role": "user", "content": "Should we shard the users table? We have 50M rows, 200 writes/sec, read-heavy workload."}
]
```

Variations on this theme, like Zero-Shot CoT (“Let’s think step by step”) and self-consistency (generating multiple reasoning paths and taking a majority vote), further boost accuracy. While these methods increase latency and computational cost, they are purely inference-time techniques that unlock sophisticated problem-solving abilities without any model modifications.
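Self-consistency in particular is easy to sketch: sample several completions at a temperature above zero, extract each run's final answer, and vote. The sampling calls are stubbed out here; only the voting step is shown:

```python
from collections import Counter

def majority_vote(final_answers):
    """Self-consistency: after sampling several reasoning paths, keep the
    answer that the most paths converged on."""
    return Counter(final_answers).most_common(1)[0][0]

# Each string below would come from its own chat.completions.create call
# using the CoT prompt above at temperature ~0.7; stubbed here for brevity.
samples = ["shard", "don't shard yet", "shard", "shard", "don't shard yet"]
print(majority_vote(samples))  # shard
```

The trade-off is direct: five samples cost roughly five times the tokens, which is why this is reserved for high-stakes answers.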

When the Model Thinks For Itself

Chain-of-Thought prompting asks the model to show its work. Reasoning models internalize that process entirely — generating thousands of hidden “thinking tokens” before producing a visible response. OpenAI’s o3, Anthropic’s extended thinking on Claude, and Google’s Gemini 2.5 Pro thinking mode all ship this capability, and the results speak for themselves: on the AIME 2024 math competition, Claude 4 Sonnet scores 40.7% in standard mode and jumps to 77.3% with extended thinking. o3 hits 91.6%. No amount of prompt engineering closes that gap on a standard model.

The cost structure changes accordingly. Thinking tokens are billed as output tokens and a single complex query can burn through 10k-50k of them. An analysis found that reasoning models use roughly 30x more energy than standard ones. This makes routing essential in production — a lightweight classifier that sends only genuinely hard queries (multi-step reasoning, math, complex code generation) to a reasoning model while routine queries hit a standard model. This can cut your inference bill drastically without sacrificing quality where it matters.
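A router does not need to be sophisticated to start paying off. This toy heuristic stands in for the trained classifier described above; the trigger words and model names are illustrative, not a recommendation:

```python
# Toy stand-in for a trained query router: keyword and length heuristics
# decide which model tier a query hits.
REASONING_TRIGGERS = ("prove", "step by step", "optimize", "debug", "derive")

def route(query: str) -> str:
    q = query.lower()
    looks_hard = any(t in q for t in REASONING_TRIGGERS) or len(q.split()) > 60
    return "o3" if looks_hard else "gpt-4o-mini"  # reasoning vs. standard tier

print(route("What's our refund policy?"))         # gpt-4o-mini
print(route("Prove this query plan is optimal"))  # o3
```

In production, the heuristic is replaced by a small trained classifier, but the wiring stays the same: classify first, then pick the model name passed to the completion call.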

Picking an Orchestration Framework

These techniques — structured output, function calling, RAG, and CoT — are the building blocks of modern LLM applications. To assemble them into coherent systems, orchestration frameworks like LangChain, Mirascope, and Haystack handle the plumbing: chaining LLM calls with data-retrieval steps, tool executions, and state management. They allow you to build complex, multi-step agents in a modular and reusable way. Instead of writing brittle, monolithic scripts, you can design clear pipelines:

  1. User query comes in.
  2. A retrieval step (RAG) fetches relevant data.
  3. An LLM call with a CoT prompt synthesizes the data and plans the next actions.
  4. If needed, a tool-use step executes a function call.
  5. A final LLM call generates the user-facing response.

Choosing a framework depends on how much abstraction you want. The key is to adopt a component-based mindset, building systems from these well-defined blocks.
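Framework or not, the five steps above reduce to a short pipeline. In this sketch every component is a stub standing in for a real LLM or vector-store call, so only the wiring is shown:

```python
def retrieve(query):                      # step 2: RAG fetch (stub)
    return ["chunk: p95 latency is tracked in endpoint_metrics"]

def llm(system, user):                    # stand-in for a chat completion
    return "recommendation based on: " + system.splitlines()[0]

def maybe_run_tool(plan):                 # step 4: tool use, if the plan asks
    return None                           # no tool needed in this stubbed run

def answer(query):                        # step 1: user query comes in
    context = "\n".join(retrieve(query))
    plan = llm("Think step by step over the context below.\n" + context, query)  # step 3: CoT
    tool_result = maybe_run_tool(plan)
    grounding = tool_result if tool_result is not None else plan
    return llm("Write the final user-facing answer from: " + grounding, query)   # step 5

print(answer("What was p95 latency last week?"))
```

Swapping any stub for a real implementation (a Chroma query, an OpenAI call, a tool dispatcher) does not change the shape of the pipeline, which is the component-based mindset in practice.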

Composing the Toolkit

The question isn’t which technique to use. In production, the answer is almost always several at once. A system that retrieves documents from a vector store, enforces structured output on the response, and routes complex queries to a reasoning model while simple ones hit a standard model — that’s not an advanced architecture. That’s table stakes. The best-performing AI systems aren’t bigger models; they are the better-composed ones. AlphaCode 2 did not beat competitive programmers by scaling a single model — it composed generation, filtering, clustering, and scoring into a system that reached the 85th percentile on Codeforces.

These techniques do not just coexist — they reinforce each other. Structured output makes RAG pipelines more reliable because you get a validated JSON object with source citations and confidence scores, not a blob of unstructured text you have to parse after the fact. RAG makes Chain-of-Thought more trustworthy because the model reasons over retrieved evidence instead of its own parametric memory. Tool use makes structured output more powerful because function calls return typed data that feeds cleanly into the next step. Each layer tightens the guarantees of the one that follows it.
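The structured-output-over-RAG pattern, for instance, is just a schema that forces the model to return citations alongside its answer. A sketch of such a response_format payload (the field names are illustrative):

```python
# A response_format payload forcing the model to return its answer together
# with the ids of the retrieved chunks it actually relied on.
grounded_answer_schema = {
    "type": "json_schema",
    "json_schema": {
        "name": "grounded_answer",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "answer": {"type": "string"},
                "citations": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "ids of the context chunks the answer relies on",
                },
            },
            "required": ["answer", "citations"],
            "additionalProperties": False,
        },
    },
}
```

Passing this as the response_format in the generation step of a RAG pipeline means the caller always receives a validated citations list to check against the retrieved chunks, never free text to parse.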

Fine-tuning composes too — but it solves a different class of problem. RAFT is a good example: it fine-tunes models specifically to work better with retrieved documents, training them to cite verbatim from relevant chunks and ignore distractor documents mixed into the context. The result isn’t a replacement for RAG — it’s a model that’s better at the generation in RAG. Similarly, teams fine-tune small embedding models to sharpen retrieval quality — a Databricks study showed a 7% accuracy boost from just 6,300 training samples on a consumer GPU. Others fine-tune lightweight classifiers as query routers, deciding which requests need a reasoning model and which don’t — vLLM’s semantic router cut latency by 47% and token usage by 48% with this approach. In each case, fine-tuning earns its place as one more composable block, sharpening a specific component rather than trying to replace the system around it.

The systems that work in production are the ones that compose these blocks predictably. The discipline isn’t picking the right tool. It’s wiring them together so the guarantees stack: structure, grounding, action, reasoning — each layer doing its job so the next one can do its own.


