The Harness Is The Product Now
Kristiyan Ivanov · 9 min read
Why the most important code in your AI stack isn’t the model
Here’s an uncomfortable benchmark result from late 2025: the same Claude Opus model scored 77% on a hundred-feature product spec when run through Claude Code, and 93% when run through Cursor. Same weights. Same prompt. Same week. Sixteen percentage points evaporated and reappeared based on nothing but the software wrapping the model.
Endor Labs ran the same experiment with GPT-5.5: 61.5% functional correctness inside OpenAI’s own Codex CLI, 87.2% inside Cursor. That’s a 25.7-point swing from nothing more than a runtime change. LangChain went from outside the top 30 on Terminal-Bench 2.0 to rank 5 by changing only the infrastructure around their LLM — same model, same weights. CORE-Bench saw Claude Opus jump from 42% with a minimal scaffold to 78% inside Claude Code’s full setup.
If you’ve spent the last two years arguing about which model is “smartest,” I have bad news. You’ve been measuring the wrong thing.
The frontier model is no longer where the interesting engineering happens. The harness is.
What we actually mean by “harness”
The term has been kicking around AI engineering circles since at least 2023, but it crystallized in 2025–2026 as the labs themselves…