The Harness Is The Product Now
Kristiyan Ivanov · 9 min read
Why the most important code in your AI stack isn’t the model
Here’s an uncomfortable benchmark result from late 2025: the same Claude Opus model scored 77% on a hundred-feature product spec when run through Claude Code, and 93% when run through Cursor. Same weights. Same prompt. Same week. Sixteen percentage points evaporated and reappeared based on nothing but the software wrapping the model.
Endor Labs ran the same experiment with GPT-5.5: 61.5% functional correctness inside OpenAI’s own Codex CLI, 87.2% inside Cursor. That’s a 25.7-point swing from nothing more than a runtime change. LangChain went from outside the top 30 on Terminal-Bench 2.0 to rank 5 by changing only the infrastructure around their LLM — same model, same weights. CORE-Bench saw Claude Opus jump from 42% with a minimal scaffold to 78% inside Claude Code’s full setup.
If you’ve spent the last two years arguing about which model is “smartest,” I have bad news. You’ve been measuring the wrong thing.
The frontier model is no longer where the interesting engineering happens. The harness is.
What we actually mean by “harness”
The term has been kicking around AI engineering circles since at least 2023, but it crystallized in 2025–2026 as the labs themselves…