
High-Frequency AI Based Trading on Crypto in 2026

By Vlad Benkovskyi (codefather.dev) · Published May 3, 2026 · 71 min read

A Python + Rust Polyglot, with AI/MCP in the Research Loop


Length: ~15 000 words / 60-minute read. Designed for non-linear consumption — jump to §5 (market making) or §8 (AI auto-research) if those are why you’re here.

Table of contents

  1. Why this article exists
  2. HFT 2026: the strategy taxonomy
  3. The Python + Rust polyglot
  4. Crypto microstructure 2026
  5. Market making, the deep dive
  6. Other HFT strategies on crypto
  7. Trading automation via MCP / GenAI connectors
  8. Backtesting and auto-research with AI
  9. Production concerns
  10. What classical chart-reading still teaches the algo trader
  11. The 2026 outlook
  12. Methodology, sources, reading list

TL;DR — for the reader who needs the bottom line in 60 seconds

The architecture. A small algo shop in 2026 should write the latency-sensitive path (WebSocket decode, orderbook, signal compute, order encode) in Rust and the orchestration / research / backtest / monitoring path in Python, bridged by PyO3. The polyglot stack achieves sub-millisecond tick-to-trade on commodity AWS hardware, which is competitive everywhere except the absolute tier-1 latency-arbitrage frontier where co-located C++ still wins. The mid-tier is where most new HFT firms in 2026 are actually being founded.

The strategies that pay. Cross-exchange market making with Avellaneda-Stoikov skew quoting and a hedging pipe; perpetual-spot basis with funding-rate harvesting; targeted triangular arbitrage on tier-3 venues during liquidation cascades. Pure latency arbitrage at the tier-1 level is structurally hard for a small team. Statistical arbitrage on alt-coin pairs is largely arb’d-out.

The AI part. AI agents wired through the Model Context Protocol have crossed into the research and backtest layer — as hypothesis generators and parameter screens — but not into live execution. Every agent action is a draft for human review; the kill-rate of agent-proposed strategies is a direct productivity metric.

The discipline. Walk-forward backtesting + Monte Carlo perturbation + combinatorial purged cross-validation are the antidote to data-snooping. Tail-latency monitoring at p99/p999, hard inventory caps, kill switches, and a monthly disaster-recovery drill are the antidote to operational disasters. The Hyperliquid HLP / JELLYJELLY incident in March 2025 (a roughly $13.5M unrealised loss) is the canonical 2025 lesson on cornered single-venue MMs.

The bet. The polyglot stack is durable; the venue list is not. Bet on architecture, not on a name.

§1 — Why this article exists

Three structural shifts have made the textbook HFT picture from the late 2010s obsolete for any shop that isn’t sitting on a co-located rack inside a Mahwah or Aurora datacenter.

First, crypto venues are now the primary retail-accessible HFT habitat. Binance, Bybit, OKX, Hyperliquid, dYdX v4, Aevo, Coinbase — fragmented across CEX, DEX, perp DEX, and L2, with native perpetuals, 24/7 markets, and matching engines that run from sub-millisecond on the centralised side down to one-block latency on the on-chain side. The classical equities microstructure literature still applies. The operating environment is alien to it.

Second, polyglot Python + Rust has displaced the C++ monoculture in the mid-tier. Rust handles the hot path — WebSocket decode, orderbook update, signal compute, FIX or binary egress. Python handles orchestration, ML serving, research notebooks, monitoring, and the increasingly important AI-research layer. This is not a compromise. For teams of fewer than a dozen engineers, it is the better architecture. The talent pool is wider, the dev velocity is higher, and the marginal microsecond C++ might buy you is invisible in crypto where the matching engines themselves run at 100 µs to 1 ms tick-to-trade.

Third, AI agents wired through the Model Context Protocol have crossed into the research and backtest layer. Not as executors — never as executors — but as hypothesis generators, parameter-grid screens, and literature ingesters. The kill-rate of agent-proposed strategies is the new productivity metric for a research team, and most of the workflow that used to take a quant analyst a week now takes a properly-scoped MCP agent an afternoon.

This article is a synthesis. The theory sections lean heavily on a personal trading library: Irene Aldridge’s High-Frequency Trading, her revised High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems, Ernest Chan’s Algorithmic Trading: Winning Strategies and Their Rationale and Quantitative Trading, Valeriy Zakamulin’s Market Timing With Moving Averages, Adam Grimes’s The Art and Science of Technical Analysis, Bill Williams’s Trading Chaos, John Murphy’s The Visual Investor, plus a stack of recent engineering books — Latency: Reduce Delay in Software Systems, Building Generative AI Services with FastAPI, Hands-On Machine Learning with scikit-learn and PyTorch, Machine Learning Platform Engineering, Time Series Analysis with Python Cookbook, Time Series Forecasting Using Foundation Models, Mastering Software Architecture, Architecting AI Software Systems, and Rust for Blockchain Application Development. Where the books cover a topic, citations are inline. The 2026-specific items — current fee schedules, the latest MCP spec, recent papers, the Hyperliquid March 2025 incident — come from dated internet sources and are flagged as such.

Who this article is for. Three readers, in order:

What this article is not. It is not a tutorial. It will not teach you Rust, will not walk you through your first backtest, and will not tell you that you can get rich quoting both sides of a thin alt-coin. It is a map of the durable architectural patterns hiding under the venue-of-the-month noise, and a reading list dense enough that you can spend the next year filling in the gaps from the corpus rather than from random Twitter threads.

§2 — HFT 2026: the strategy taxonomy

Strategies in the HFT zoo split into four families. The boundaries blur — a real production system usually runs two or three at once — but each family has a distinct mathematical signature and a distinct latency profile, and conflating them is the most common source of mis-architected first systems.

Latency-driven

These are the strategies the public hates. Latency arbitrage is the canonical example: the same instrument is priced fractionally differently across venues for tens of microseconds, and you race to capture it. Aldridge frames latency arbitrage as the headline example of HFT-as-controversy: in the Practical Guide she writes that “latency arbitrage is often pinpointed by the opponents of HFT as the most direct example of the technological” race being problematic for fair markets, then proceeds to show that without latency arbitrage, prices don’t converge across venues and the market is less efficient.

Three subvariants matter for crypto:

Liquidity-providing

You quote a two-sided market and earn the spread, minus inventory cost, minus adverse selection. This is market making and gets its own deep-dive in §5. The mathematics goes back to Glosten and Milgrom (1985), whose adverse-selection model showed that “in the presence of a large number of informed traders, a market maker will set unreasonably high spreads in order to break even” — a formulation paraphrased in Aldridge HFT. The Avellaneda-Stoikov 2008 paper is its operational descendant; in 2024 Stoikov published Market Making in Crypto (Stoikov et al., SSRN 5066176, accessed May 2026) which adapted the framework to crypto perpetual contracts and built it on top of the open-source Hummingbot platform.

Statistical

Cointegrated pairs, lead-lag relationships, mean-reverting baskets. In Aldridge HFT the relevant chapter defines cointegration as “a popular technique used for optimal portfolio construction, hedging, and risk management” — the “contemporaneous or lagged effect of one variable on another.” The distinction from correlation is subtle and load-bearing: two cointegrated series can have low day-to-day correlation but a stable long-run relationship that mean-reverts. Pairs trading lives or dies on whether the spread is genuinely cointegrated (Engle-Granger or Johansen test) versus merely correlated (which decays whenever volatility regimes shift).

In crypto 2026, the durable stat-arb edge has been basis — long spot, short perp, harvest the funding rate spread — rather than classical pairs trading. Pairs trading on alt-coins has been crowded out by AMM-flow noise and by the speed at which alts come and go from venue lists.
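The cointegration-versus-correlation distinction above has a cheap first-pass check: regress one leg on the other (the Engle-Granger first stage) and estimate the spread's mean-reversion half-life from an AR(1) fit. This is a rough screening sketch, not a substitute for a proper ADF or Johansen test; all names and parameters are illustrative:

```python
import numpy as np

def hedge_ratio(y: np.ndarray, x: np.ndarray) -> float:
    """OLS slope of y on x — the Engle-Granger first-stage regression."""
    beta, _intercept = np.polyfit(x, y, 1)
    return beta

def half_life(spread: np.ndarray) -> float:
    """Mean-reversion half-life from an AR(1) fit:
    Δs_t = a + b·s_{t-1} + ε  →  half-life = −ln(2)/b (needs b < 0)."""
    lag = spread[:-1]
    delta = np.diff(spread)
    b, _a = np.polyfit(lag, delta, 1)
    if b >= 0:
        return float("inf")  # no mean reversion — do not trade the pair
    return -np.log(2.0) / b

# Synthetic demo: an AR(1) spread with coefficient 0.95 → true half-life ≈ 13.9 bars.
rng = np.random.default_rng(42)
s = np.zeros(5000)
for t in range(1, 5000):
    s[t] = 0.95 * s[t - 1] + rng.normal(0, 0.1)
hl = half_life(s)
```

A half-life of a few bars is tradeable at high frequency; a half-life of days means the "pair" is just slow correlation wearing a costume.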

Predictive ML

Regressors trained on order-book microstructure features to predict next-tick mid-price; classifiers tagging market regime; reinforcement-learning policies for queue-position management. Aldridge HFT makes the point bluntly: “‘neural network’ is sometimes perceived to signal advanced […] of a high-frequency system. In reality, neural networks are built […to] simplify algorithms dealing with econometric estimation.” The book is right and the framing has aged well — neural networks in HFT are signal sources, not strategies.

The 2026 update is foundation models for time-series. Time Series Forecasting Using Foundation Models opens by saying the transformer architecture “was proposed [for natural language but] is now applied to forecasting … from a time-series forecasting point of view.” For an HFT desk in 2026, the practical question is not whether to use a transformer — the question is when a 100M-parameter transformer beats a 200-feature LightGBM on tick data. The answer in the corpus and in the recent literature is: rarely, and only with carefully designed input encodings.

A comparative fit table

| Family | Capital req | Latency req | Team-size fit | Crypto-2026 edge |
| --- | --- | --- | --- | --- |
| Latency arb (tier-1) | High | Extreme | 5+ engineers, co-lo | Shrinking |
| Latency arb (tier-3 / cross-venue) | Medium | High | 2–3 engineers | Real |
| Triangular arb | Low–Medium | High | 1–2 engineers | Tier-3 only |
| Market making (Avellaneda-Stoikov) | Medium | Medium | 2–3 engineers | Strong |
| Cross-exchange MM | Medium | High | 2–3 engineers | Strong |
| Statistical arb (basis) | Medium | Low | 1–2 engineers | Crowded but durable |
| Funding-rate arb | Medium | Low | 1 engineer | Steady |
| Liquidation hunting (perps) | High | Medium | 2–3 engineers | Real, data-heavy |
| Predictive ML signals | Medium | Medium | 2 engineers + 1 data scientist | Hard but possible |

A two-person Rust+Python shop that picks two of these families and ignores the rest will outperform a six-person shop that tries all of them. Specialisation is not glamorous but it is profitable.

§3 — The Python + Rust polyglot

The mid-tier has consolidated around a polyglot stack. The reasons are unromantic and well-documented in the corpus.

Why pure C++ lost ground

Pure C++ still wins at the very top end — the sub-microsecond co-located equities desks. It loses in the mid-tier because three things changed in the last decade. The talent pool of 25-year-olds who can ship safe production C++ has shrunk. Dev velocity in C++ is genuinely poor for a small team — what takes a week in C++ takes a day in Rust + Python. And the marginal microsecond C++ might buy you is invisible in crypto, where the matching engines themselves run at 100 µs to 1 ms.

Rust gives you the predictability — no GC pauses, no allocator stalls in the wrong moment, exhaustive pattern matching so the compiler catches the off-by-one before it reaches a fill. Python gives you the velocity — every research notebook, every backtest, every monitoring dashboard, every MCP agent is faster to write in Python than in any compiled language.

Why pure Python isn’t enough

The Python Global Interpreter Lock is the most-cited reason and one of the most-misunderstood. Building Generative AI Services with FastAPI makes the point concisely: “In Python, the CPU time can be allocated to only one task at any moment because of Python’s Global Interpreter Lock (GIL). Python’s GIL allows only one thread” to actively execute Python bytecode at any given instant. For a CRUD web service this is a non-issue — async I/O carries the load. For HFT, two consequences matter:

  1. You cannot get true CPU parallelism in pure Python. No matter how clever your asyncio is, two CPU-bound coroutines cannot run on two cores. The orderbook decoder and the signal computer both want CPU; they cannot share Python.
  2. GC pauses and allocator behaviour are uncontrolled. A dict resize in the wrong moment costs you 200 µs of jitter. The reference-counting GC stalls on cyclic deallocation. Sub-millisecond tail-latency in pure Python is not achievable in 2026.

The polyglot answer is to put everything CPU-bound or jitter-sensitive in Rust, and use Python as the conductor.

Tail-latency theory

Latency: Reduce Delay in Software Systems puts it concisely: “percentiles tail latency [is what we mean] because the percentiles are at the tail of the latency distribution. If you have measured latency, you’ve almost certainly observed the tail latency but dismissed it” as noise. This is the load-bearing intuition for HFT operators. The mean is always fine; you die on the tails. p99 and p999 are the metrics that move P&L. If you only monitor the mean, you will never see the tail latency that adversely-selects you on every trend day.

The corpus’s framework decomposes latency into named segments — network, kernel, user-space, application, wire — and assigns a budget per segment. For a crypto HFT system that wants 1 ms tick-to-trade, the budget looks roughly like this:

| Segment | Budget | Implementation |
| --- | --- | --- |
| NIC arrival → kernel | 0–5 µs | Linux io_uring; kernel-bypass not necessary at 1 ms |
| WebSocket frame decode | 5–15 µs | Rust + simd-json |
| Orderbook update | 1–5 µs | Rust, lock-free, custom |
| Signal compute | 5–30 µs | Rust, SIMD where possible |
| Strategy decision (Python branch) | 100–1000 µs | Python asyncio, called via PyO3 |
| Order encode | 5–15 µs | Rust |
| NIC send | 0–5 µs | Same as arrival |
| Total tick-to-trade | ~150–1100 µs | |

Two observations. First, the Python branch is by far the largest segment but it is not on the signal-to-quote path — quote updates happen in Rust based on the most recent Python-issued parameters. Second, none of these numbers are aspirational; they are achievable on a 2024-vintage AWS c7i instance with no kernel-bypass tricks.
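The p99/p999 monitoring the section argues for needs nothing fancier than nearest-rank percentiles over a rolling sample. A minimal sketch (the latency numbers below are synthetic, chosen to show a tight mean hiding a fat tail):

```python
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile — good enough for a latency dashboard."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(p * len(s)))
    return s[idx]

# Synthetic tick-to-trade latencies in µs: 99.9% gaussian around 200 µs,
# plus a handful of multi-millisecond stalls (GC pause, allocator stall, WS hiccup).
random.seed(7)
lat = [random.gauss(200, 20) for _ in range(9990)] + \
      [random.uniform(2000, 5000) for _ in range(10)]

p50 = percentile(lat, 0.50)
p99 = percentile(lat, 0.99)
p999 = percentile(lat, 0.999)
```

The mean and p50 look healthy; only p999 exposes the stalls — exactly the "dismissed as noise" failure mode the latency book warns about.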

Memory allocator analysis

Latency: Reduce Delay in Software Systems covers memory pools as a fundamental latency-hiding technique: “instead of allocating and deallocating memory to perform some work, you borrow memory from a memory pool. Of course, memory pools have similar challenges to thread pools, where sizing the memory pool” must match the working set. For an HFT engine, three allocator choices matter:

For a Rust HFT engine, the practical answer is: jemalloc as the global allocator, plus per-component arenas for orderbook nodes and fill records, plus bounded ring buffers for IPC. The latency book’s framework — “size the pool to your working set, not your peak” — becomes operational discipline.

PyO3 deep-dive

PyO3 is the Rust ↔ Python bridge. A minimal example:

// src/orderbook.rs
use pyo3::prelude::*;
use pyo3::exceptions::PyValueError;

#[pyclass]
pub struct OrderBook {
    bids: Vec<(f64, f64)>,
    asks: Vec<(f64, f64)>,
}

#[pymethods]
impl OrderBook {
    #[new]
    fn new() -> Self {
        Self {
            bids: Vec::with_capacity(1024),
            asks: Vec::with_capacity(1024),
        }
    }

    fn apply_delta(&mut self, side: &str, price: f64, qty: f64) -> PyResult<()> {
        if !price.is_finite() || price <= 0.0 {
            return Err(PyValueError::new_err("non-finite or non-positive price"));
        }
        let book = if side == "bid" { &mut self.bids } else { &mut self.asks };
        // merge / replace / remove logic; lock-free path elided for brevity
        Ok(())
    }

    fn microprice(&self, depth: usize) -> Option<f64> {
        if self.bids.is_empty() || self.asks.is_empty() {
            return None;
        }
        // size-weighted midprice over the top-N levels
        let sum_bids: f64 = self.bids.iter().take(depth).map(|(p, q)| p * q).sum();
        let sum_asks: f64 = self.asks.iter().take(depth).map(|(p, q)| p * q).sum();
        let qty_bids: f64 = self.bids.iter().take(depth).map(|(_, q)| q).sum();
        let qty_asks: f64 = self.asks.iter().take(depth).map(|(_, q)| q).sum();
        Some((sum_bids + sum_asks) / (qty_bids + qty_asks))
    }
}

#[pymodule]
fn ob_core(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<OrderBook>()?;
    Ok(())
}
# strategy/loop.py
import ob_core

book = ob_core.OrderBook()

async def run(ws_stream):
    async for msg in ws_stream:
        book.apply_delta(msg.side, msg.price, msg.qty)  # ~1 µs in Rust
        mp = book.microprice(5)                         # ~2 µs
        if mp is not None:
            await quote(mp)                             # Python branch

Three traps to know about:

  1. Refcount leaks across the FFI. PyO3 hands out Py<T> wrappers that the Python side reference-counts. If your Rust code stores a Py<T> and never drops it, Python's GC cannot collect — leaks accumulate slowly and surface as memory growth on day three.
  2. Panic propagation. A Rust panic across the FFI becomes a Python exception. This is correct, but it means your Rust hot-path unwrap() becomes a Python RuntimeError at the worst possible moment. Use Result exhaustively in the hot path; never unwrap.
  3. GIL hold time. Anything in #[pymethods] runs while holding the GIL. If you do CPU-heavy work, release the GIL with py.allow_threads(|| { ... }) so other Python threads can progress.

Inter-process communication

Three IPC mechanisms, with measured roundtrip latencies on a single host:

| Mechanism | Roundtrip | Throughput | When |
| --- | --- | --- | --- |
| Shared-memory ring buffer (e.g. iceoryx2, hand-rolled with memmap2) | 5–10 µs | Very high | Ingestor → strategy on same host |
| ZeroMQ (zmq Rust + pyzmq Python) | 10–50 µs | High | Strategy → executor across processes |
| gRPC | 500–2000 µs | Moderate | Control plane only — not the hot path |

The unsexy truth: most teams should start with ZeroMQ. Shared-memory is a year-three optimisation. gRPC has its place — for management endpoints, config reload signals, and metrics — but never on the trade path.
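To make the shared-memory row concrete, here is a toy single-producer/single-consumer ring over POSIX shared memory using only the standard library. It is a sketch of the mechanism, not a production IPC layer — real systems need atomic head/tail indices, cache-line padding, and backpressure; all names here are hypothetical:

```python
import secrets
import struct
from multiprocessing import shared_memory

SLOT = struct.Struct("<ddd")   # one message = (price, qty, timestamp), 24 bytes
N_SLOTS = 1024
HDR = 16                       # head (8 bytes) + tail (8 bytes)

class ShmRing:
    """Toy SPSC ring buffer over shared memory. Producer advances head,
    consumer advances tail; indices are plain ints, NOT atomic."""
    def __init__(self, name: str, create: bool):
        size = HDR + N_SLOTS * SLOT.size
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size)
        if create:
            self.shm.buf[:HDR] = bytes(HDR)  # head = tail = 0

    def _idx(self, off: int) -> int:
        return struct.unpack_from("<q", self.shm.buf, off)[0]

    def push(self, price: float, qty: float, ts: float) -> None:
        head = self._idx(0)
        SLOT.pack_into(self.shm.buf, HDR + (head % N_SLOTS) * SLOT.size, price, qty, ts)
        struct.pack_into("<q", self.shm.buf, 0, head + 1)  # publish after write

    def pop(self):
        head, tail = self._idx(0), self._idx(8)
        if tail == head:
            return None  # empty
        msg = SLOT.unpack_from(self.shm.buf, HDR + (tail % N_SLOTS) * SLOT.size)
        struct.pack_into("<q", self.shm.buf, 8, tail + 1)
        return msg

# Single-process demo: in practice the producer and consumer are two processes
# that open the same segment by name.
ring = ShmRing(f"ring_{secrets.token_hex(4)}", create=True)
ring.push(60000.5, 0.25, 1.0)
msg = ring.pop()
ring.shm.close()
ring.shm.unlink()
```

The design point: no serialisation, no syscall on the hot path — the cost per message is two struct packs and an index bump, which is why this topology wins on same-host latency.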

Async runtime selection

Tokio is the default. For aggressive thread-per-core architectures, glommio (based on Linux io_uring) gives you better tail latency at the cost of less ecosystem. The 2024 onwards trend is tokio-uring, which gives you io_uring underneath the standard tokio ergonomics. Choose tokio for almost everything; reach for glommio only when p999 matters more than dev-velocity.

Crate picks

Concrete picks with rationale, current as of 2026:

A “learn this much Rust” curriculum for the Python quant

A pragmatic minimum to be productive on a polyglot HFT stack — about three weeks of deliberate practice:

  1. Ownership, borrowing, lifetimes — the language-defining concepts (week 1)
  2. Result, Option, error propagation with ? (week 1)
  3. Tokio basics — async fn, Mutex, Notify, channel (week 2)
  4. PyO3 — exporting types, error mapping, GIL handling (week 2)
  5. Profiling — perf, flamegraphs, allocator instrumentation (week 3)

Skip macros, advanced traits, async-trait gymnastics until you have shipped something. The 80% you need is the boring 20% of the language.

Polyglot-stack failure modes

| Failure | Symptom | Cure |
| --- | --- | --- |
| GIL-induced pause in async loop | Strategy decision latency spikes p99 → p999 ratio | Move CPU-bound work into Rust; release GIL with allow_threads |
| FFI panic propagates as RuntimeError | Strategy crashes on first malformed message | Use Result exhaustively; never unwrap in hot path |
| Refcount leak across PyO3 | Memory grows over days | Audit Py&lt;T&gt; lifetimes; use weak references where possible |
| Allocator stalls | p99 latency drift over hours | Switch to jemalloc; pre-allocate pools |
| WS reconnect bug | Stale orderbook for 200 ms after disconnect | Chaos-monkey the ingestor; force gap-recovery in tests |

§4 — Crypto microstructure 2026

A crypto venue in 2026 is a textbook continuous-double-auction matching engine wrapped in three layers of API. Practical Guide describes the matching engine archetype: “a large market sell order placed earlier by algorithm B arrives at the trading venue’s matching engine. t = 12:13:01:005618: A market sell order placed earlier by algorithm C arrives” — strict price-time priority, deterministic ordering, microsecond-resolution timestamps. The physics is the same as in equities. Everything around it is venue-specific.

Fee schedules — the table that drives strategy economics

Maker-taker fees are the single biggest determinant of which strategies a venue can host. Numbers below are current as of May 2026, drawn from each venue’s published fee page; verify before sizing capital because the schedules change quarterly.

| Venue | Spot maker / taker | Perp maker / taker | Notes |
| --- | --- | --- | --- |
| Binance | 0.075% / 0.075% | 0.02% / 0.05% | BNB discount available; VIP tiers |
| Bybit | 0.10% / 0.10% | 0.02% / 0.055% | BIT discount; aggressive MM rebates at top tiers |
| OKX | 0.08% / 0.10% | 0.02% / 0.05% | VIP 8 maker fee can hit -0.01% (rebate) |
| Hyperliquid | n/a (perps focus) | 0.015% / 0.045% | No maker rebate but lowest taker among major venues; see Hyperliquid Docs — Fees |
| dYdX v4 | n/a | 0% / 0.05% (base) | Designed to attract MMs; epoch-based rebates |
| Coinbase | 0.40% / 0.60% (retail) → near-zero (Advanced) | 0.02% / 0.05% (US perps) | Tiered |

For a market-making strategy quoting a 1 bp spread, the difference between a 0.015% and a 0.10% taker fee is the entire P&L. Strategy economics are venue economics. The decision of which venues to support is a business decision, not an engineering one, and it is the single highest-leverage decision the team makes.

Order types you actually use

The marketing list of order types is ten items long. The list you actually use is six.

Tick size and lot precision

Tick size — the minimum price increment — determines whether MM strategies are viable on a given symbol. A tick of 0.01 USDT on a 60 000-USDT symbol is about 0.0017 bp; a tick of 0.1 is about 0.017 bp. The ratio of bid-ask spread to tick size determines whether you can quote profitably with a one-tick edge or need to skip levels. Most major venues use tighter ticks for higher-volume symbols and looser ticks for lower-volume ones; the practical effect is that the thin venues you’d want to MM are exactly the venues with the worst tick economics.

Lot precision is the symmetric problem on quantity. A precision of 0.001 BTC on a 60 000-USDT symbol is 60 USDT — an MM that wants to quote 30 USDT per side simply cannot.

Sequence-gap recovery

Every WebSocket market-data feed eventually gaps. The book updates arrive with sequence numbers, and when a sequence is missing, the venue specifies a recovery sub-protocol — Binance’s is different from Bybit’s, OKX’s is different again. Each protocol works correctly on the happy path. Each is broken on at least one edge case. Build a chaos-monkey for your own ingestor and force-test every gap-recovery path. The reconnect logic will be the source of your worst-ever P&L disaster if you don’t.

The pragmatic pattern: maintain a snapshot ID and a sequence counter. On gap detection, drop all open quotes within 5 ms (faster than any informed flow can pick you off), then refetch the snapshot via REST, then re-subscribe to the diff stream from the snapshot’s sequence. Anything else is theatre.
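The snapshot-plus-sequence pattern reduces to a small state machine. A sketch (class and action names are hypothetical; in production "cancel_all_quotes" is a direct call into the Rust executor, not a string):

```python
from enum import Enum, auto

class BookState(Enum):
    LIVE = auto()
    RECOVERING = auto()

class GapDetector:
    """Tracks the diff-stream sequence. On a gap: pull all quotes,
    refetch the snapshot, resubscribe from the snapshot's sequence."""
    def __init__(self):
        self.state = BookState.RECOVERING
        self.next_seq = None
        self.actions = []  # stand-in for calls into the executor

    def on_snapshot(self, seq: int) -> None:
        self.next_seq = seq + 1
        self.state = BookState.LIVE

    def on_delta(self, seq: int) -> None:
        if self.state is not BookState.LIVE:
            return  # deltas are discarded until the new snapshot lands
        if seq != self.next_seq:  # gap detected
            self.state = BookState.RECOVERING
            self.actions += ["cancel_all_quotes", "refetch_snapshot"]
            return
        self.next_seq = seq + 1

g = GapDetector()
g.on_snapshot(100)
g.on_delta(101)
g.on_delta(103)  # 102 is missing → quotes pulled, recovery begins
```

The ordering matters: cancel first, recover second. A stale book with live quotes is the disaster; a dark book with no quotes is merely downtime.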

Funding rates as a structural HFT input

Perpetual swaps have no expiry. To keep the perp price tethered to spot, venues implement a funding rate: long holders pay short holders (or vice-versa) at a fixed cadence. The standard formula is approximately:

funding_rate = clamp(premium_index + interest_rate_diff, -0.75%, +0.75%)
where premium_index = (perp_mid - spot_index) / spot_index

The cadence varies — Binance and Bybit settle every 8 hours, Hyperliquid and dYdX v4 every 1 hour. The 1-hour cadence creates more granular carry opportunities and more frequent timing-edge moments around the funding snap.

A market-neutral funding harvest — long spot + short perp when funding is positive — is one of the few reliable carry strategies in crypto. It is also crowded, so your edge has to come from latency on entry/exit and sizing into the cap. The “edge” is timing the snap, not the steady-state carry.
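The formula above is simple enough to sanity-check in a few lines. A sketch — the interest-rate component default is an assumption (0.01% per period, roughly the figure major CEXes publish), and the ±0.75% cap follows the clamp above:

```python
def funding_rate(perp_mid: float, spot_index: float,
                 interest_diff: float = 0.0001, cap: float = 0.0075) -> float:
    """Clamped funding rate per period: premium index plus the venue's
    fixed interest component (0.01% per period assumed here)."""
    premium = (perp_mid - spot_index) / spot_index
    return max(-cap, min(cap, premium + interest_diff))

def annualised_carry(rate_per_period: float, periods_per_day: int) -> float:
    """Steady-state carry of long-spot/short-perp if funding stayed put."""
    return rate_per_period * periods_per_day * 365

r = funding_rate(60030.0, 60000.0)   # perp trades 5 bp rich → positive funding
carry = annualised_carry(r, 3)       # 8-hour cadence → 3 settlements/day
```

With the perp 5 bp rich, one period pays 6 bp (premium plus interest), which annualises to roughly 66% — which is exactly why the steady-state carry never survives: capital floods in until the premium collapses, leaving only the timing edge around the snap.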

CEX-DEX latency profile

A spot trade on Binance settles in microseconds at the matching engine. A swap on Uniswap v3 settles in one Ethereum-block — in 2026 that’s 12 seconds on mainnet, 2 seconds on Base, sub-second on Arbitrum and Optimism. Bridges between chains add 1–10 minutes of finality, depending on bridge.

The implication for HFT: pure on-chain HFT is a different sport with different latency rules. Cross-CEX-DEX strategies live on the bridges. And the rise of intent-based protocols — UniswapX (Dutch-auction off-chain orders, executed by competing fillers) and CoW Swap (batch auctions with coincidence-of-wants matching, settled by competing solvers) — has shifted the on-chain liquidity model from AMM-only to AMM-plus-solver-network. UniswapX in 2026 “handles gasless single-chain swaps, cross-chain trades, and MEV-protected execution for retail wallets” (eco.com — UniswapX guide, accessed May 2026); CoW Swap “collects orders for roughly 30 seconds, bundles them into a batch, and auctions the right to settle the batch off to a network of competing solvers” (cow.fi documentation, accessed May 2026).

MEV taxonomy for the trader’s perspective

MEV is both adversary (your transaction can be sandwich-attacked) and opportunity (you can be the one extracting MEV). The taxonomy as of 2026:

In 2026 most MEV has moved off mainnet to L2s. Base (46.58% of L2 TVL) and Arbitrum (30.86%) dominate (blockeden.xyz — L2 consolidation, accessed May 2026), and on L2s the sequencer holds significant MEV power because it controls transaction ordering — which is currently centralised on most rollups.

Queue position and adverse selection in crypto

Practical Guide explains the queue mental model: “this queue can be thought of as a line for airport check-in. Unlike the airport line, however, the queue often has a finite length or capacity; therefore, any quote arrivals” beyond the cap get rejected or pushed to the next price level. The implication for crypto: you do not race to the front of every level. You race to be first to a level the market is about to revisit.

Aldridge HFT makes the adverse-selection point that goes with it: “Harris and Panchapagesan [2002] show that market makers able to fully observe the information in [the limit] order book can extract abnormal returns, or ‘pick off’ other limit-order traders” who haven’t moved. In crypto the picking-off is more aggressive than in equities because retail flow is louder, more concentrated, and travels in regime-correlated waves. A passive limit order at the front of the queue during a high-vol moment is a target.

§5 — Market making, the deep dive

This is the load-bearing strategy section. The corpus is dense here — Aldridge and the Practical Guide together are essentially a textbook on the topic, and the recent Market Making in Crypto paper by Stoikov and his coauthors provides the 2024–2026 update.

The bid-ask spread as compensation

A market maker quotes a buy and a sell. The fundamental theorem is that the spread compensates the MM for two costs: inventory holding cost and adverse-selection cost. The Glosten-Milgrom 1985 model formalised the second component; in Aldridge HFT the result is paraphrased as: “one outcome of Glosten and Milgrom (1985) is that in the pre[sence of] a large number of informed traders, a market maker will set unrea[sonably] high spreads in order to break even.” If a fraction of incoming flow is informed (i.e. trades on private information about the future price), the MM loses on every informed trade and must widen spreads against uninformed flow to compensate.

The Avellaneda-Stoikov 2008 model is the operational complement, focused on the first component (inventory cost). It assumes a representative MM with finite inventory tolerance who wants to maximize expected utility over a finite horizon. Aldridge HFT describes the result: “for fully rational, ‘risk[-averse]’ traders, the strategy of Avellaneda and Stoikov (2008) also outp[erforms] the ‘symmetric’ bid and ask strategy whereby the trader places […] ask limit orders that are equidistant from the [mid-price].” The asymmetry — quoting around a reservation price rather than the mid — is what makes the model work.

The Practical Guide states the MM’s job description in plain language: “as inventory [accumulates], the market maker begins to manage it, to reduce risk and enhance profitability. The two broad functions of a market maker are therefore: ■ Manage inventory” and ■ adversely-select less than they are adversely-selected against. Inventory management is the topic; the formula is the tool.

The Avellaneda-Stoikov derivation, walked through

The Avellaneda-Stoikov result is widely cited and infrequently derived. Walking through it makes the parameters meaningful.

Step 1 — the MM’s utility. Assume the MM has exponential utility (constant absolute risk aversion):

U(W, q) = -exp(-γ · (W + q · S))

Where W is cash, q is signed inventory (positive = long), S is mid-price, and γ > 0 is the risk-aversion parameter. Higher γ means more aversion to inventory.

Step 2 — the value function. The MM chooses bid and ask quotes δ_b and δ_a (half-spreads from mid) to maximize expected utility from now until horizon T. The value function v(t, S, W, q) satisfies a Hamilton-Jacobi-Bellman (HJB) equation:

v_t + (½σ²) v_SS + max_{δ_b, δ_a} { λ(δ_b)·[v(t, S, W − S + δ_b, q+1) − v] + λ(δ_a)·[v(t, S, W + S + δ_a, q−1) − v] } = 0

Where λ(δ) is the Poisson arrival rate of fills as a function of the half-spread (wider spread → fewer fills, exponentially decaying).

Step 3 — the reservation price. Avellaneda and Stoikov observed that under the exponential-utility ansatz, the value function factorises and reduces to a closed-form for the reservation price — the price at which the MM is indifferent between holding her current inventory and not:

r(s, q, t) = s − q · γ · σ² · (T − t)

Where s is the current mid, q is signed inventory, γ is risk aversion, σ² is mid-price variance, and T − t is time-to-horizon.

Step 4 — the optimal half-spread. The optimal half-spread (around the reservation price, not around the mid) is:

δ* = (γ · σ² · (T − t)) / 2 + (1/γ) · ln(1 + γ/k)

Where k is the order-flow intensity calibration constant.

Step 5 — the actual quotes.

quote_bid = r − δ*
quote_ask = r + δ*

The intuition. When q > 0 (long inventory), r < s — the reservation price drops below mid. Both quotes shift down. The ask becomes more attractive; the bid becomes less. The market hits the ask, your inventory drops back toward zero. When q < 0, the reverse. The model self-corrects inventory by quoting against the inventory, not by hedging it after the fact.

A concrete numerical example. Suppose s = 60 000 USDT, σ = 0.001 per tick (vol of mid in tick units), γ = 0.1, q = +5 BTC (5 BTC long), T − t = 1 hour ≈ 3 600 s, k = 1.5. Then:

σ² · (T − t) = 0.000001 · 3600 = 0.0036
r = 60 000 − 5 · 0.1 · 0.0036 · 60 000 = 60 000 − 10.8 = 59 989.2
δ* = (0.1 · 0.0036) / 2 + (1/0.1) · ln(1 + 0.1/1.5)
= 0.00018 + 10 · ln(1.0667)
= 0.00018 + 0.6453
≈ 0.6455 in tick units, ≈ 38.7 USDT in dollar terms (roughly)

So the MM quotes bid ≈ 59 950 USDT, ask ≈ 60 028 USDT — a meaningful skew in the direction of unwinding the long inventory.

Real systems use very different γ and shorter horizons. The point of the example is the direction of the shift, not the magnitude.
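The five steps above collapse into about ten lines of code. A sketch implementing the reservation-price and half-spread formulas exactly as derived — the parameter values are illustrative, not calibrated, and chosen (unlike the worked example above) to keep all quantities in plain price units:

```python
import math

def as_quotes(s: float, q: float, gamma: float, sigma2: float,
              tau: float, k: float) -> tuple[float, float, float]:
    """Avellaneda-Stoikov quotes.
    s: mid-price; q: signed inventory; gamma: risk aversion;
    sigma2: mid-price variance per unit time; tau: time to horizon T − t;
    k: order-flow intensity decay constant.
    Returns (bid, reservation_price, ask)."""
    r = s - q * gamma * sigma2 * tau                                # step 3
    half = gamma * sigma2 * tau / 2 + (1 / gamma) * math.log(1 + gamma / k)  # step 4
    return r - half, r, r + half                                    # step 5

# 5 units long → reservation price drops below mid, both quotes skew down.
bid, r, ask = as_quotes(s=60_000.0, q=5, gamma=0.001, sigma2=4.0, tau=60.0, k=1.5)
```

Running it confirms the self-correcting intuition: with q > 0 the reservation price sits below the mid, the ask is relatively aggressive, the bid relatively shy, and fills push inventory back toward zero.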

Inventory-based parameter calibration

The model has three parameters to calibrate from data: γ, σ², k. Market Making in Crypto (Stoikov et al., December 2024) walks through a calibration procedure on crypto perpetuals using the open-source Hummingbot platform. The headline contributions: an alpha signal called Bar Portion derived from candlestick data that improves directionality, and a calibration framework that estimates k from observed limit-order arrival rates rather than treating it as a free parameter. The follow-up paper "Logarithmic regret in the ergodic Avellaneda-Stoikov market making model" (arXiv:2409.02025, accessed May 2026) shows that a maximum-likelihood estimator achieves logarithmic regret bounds when learning the price-sensitivity parameter k online — meaning a properly-calibrated MM converges fast.
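The simplest version of the k-calibration idea — not the paper's estimator, just the intuition — is that if fill arrivals decay as λ(δ) = A·exp(−k·δ), then log fill-rate is linear in the quoted half-spread, and k falls out of an ordinary regression on observed (half-spread, fill-rate) pairs. A sketch on exact synthetic data:

```python
import math

def fit_intensity(deltas: list[float], rates: list[float]) -> tuple[float, float]:
    """Fit (A, k) in λ(δ) = A·exp(−k·δ) by OLS on log rates:
    log λ = log A − k·δ, so −slope is k and exp(intercept) is A."""
    n = len(deltas)
    xbar = sum(deltas) / n
    ybar = sum(math.log(r) for r in rates) / n
    sxy = sum((d - xbar) * (math.log(r) - ybar) for d, r in zip(deltas, rates))
    sxx = sum((d - xbar) ** 2 for d in deltas)
    slope = sxy / sxx
    return math.exp(ybar - slope * xbar), -slope  # (A, k)

# Synthetic fill rates generated from true A = 10, k = 1.5 (no noise).
deltas = [0.5, 1.0, 1.5, 2.0, 2.5]
rates = [10 * math.exp(-1.5 * d) for d in deltas]
A, k = fit_intensity(deltas, rates)
```

On real fills the rates are noisy Poisson counts and the online MLE of the arXiv paper is the right tool; the regression version is the whiteboard explanation of what it converges to.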

In practice for a small shop:

Cross-exchange MM — the most-profitable variant

A robust pattern in crypto: quote on the thinner venue, hedge on the deeper venue. You earn the thinner venue’s spread (which is wider, by definition) and pay the deeper venue’s taker fee for the hedge. Net of fees and latency, this is durably profitable on crypto if you pick the venue pair carefully and your hedge latency is sub-100 ms.

Concrete example: quote BTC-USDT on a tier-2 venue at 1 bp spread, hedge each fill via a market order on Binance perp at 2 bp net (taker fee + half-spread). Net edge = 0.5 bp (your half of the quoted spread) − 2 bp (full hedge cost) = −1.5 bp per fill. This loses on every trade in isolation. The trick is the fee structure: at MM tier on the thin venue you receive a −0.005% rebate (+0.5 bp per fill), and at top tier on Binance you can hedge passively at a 0.005% maker fee (≈0.5 bp all-in instead of the 2 bp taker cost). Re-do the arithmetic (0.5 + 0.5 − 0.5) and the trade goes from −1.5 bp to +0.5 bp expected value. That half-basis-point is the entire business.
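The fee arithmetic compresses into a few lines. A sketch: `cross_mm_edge_bp` is a hypothetical helper name, and the two calls reproduce the taker-hedge and rebate-plus-passive-hedge cases described above.

```python
def cross_mm_edge_bp(quoted_spread_bp: float, quote_fee_bp: float,
                     hedge_cost_bp: float) -> float:
    # Expected edge per filled quote, in basis points: you capture half the
    # quoted spread, pay (or receive, if negative) the quote-venue fee, and
    # pay the full hedge cost on the deep venue.
    return quoted_spread_bp / 2 - quote_fee_bp - hedge_cost_bp

print(cross_mm_edge_bp(1.0, 0.0, 2.0))    # taker hedge, no rebate: -1.5 bp
print(cross_mm_edge_bp(1.0, -0.5, 0.5))   # MM-tier rebate + passive hedge: +0.5 bp
```

The function is trivial on purpose: the edge of this strategy lives entirely in the fee tiers, so the calculator belongs in config review, not in the hot path.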

Toxicity filters: VPIN

VPIN (Volume-Synchronised Probability of Informed Trading) is the canonical real-time toxicity metric. Practical Guide gives the formula: “to estimate the incidence of a crash, the authors develop a volume-based probability of informed trading, or VPIN metric: VPIN ≈ (1/(nV)) · Σ_τ |V_S^τ − V_B^τ|”, where V_S^τ and V_B^τ are the sell- and buy-classified volumes within volume bucket τ of size V, summed over the n most recent buckets.

The operational effect: VPIN spikes when buy and sell volume become asymmetric within a volume window. Asymmetric volume implies one-sided informed flow, implies the MM is about to be picked off. When VPIN crosses a threshold (typically the 80th or 90th percentile of historical VPIN for the symbol), the MM should widen quotes or pull quotes entirely.

The known weakness is volume-bucketing bias: the metric depends on how you classify volume into buy/sell (Lee-Ready algorithm, BVC, tick rule), and the classification is itself imperfect. In high-frequency crypto data with frequent quote sweeps, the bucketing introduces noise. The pragmatic fix is to compute VPIN with two classifiers and pull quotes when both trigger.
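A minimal volume-bucketed VPIN, assuming trades are already classified into buy and sell volume (the classifier choice, per the caveat above, is where the noise lives; function and parameter names are mine):

```python
def vpin(buy_vols, sell_vols, bucket_size: float, n_buckets: int = 50) -> float:
    # Fill buckets of `bucket_size` total volume; VPIN is the mean of
    # |V_B - V_S| / bucket_size over the most recent n_buckets.
    imbalances = []
    vb = vs = 0.0
    for b, s in zip(buy_vols, sell_vols):
        vb += b
        vs += s
        if vb + vs >= bucket_size:       # bucket full (overflow ignored in this sketch)
            imbalances.append(abs(vb - vs) / bucket_size)
            vb = vs = 0.0
    if not imbalances:
        return 0.0
    recent = imbalances[-n_buckets:]
    return sum(recent) / len(recent)

print(vpin([10] * 100, [10] * 100, 40.0))  # balanced flow -> 0.0
print(vpin([20] * 100, [0] * 100, 40.0))   # one-sided flow -> 1.0
```

The two-classifier discipline from the text then becomes: run this once with Lee-Ready volumes and once with BVC volumes, and pull quotes only when both cross the percentile threshold.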

Toxicity filters: Kyle’s lambda

Kyle’s lambda comes from Kyle (1985). Aldridge HFT describes it: “Kyle (1985) analyzes how a single informed trader could best take[…] advantage of his information in order to maximize his profits. Kyle (1985) describes how information is incorporated” into prices through trade. The lambda — λ — is the slope of the price-impact function: how much the mid moves per unit of net order flow. Higher lambda means each unit of flow moves the price more, which means the market is being pushed by an informed trader.

In code, a rolling Kyle’s lambda regression looks like:

# rolling 60-second window
import numpy as np

def kyle_lambda(trades_df):
    # signed_volume[t] = volume × (+1 if buy, -1 if sell)
    # delta_mid[t] = mid[t+1] - mid[t]
    X = np.column_stack([
        np.ones(len(trades_df)),            # intercept column for alpha
        trades_df['signed_volume'].values,  # net signed order flow
    ])
    y = trades_df['delta_mid'].values
    # OLS: delta_mid = alpha + lambda · signed_volume
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[1])   # lambda: price impact per unit of signed volume

Production discipline: compute λ on a rolling window matching the MM strategy’s reaction time (60 s for slow quotes, 10 s for aggressive ones); when λ exceeds the historical 90th percentile, the strategy should reduce quote size, not pull entirely — Kyle’s λ is slower-moving than VPIN and false positives are costly.

Queue position economics

First in line ≠ best fill. Toxic flow hits the front of the queue first — informed traders lift the resting orders at the front in the instant before the price leaves the level. This is why “be first to a level the market is about to revisit” beats “race to the front of every level.”

The policy:

A regime classifier upstream of the MM is therefore not optional. The cheapest classifier is a 30-minute ATR ratio: if realised vol over 30 min divided by realised vol over 4 hours exceeds 1.5, you are in a regime shift; widen quotes and reduce size.
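A sketch of that vol-ratio classifier, using realised vol of log returns as a stand-in for ATR (the 1.5 threshold is the one from the text; the function name and window handling are assumptions):

```python
import numpy as np

def regime_shift(prices_fast: np.ndarray, prices_slow: np.ndarray,
                 threshold: float = 1.5) -> bool:
    # Realised vol over the recent window (e.g. 30 min of bars) vs. the long
    # window (e.g. 4 h), at the same bar frequency. Ratio above threshold
    # means a regime shift: widen quotes and reduce size.
    rv_fast = np.std(np.diff(np.log(prices_fast)))
    rv_slow = np.std(np.diff(np.log(prices_slow)))
    return bool(rv_fast / rv_slow > threshold)

rng = np.random.default_rng(42)
calm = 60_000 * np.exp(np.cumsum(rng.normal(0, 1e-4, 480)))   # baseline regime
spiky = 60_000 * np.exp(np.cumsum(rng.normal(0, 1e-3, 30)))   # vol spike
print(regime_shift(spiky, calm))   # -> True
```

The classifier is deliberately dumb: its job is to scale risk off fast, not to be right about why vol moved.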

Production code skeleton

A compact AS-quoter that ties the §3 polyglot stack to the §5 mathematics:

# strategy/avellaneda.py
import math
import asyncio
import ob_core  # the Rust extension from §3


class ASMaker:
    def __init__(self, gamma=0.1, k=1.5, horizon_s=300, max_inv=10.0):
        self.book = ob_core.OrderBook()
        self.q = 0.0
        self.gamma = gamma
        self.k = k
        self.T = horizon_s
        self.t0 = None
        self.max_inv = max_inv
        self.vpin_state = ob_core.VPINWindow(window_volume=1_000_000)
        self.lambda_state = ob_core.KyleLambda(window_seconds=60)

    def _toxic(self) -> bool:
        vpin = self.vpin_state.value()
        lam = self.lambda_state.value()
        return vpin > 0.8 or lam > self.lambda_state.p90_threshold()

    def _quotes(self, s: float, sigma2: float, t_left: float) -> tuple[float, float]:
        r = s - self.q * self.gamma * sigma2 * t_left
        delta = (self.gamma * sigma2 * t_left) / 2 + (1.0 / self.gamma) * math.log(1.0 + self.gamma / self.k)
        return r - delta, r + delta

    async def on_tick(self, msg, t_now: float, sigma2: float):
        self.book.apply_delta(msg.side, msg.price, msg.qty)
        if self.t0 is None:
            self.t0 = t_now
        s = self.book.microprice(5)
        if s is None:
            return
        # hard inventory cap - non-negotiable
        if abs(self.q) >= self.max_inv:
            await self.cancel_all()
            await self.flatten_via_taker()
            return
        # toxicity gate
        if self._toxic():
            await self.cancel_all()
            return
        t_left = max(self.T - (t_now - self.t0), 1.0)
        bid_px, ask_px = self._quotes(s, sigma2, t_left)
        await self.replace_quotes(bid_px=bid_px, ask_px=ask_px, size=self.size_for_inventory())

The hot loop (orderbook, VPIN, lambda) runs in Rust through the ob_core extension; the Python on_tick is called per market-data event but spends most of its time waiting on async I/O — the GIL is released for the duration of the awaits.

Real failure-mode case studies

  1. The MM-in-a-trend disaster. When σ² is mis-estimated (e.g. on a low-vol historical window) and the market enters a trend, the AS quoter accumulates inventory in the wrong direction faster than its skew can unwind it. The cure is twofold: a regime classifier that scales risk off on σ-spikes (not just adjusts the formula), and a hard inventory cap that cancels quotes (the max_inv branch above) rather than just adjusting them.
  2. The Hyperliquid HLP / JELLYJELLY incident, March 2025. A whale opened a short position on JELLYJELLY on Hyperliquid while simultaneously dumping the spot token on a DEX, crashing the on-chain price. Hyperliquid’s HLP (the protocol-owned market-making vault) was forced to take over the short, then the whale bought spot to squeeze the short, driving the token up by ~250%. The HLP ended up with roughly $13.5M in unrealized losses (cryptonews.com, accessed May 2026; The Block, accessed May 2026). The validators voted to delist JELLY perps; users (apart from flagged addresses) were made whole from the Hyper Foundation. The architectural lesson: a single-venue MM that cannot move inventory off-venue is a single-venue MM that can be cornered. Cross-venue MM with a hedging pipe is structurally safer.
  3. Latency-out adverse selection. Your WS feed lags by 200 ms during a vol event; you quote stale prices for 200 ms; informed flow lifts you on every quote. By the time the WS recovers your inventory is blown. Cure: stale-feed detector with millisecond resolution and an auto-cancel on staleness — the strategy should never wait for the WS to recover before pulling quotes.
  4. Regime-shift adverse selection. You calibrated γ, k, σ² on a low-vol regime; vol triples; the formula's parameters are wrong; you lose your shirt before recalibration. Cure: continuous online recalibration of σ² and k; never assume yesterday's calibration applies today.

§6 — Other HFT strategies on crypto

Statistical arbitrage

Pairs (BTC-ETH, BTC-SOL), baskets (alts versus BTC.D), or basis (perp-spot). The classical approach:

  1. Cointegration test (Engle-Granger or Johansen) to confirm a stable long-run relationship between two series. Aldridge HFT references Engle’s foundational work (Engle 1982, 2000) on time-series econometrics that underpins these tests.
  2. Estimate the spread as spread = price_A - β · price_B where β is the regression coefficient.
  3. Estimate the half-life of mean reversion using an Ornstein-Uhlenbeck fit: dS = θ(μ - S)dt + σ dW; half-life = ln(2)/θ.
  4. Trade signal: enter when |spread − μ| > k·σ_spread; exit when |spread − μ| < ε.
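Steps 2–3 compress into a short estimator. The AR(1) regression below is the standard discrete-time route to θ (a sketch; `spread` is assumed to be the β-hedged spread series at your bar frequency):

```python
import numpy as np

def ou_half_life(spread: np.ndarray) -> float:
    # Discretise dS = theta*(mu - S)dt + sigma dW as the AR(1) regression
    #   S[t+1] - S[t] = a + b*S[t] + eps, with theta = -b (dt = 1 bar),
    # then half-life = ln(2) / theta, in bars.
    ds = np.diff(spread)
    s_lag = spread[:-1]
    X = np.column_stack([np.ones_like(s_lag), s_lag])
    (_, b), *_ = np.linalg.lstsq(X, ds, rcond=None)
    return float(np.log(2) / -b)

# synthetic OU series with theta = 0.05: true half-life = ln(2)/0.05 ~= 13.9 bars
rng = np.random.default_rng(1)
s = np.zeros(5000)
for t in range(1, 5000):
    s[t] = s[t - 1] + 0.05 * (0.0 - s[t - 1]) + 0.1 * rng.standard_normal()
print(round(ou_half_life(s), 1))
```

If the estimated half-life is longer than your risk horizon, the pair is not tradeable at your frequency regardless of how pretty the cointegration test looks.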

The 2026 crypto specifics:

Triangular arbitrage

The cleanest pedagogical HFT strategy: pure math, no forecasting, three legs of IOC taker orders fired in parallel. Practical Guide defines the canonical case: “triangular arbitrage exploits temporary deviations from fair prices in three foreign exchange” pairs. In crypto the natural triangle is BTC/USDT × ETH/BTC × ETH/USDT on a single venue.

The closure condition:

edge = (BTC/USDT) × (ETH/BTC) − (ETH/USDT) ≠ 0   (after fees and slippage)

In practice on tier-1 venues (Binance, Bybit, OKX) the triangular spread is closed within milliseconds — pure latency-arb territory. Where it still pays in 2026: tier-3 venues with retail-grade matching, and on tier-1 during the chaos of a large liquidation cascade when the engine momentarily lags between the three pairs.

A compact closure code:

# strategy/triangular.py
import asyncio

async def close_triangle(venue, lots: float, fee_pct: float = 0.0002):
    # lots = BTC quantity on the first leg; the cycle is USDT -> BTC -> ETH -> USDT
    px = await venue.snapshot_top_of_book(["BTC/USDT", "ETH/BTC", "ETH/USDT"])
    btc_usdt = px["BTC/USDT"].ask   # pay the ask to buy BTC with USDT
    eth_btc = px["ETH/BTC"].ask     # pay the ask to buy ETH with BTC
    eth_usdt = px["ETH/USDT"].bid   # hit the bid to sell ETH for USDT

    implied_eth_usdt = btc_usdt * eth_btc   # cost of 1 ETH via the BTC leg
    eth_qty = lots / eth_btc                # ETH obtainable with `lots` BTC
    edge_per_eth = (eth_usdt - implied_eth_usdt) - 3 * fee_pct * eth_usdt
    if edge_per_eth <= 0:
        return None
    orders = await asyncio.gather(
        venue.send_ioc("BUY", "BTC/USDT", lots, btc_usdt),
        venue.send_ioc("BUY", "ETH/BTC", eth_qty, eth_btc),
        venue.send_ioc("SELL", "ETH/USDT", eth_qty, eth_usdt),
    )
    return orders, edge_per_eth * eth_qty

The send-three-IOCs-in-parallel pattern is the right one. Any sequenced execution gives the market 5–10 ms to close the edge against you.

Latency arbitrage

Three subvariants with concrete venue pairs:

Funding-rate arbitrage

Long spot + short perp; harvest the funding rate spread when funding is positive. The carry is funding_rate × position_size × cycles_per_day minus borrow cost on the spot leg minus basis drift.

Capacity analysis: the trade is small per pair (you’re limited by the spot venue’s lending market depth or by your own borrow capacity), but it stacks across pairs. A diversified funding-rate book across 20–30 perp symbols can be a meaningful portion of a small shop’s P&L.

The “edge” beyond the steady-state carry is timing the entry and exit around the funding snap. Funding settles at fixed times; right after settlement is the cheapest entry (premium just paid out, basis collapsed). Right before settlement is the safest exit (locked-in funding accruing, basis stable).
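The steady-state carry arithmetic fits in one function (a sketch; the rates are illustrative and 3 cycles per day is the common 8-hour funding convention):

```python
def daily_carry_usd(notional: float, funding_rate_8h: float,
                    spot_borrow_apr: float, cycles_per_day: int = 3) -> float:
    # Long spot + short perp: collect funding when positive, pay borrow
    # on the spot leg. Basis drift is P&L noise on top, not modelled here.
    funding = funding_rate_8h * notional * cycles_per_day
    borrow = spot_borrow_apr / 365 * notional
    return funding - borrow

# $100k notional, +0.01% per 8h funding, 3% APR spot borrow
print(round(daily_carry_usd(100_000, 0.0001, 0.03), 2))  # -> 21.78
```

Roughly 22 USD per day per 100k of notional: small per pair, which is exactly why the strategy only matters stacked across 20–30 symbols.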

Liquidation hunting on perps

Predictive: model the cluster of liquidations near a price level, position to capture the cascade. This is one of the few crypto-native strategies with no obvious equities analogue.

The model needs three inputs:

  1. Open interest by leverage decile — public on Hyperliquid (transparent on-chain), estimable on CEXes from forced-liquidation tape.
  2. The cascade-trigger function — what fraction of OI gets liquidated as price crosses each level.
  3. The reflexivity multiplier — how many liquidations chain into more liquidations.

The strategy: identify a “thick” liquidation cluster (say, $50M of cumulative liquidation between 58 000 and 59 000 USDT on BTC), front-run by going long below the cluster, ride the cascade up, exit before reflexivity exhausts.

This is data-heavy and capital-heavy; it does not fit a 2-engineer team. It does fit a team with on-chain data infrastructure and access to historical liquidation tape across all major venues.

Defensive: spoofing and iceberg detection

You don’t run these strategies — you defend against them. Practical Guide describes spoofing: “in spoofing, the trader intentionally distorts the order book without execution; in the process” influencing other participants’ decisions. It also describes icebergs: “iceberg orders […] allow limit-order traders to display only a portion of their order in the limit order book, and keep the” rest hidden.

For an MM, spoofing manifests as phantom liquidity that vanishes the moment you’d interact with it — quote-cancel-fill ratios spike for orders that flicker on and off. Icebergs manifest as one-sided pressure from invisible orders — fills happen at sizes larger than the displayed liquidity should allow.

Detection statistics, in production:

When detection triggers, the response is to widen quotes and reduce size — not to engage. The opposite mistake (chasing the visible price thinking it’s real liquidity) is exactly what spoofers want.
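One way to operationalise the cancel-to-fill signal (a sketch; the threshold, window, and the "short-lived order" event feed are assumptions to tune per symbol):

```python
from collections import deque

class FlickerDetector:
    # Flags likely spoofing when short-lived orders at a level cancel far
    # more often than they fill ("phantom liquidity").
    def __init__(self, threshold: float = 20.0, window: int = 200):
        self.events = deque(maxlen=window)   # "cancel" or "fill" per short-lived order
        self.threshold = threshold

    def record(self, event: str) -> None:
        self.events.append(event)

    def spoofy(self) -> bool:
        cancels = sum(1 for e in self.events if e == "cancel")
        fills = sum(1 for e in self.events if e == "fill")
        return cancels / max(fills, 1) > self.threshold

d = FlickerDetector()
for _ in range(100):
    d.record("cancel")
d.record("fill")
print(d.spoofy())   # 100 cancels to 1 fill -> True
```

When `spoofy()` trips, the response from the text applies: widen and shrink, never chase.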

ML signals as a strategy multiplier

Feature engineering for HFT is its own craft. Time Series Analysis with Python Cookbook introduces feature engineering recipes, including “detecting contextual outliers with feature engineering” as a chapter-level focus, and walks through sktime-based pipelines that combine exogenous variables and ensemble learning. The HFT-specific feature catalog is short and well-known:

Supervised learning on these features predicts next-tick mid-price changes with marginal AUC over 50%. RL approaches treat the feature set as state and learn a quoting policy — dramatically harder to train, occasionally better in production.
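A few entries from that catalog, computed at top-of-book (an illustrative sketch; the feature names are mine, and the microprice here is the standard queue-weighted one):

```python
def book_features(bid_px: float, bid_qty: float, ask_px: float, ask_qty: float) -> dict:
    # A small illustrative subset of the HFT feature catalog.
    total = bid_qty + ask_qty
    mid = (bid_px + ask_px) / 2
    microprice = (bid_px * ask_qty + ask_px * bid_qty) / total  # queue-weighted
    return {
        "spread": ask_px - bid_px,
        "imbalance": (bid_qty - ask_qty) / total,    # +1 = all bid, -1 = all ask
        "micro_minus_mid": microprice - mid,         # short-horizon drift proxy
    }

f = book_features(59_999.5, 12.0, 60_000.5, 4.0)
print(f["imbalance"], f["micro_minus_mid"])   # 0.5 0.25
```

A heavy bid queue pulls the microprice above mid, which is the marginal-AUC signal the supervised models are mostly learning.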

Foundation models for time-series are the 2024–2026 frontier. Time Series Forecasting Using Foundation Models opens by saying: “the transformer architecture was proposed [for natural language but we now] study the transformer architecture from a time-series forecasting point of view.” The pragmatic answer for HFT: foundation models are useful for regime classification and macro-feature generation, not for tick-by-tick prediction — the time-scales mismatch. A 100M-parameter transformer is not faster than LightGBM at 1 ms inference, and at HFT inference latency is half of the value.

§7 — Trading automation via MCP / GenAI connectors

Anthropic introduced the Model Context Protocol in November 2024 as an open standard for connecting LLM agents to tools and data sources. By May 2026 the spec is at version 2025-11-25 (modelcontextprotocol.io/specification/2025-11-25, accessed May 2026). The 2026 roadmap focuses on three areas: streamable HTTP transport (so MCP servers can run as remote services rather than local processes), task primitives for long-running asynchronous work, and enterprise readiness — audit trails, SSO-integrated auth, gateway behaviour, configuration portability (blog.modelcontextprotocol.io — 2026 roadmap, accessed May 2026). Official SDKs exist for Python, TypeScript, C#, Java, Kotlin, and PHP; community SDKs cover Rust and Go.

The corpus’s Hands-On Machine Learning with scikit-learn and PyTorch references the protocol directly in its agent-orchestration chapter — and names the load-bearing rule that any production deployment must adopt: “LLMs are often unreliable, so let’s keep humans in the loop for important matters, shall we?” That human-in-the-loop discipline is the single most important rule of using LLM agents in trading.

Agents are researchers, not executors

Live order placement requires deterministic policy gates and human approval. Period.

Every agent-action has to be logged with input + chain-of-thought + tool calls — every. single. one. If you can’t reproduce why an agent did something, you can’t operate it. The pattern from Machine Learning Platform Engineering of routing queries through an LLM (result = self.router_llm.invoke(routing_prompt)) is a research-time pattern; it never flows to a live order endpoint.

Reference architecture

┌──────────────────────────────────────────────────────────────┐
│ Agent (Claude Opus 4.7 / GPT-5 / Llama-X) │
│ │ │
│ │ MCP protocol (stdio or streamable HTTP) │
│ │ │
│ ├──→ exchange-data MCP (read: L2 books, trades) │
│ ├──→ on-chain MCP (read: mempool, defillama) │
│ ├──→ news/social MCP (read: filtered firehose) │
│ ├──→ knowledge-base MCP (read: corpus search) │
│ ├──→ backtest-runner MCP (read+exec: your framework) │
│ └──→ research-notes MCP (write: append-only log) │
└──────────────────────────────────────────────────────────────┘


┌────────────────────┐
│ Policy gate + │
│ human review │
└────┬───────────────┘
│ (manual approval)

┌────────────────────┐
│ Strategy engine │
│ (Rust + Python) │
└────────────────────┘

Connector inventory in detail

For each connector, the schema and side-effects matter more than the prose.

Exchange-data MCP — one server per venue. Read-only.

On-chain MCP — read-only.

News/social MCP — filtered firehose, read-only.

Knowledge-base MCP — local document corpus.

Backtest-runner MCP — read + execute on your own infrastructure.

Research-notes MCP — write-only, append-only.

Reference Python skeleton

A minimal MCP server for the exchange-data connector, using FastAPI patterns from the corpus:

# servers/exchange_data.py
from contextlib import asynccontextmanager
from collections.abc import AsyncIterator
from mcp.server import Server
from mcp.types import Tool, TextContent
from fastapi_cache import FastAPICache
from fastapi_cache.backends.redis import RedisBackend
import redis.asyncio as redis
import asyncio


server = Server("exchange-data")


@asynccontextmanager
async def lifespan(_server) -> AsyncIterator[None]:
    redis_client = redis.from_url("redis://localhost:6379")
    FastAPICache.init(RedisBackend(redis_client), prefix="exchdata-cache:")
    try:
        yield
    finally:
        await redis_client.close()


@server.list_tools()
async def tools():
    return [
        Tool(
            name="get_book",
            description="Snapshot of L2 orderbook for a symbol on a venue. Read-only.",
            inputSchema={
                "type": "object",
                "properties": {
                    "venue": {"type": "string", "enum": ["binance", "bybit", "okx", "hyperliquid"]},
                    "symbol": {"type": "string"},
                    "depth": {"type": "integer", "default": 20, "maximum": 100},
                },
                "required": ["venue", "symbol"],
            },
        ),
    ]


@server.call_tool()
async def call(name: str, args: dict):
    if name == "get_book":
        # book_cache: the Redis-backed snapshot cache initialised via lifespan()
        snapshot = await book_cache.get(args["venue"], args["symbol"], args.get("depth", 20))
        return [TextContent(type="text", text=snapshot.to_json())]
    raise ValueError(f"unknown tool: {name}")


if __name__ == "__main__":
    asyncio.run(server.run_stdio(lifespan=lifespan))

The cache pattern is from Building Generative AI Services with FastAPI, which gives the canonical install-and-configure recipe: “you can install FastAPI cache using the following command: pip install "fastapi-cache2[redis]" … configuring FastAPI cache lifespan …” The book continues with a Redis-backed lifespan manager exactly matching the snippet above. Caching is non-optional for the exchange-data MCP; agents query it ten times per second and the venue API costs would otherwise dominate the budget.

Token-budget circuit breaker

Bounded-cost agent execution is itself a discipline. The pattern:

class BoundedAgentRun:
    def __init__(self, agent, max_input_tokens=200_000, max_output_tokens=20_000):
        self.agent = agent
        self.max_in = max_input_tokens
        self.max_out = max_output_tokens
        self.consumed_in = 0
        self.consumed_out = 0

    async def step(self, prompt: str):
        # pre-flight check with a rough ~4-chars-per-token estimate, so a
        # budget-busting prompt fails before the API call, not after it
        if self.consumed_in + len(prompt) // 4 > self.max_in:
            raise RuntimeError("agent input budget exhausted")
        response = await self.agent.respond(prompt)
        self.consumed_in += response.input_tokens
        self.consumed_out += response.output_tokens
        if self.consumed_out > self.max_out:
            raise RuntimeError("agent output budget exhausted")
        return response

The error conditions are intentional: an agent that loops on a bad query burns budget. Cap per-task token spend; raise visibly when the cap is hit; never silently truncate.

Provenance manifest

Every artifact the agent produces carries a manifest that lets a human reproduce it:

{
  "hypothesis_id": "h-2026-04-12-0008",
  "agent": "claude-opus-4-7",
  "agent_model_version": "20260301",
  "temperature": 0.2,
  "ingest_sources": [
    "Aldridge, HFT 2nd ed., ch. 7 (Avellaneda-Stoikov)",
    "Aldridge, HFT 2nd ed., ch. 12 (VPIN)",
    "Stoikov et al. 2024, SSRN 5066176"
  ],
  "tool_calls": [
    {"tool": "get_klines", "args": {"venue": "binance", "symbol": "BTCUSDT"}, "ts_iso": "2026-04-12T08:14:23Z"},
    {"tool": "run_backtest", "args": {"config_hash": "sha256:abc123..."}, "ts_iso": "2026-04-12T08:17:01Z"}
  ],
  "result_summary": {"median_sharpe": 1.21, "p05_sharpe": 0.34, "kill": false},
  "human_review_status": "pending",
  "human_reviewer": null,
  "human_review_decision": null
}

Without this, an agent’s “great new strategy” is a black box. With it, you can audit, reproduce, and reject on principle.

Hard rules with reasoning

  1. No write-tools to production trading systems. Ever. Research notes only. Reason: the failure mode of an agent placing a live order is unbounded; the failure mode of an agent writing a bad note is bounded.
  2. Token-budget circuit breakers per task. Reason: an agent on a bad query will loop; without a cap, you wake up to a four-figure API bill.
  3. Provenance manifest on every artifact. Reason: the agent’s output is a draft for human review; the manifest is what makes human review possible.
  4. Sandboxing — MCP servers in separate processes/containers. Reason: the agent talks to servers over stdio or HTTP, never by sharing memory; a compromised connector cannot exfiltrate from another connector.
  5. No agent decision is ever final. Reason: this is the human-in-the-loop principle that Hands-On Machine Learning with scikit-learn and PyTorch names directly.

Anti-patterns

§8 — Backtesting and auto-research with AI

Backtesting is the most failure-prone part of any trading system. HFT-specific compounders make it worse.

Why HFT backtests are uniquely hard

A complete enumeration:

Walk-forward methodology

Market Timing With Moving Averages by Zakamulin gives the canonical recipe: “in an out-of-sample testing procedure, in-sample segment of data can be either rolling or expanding.” The two variants:

For HFT, rolling is almost always right. The market regimes that mattered in 2018 don’t apply in 2026. The window size W is the tunable parameter; pick a window long enough for parameter stability (≥ 30 days for daily-cadence strategies, ≥ 5 days for tick-cadence) and short enough to discard regime-stale data.

Monte Carlo perturbation

Once you have a walk-forward backtest, run it 1 000 times with perturbed inputs to get a distribution of Sharpe ratios rather than a point estimate. Chan’s Algorithmic Trading: Winning Strategies and Their Rationale is direct: “unlike Monte Carlo optimization, the historical returns offer insufficient data to determine an optimal leverage that works well for many realizations. Despite these caveats, brute force optimization over the backtest” remains the practical baseline once Monte Carlo perturbation is added on top.

What to perturb:

A strategy that survives 1 000 Monte Carlo paths with median Sharpe > 1 and 5th-percentile Sharpe > 0 is a strategy worth paper-trading. A strategy with median Sharpe 1.5 but 5th-percentile -0.5 is a strategy that will work on average and ruin you on the bad runs.
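A sketch of the perturbation loop itself: block bootstrap plus per-period cost jitter. The block length, jitter size, and the daily annualisation factor are assumptions, not prescriptions.

```python
import numpy as np

def mc_sharpe(returns: np.ndarray, n_paths: int = 1000,
              fee_jitter_bp: float = 0.5, block: int = 50, seed: int = 0):
    # Resample the backtest's per-period returns in blocks (preserving
    # short-range autocorrelation) and jitter per-period costs; return the
    # median and 5th-percentile annualised Sharpe across paths.
    rng = np.random.default_rng(seed)
    n = len(returns)
    sharpes = np.empty(n_paths)
    for i in range(n_paths):
        starts = rng.integers(0, n - block, size=n // block + 1)
        path = np.concatenate([returns[s:s + block] for s in starts])[:n]
        path = path - rng.normal(0.0, fee_jitter_bp * 1e-4, size=n)
        sharpes[i] = path.mean() / path.std() * np.sqrt(365)
    return float(np.median(sharpes)), float(np.percentile(sharpes, 5))

rng = np.random.default_rng(7)
daily = rng.normal(0.001, 0.01, 2000)   # toy edge: 10 bp/day, 1% daily vol
med, p05 = mc_sharpe(daily)
```

The pass/fail rule from the text maps directly onto the two returned numbers: median > 1 and 5th percentile > 0, or the strategy never leaves research.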

Combinatorial purged cross-validation

López de Prado’s combinatorial purged cross-validation (CPCV) is the state-of-the-art antidote to the bias compounders for ML-driven HFT strategies. The method “systematically constructs multiple train-test splits, purges overlapping samples, and enforces an embargo period to prevent information leakage” (towardsai.net — CPCV, accessed May 2026; foundational paper SSRN 4778909, accessed May 2026).

The mechanics: divide a time-series dataset into N sequential, non-overlapping groups that preserve temporal order. Then choose all combinations of k groups (k < N) as test sets, with the remaining N − k groups used for training. Purging removes training samples that overlap in time with test samples; embargoing enforces a no-information-flow gap immediately after test windows. The result is a distribution of performance metrics across many backtest paths, enabling the Deflated Sharpe Ratio as a rigorous test statistic.

For HFT specifically, CPCV’s main advantage over walk-forward alone is that you get many paths instead of one, so a single bad regime doesn’t doom or vindicate the entire strategy.
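The group-combinatorics part is easy to sketch. Real purging needs per-sample event horizons, so this version approximates purging with the embargo alone — an assumption worth stating:

```python
from itertools import combinations

def cpcv_splits(n_samples: int, n_groups: int = 6, k_test: int = 2, embargo: int = 0):
    # N sequential groups; every C(N, k) combination of groups becomes a test
    # set; training drops the test samples plus an embargo window after each
    # test group (approximating purging, which also needs event-overlap info).
    bounds = [(g * n_samples // n_groups, (g + 1) * n_samples // n_groups)
              for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), k_test):
        test_idx, banned = set(), set()
        for g in test_groups:
            lo, hi = bounds[g]
            test_idx.update(range(lo, hi))
            banned.update(range(lo, min(hi + embargo, n_samples)))
        train_idx = [i for i in range(n_samples) if i not in banned]
        yield train_idx, sorted(test_idx)

splits = list(cpcv_splits(600, n_groups=6, k_test=2, embargo=10))
print(len(splits))   # C(6,2) = 15 paths
```

Fifteen paths from six groups is already enough to compute a distribution of Sharpes, which is the input the Deflated Sharpe Ratio needs.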

Time-series-specific backtest pitfalls

Time Series Analysis with Python Cookbook covers the pitfalls of train-test splits with autocorrelated data — pure k-fold cross-validation on time-series is wrong because it implicitly leaks future-into-past. Time Series Forecasting Using Foundation Models extends the discussion to transformer-based forecasters, where the typical sequence-modelling tricks (random shuffling, batch construction) violate temporal causality unless explicitly handled.

The pragmatic discipline: for any time-series backtest, the only safe split is sequential. Train on the past; test on the future. Never the reverse. Never random.

The auto-research workflow

The AI part of the title. The workflow as a state machine:

┌─────────────────────────────────────────────────────────────┐
│ 1. Agent reads new paper / blog post / corpus chapter │
│ (via knowledge-base MCP) │
│ ↓ │
│ 2. Drafts a falsifiable hypothesis with provenance │
│ ("X feature on Y venue should predict Z over τ") │
│ ↓ │
│ 3. Calls backtest-runner MCP with parameter grid │
│ (single token-budget; bounded by circuit breaker) │
│ ↓ │
│ 4. Receives top-K results + sanity metrics │
│ (turnover, max consecutive losses, regime split) │
│ ↓ │
│ 5. Writes draft research note to research-notes MCP │
│ ↓ │
│ 6. Human reviews. Kill-rate is tracked. │
└─────────────────────────────────────────────────────────────┘

The kill-rate metric

The fraction of agent-proposed strategies that don’t survive walk-forward + Monte Carlo + paper-trading is the single best productivity metric for an auto-research pipeline. (“Kill-rate” is the author’s term, not a standard one.)

A healthy auto-research pipeline has a kill-rate of 90–95%. A pipeline with a kill-rate of 30% is overfitting at the agent-layer; you’re going to lose money in production. A pipeline with a kill-rate of 99% is wasting compute; tighten the agent’s hypothesis-generation prompt to filter low-quality ideas before they hit backtest.

Anti-patterns at the AI-research layer

In addition to the patterns in §7:

Foundation models for time-series — when it makes sense

A foundation transformer for time series is a useful tool but not a universal hammer. In Time Series Forecasting Using Foundation Models the framing is that “the transformer architecture was proposed [originally for natural language and] is now applied to forecasting.” The pragmatic decision rule:

A real workflow example end-to-end

A hypothetical agent run on the hypothesis “BTC perp basis × VIX is a regime-conditioned predictor of perp-spot mean reversion”:

  1. Agent ingests via kb_search("perp basis VIX") → finds chunks from Aldridge HFT on basis trades, plus a 2024 paper on cross-asset volatility-conditioned strategies.
  2. Agent drafts hypothesis: “When VIX is in its 80th percentile (high macro vol), BTC perp-spot basis mean-reverts faster than in the bottom 20th percentile.”
  3. Agent calls run_backtest(config={signal: 'basis', conditioner: 'vix_decile', mc_seeds: 1000}).
  4. Result: median Sharpe 0.8, 5th percentile -0.3, but inner quintile (40–60th) Sharpe of 1.4. Conditioning works in moderate-vol regimes; breaks in extremes.
  5. Agent writes note: “Hypothesis partially confirmed — restrict to 40–60th-percentile VIX. Recommend manual review for production scoping.”
  6. Human reviewer: accepts the conditioned version, adds a separate kill-switch on VIX > 90th percentile, deploys for paper-trading.

Without the agent, this workflow takes a quant analyst three days. With the agent, it takes 90 minutes including human review. The agent does not get to skip the human review.

§9 — Production concerns

The mathematics gets you to a paper-tradeable strategy. Production gets you to a P&L. The corpus has solid grounding here — the Latency book gives the framework, Machine Learning Platform Engineering covers the deploy and monitor stack, Mastering Software Architecture covers the patterns that hold a trading system together, and Blue Team Handbook covers the security side that nobody talks about until they’ve been compromised.

Latency budget per segment

Repeated for completeness; the per-segment table from §3 is the operational target:

Segment | Budget | Implementation
--- | --- | ---
WS frame arrival → kernel | 0–5 µs | Linux io_uring
WS frame decode | 5–15 µs | Rust + simd-json
Orderbook update | 1–5 µs | Rust, lock-free
Signal compute | 5–30 µs | Rust, SIMD where possible
Strategy decision | 100–1000 µs | Python branch via PyO3
Order encode | 5–15 µs | Rust
NIC send | 0–5 µs | Same as arrival
Total tick-to-trade | ~150–1100 µs |

For market-making strategies on crypto in 2026, sub-1-millisecond tick-to-trade is competitive. For latency arbitrage at the tier-1 level, you need < 100 µs and a co-located rack — the polyglot stack alone won’t get you there.

Co-location and cloud

The cheap version of co-location is “the same AWS region as the matching engine.” The Latency book covers the principle in its co-location chapter — same-region latency to the venue matching engine is typically 1–5 ms; cross-region is 50–200 ms. Cross-region is fatal for any latency-sensitive strategy.

The 2026 venue map:

If you trade more than one venue, you cannot be in the right region for all of them. Pick one — the dominant venue for your strategy — and accept asymmetric latency for the others.

Risk gates — the non-negotiables

Every strategy ships behind every one of these.

Kill switch at the firm level. Practical Guide references the canonical case: “a kill switch allows termination of all flow from a broker-dealer whose algorithms are determined to be corrupt. In the Knight Capital case, an execution-firm-level kill” switch was missing — and the firm lost ~$440M in 45 minutes. Implement as: a single privileged process holds the cancel-all + halt-all authority; any operator can hit it; the engine respects it within 100 ms.

Position limits per symbol and aggregate. Pre-trade-checked. Every order goes through a check that knows current position and would-be position; reject if the post-fill state exceeds the cap. Easy to implement, easy to forget the corner cases (multi-leg orders, partial fills, race conditions on simultaneous fills).
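A pre-trade check in its simplest form (a sketch; it assumes a single-threaded risk path, which is exactly the corner case the text warns about — concurrent fills need atomicity here):

```python
class PositionLimiter:
    # Pre-trade position check: reject any order whose post-fill position
    # would breach the per-symbol or aggregate cap.
    def __init__(self, per_symbol_cap: float, aggregate_cap: float):
        self.per_symbol_cap = per_symbol_cap
        self.aggregate_cap = aggregate_cap
        self.positions: dict = {}

    def check(self, symbol: str, signed_qty: float) -> bool:
        would_be = self.positions.get(symbol, 0.0) + signed_qty
        agg_others = sum(abs(p) for s, p in self.positions.items() if s != symbol)
        return (abs(would_be) <= self.per_symbol_cap
                and agg_others + abs(would_be) <= self.aggregate_cap)

    def on_fill(self, symbol: str, signed_qty: float) -> None:
        self.positions[symbol] = self.positions.get(symbol, 0.0) + signed_qty

lim = PositionLimiter(per_symbol_cap=10.0, aggregate_cap=15.0)
lim.on_fill("BTC", 8.0)
print(lim.check("BTC", 3.0))   # 8 + 3 = 11 > 10 per-symbol cap -> False
print(lim.check("ETH", 5.0))   # aggregate 8 + 5 = 13 <= 15 -> True
```

The check is on the would-be position, not the current one — the difference is exactly the partial-fill race the text calls out.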

Drawdown circuit breaker. Chan’s Quantitative Trading covers the pattern: maintain a running high-watermark of cumulative compounded returns, define drawdown at each step as the percentage shortfall from that watermark, and track drawdown duration alongside it. In production: max intra-day loss → halt strategy automatically; max 7-day drawdown → halt and require manual restart with a written incident note.
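Chan's watermark pattern in a few lines (a sketch over an equity curve in account-currency terms; the 5% trip level is illustrative):

```python
def drawdown_monitor(equity_curve, max_dd: float = 0.05):
    # Running high-watermark drawdown. Returns (drawdown, duration, halt):
    # drawdown = fractional shortfall from the watermark, duration = bars
    # spent below it, halt = True once drawdown exceeds max_dd.
    hwm = float("-inf")
    dd, duration = 0.0, 0
    for eq in equity_curve:
        if eq >= hwm:
            hwm, dd, duration = eq, 0.0, 0
        else:
            dd = 1.0 - eq / hwm
            duration += 1
    return dd, duration, dd > max_dd

dd, dur, halt = drawdown_monitor([100.0, 110.0, 99.0, 98.0])
print(round(dd, 3), dur, halt)   # 0.109 2 True
```

In production the `halt` flag feeds the automatic intra-day stop; the duration counter feeds the slower 7-day rule that requires a human restart.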

Self-trade prevention. Most CEXes implement it for you (a maker order from your account against a taker order from the same account is rejected). Verify in your own logs nonetheless.

Stale-feed detector with millisecond resolution. If WS hasn’t ticked in 200 ms, cancel all open quotes. The threshold is symbol-dependent — 200 ms is right for BTC-perp; 1 s might be right for an alt with 10 trades per minute.
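The detector itself is tiny; the hard part is wiring its trip to a cancel-all with the same urgency as the kill switch. A sketch using a monotonic clock:

```python
import time

class StaleFeedDetector:
    # Trip when the market-data feed has not ticked within threshold_ms.
    def __init__(self, threshold_ms: float = 200.0):
        self.threshold_s = threshold_ms / 1000.0
        self.last_tick = time.monotonic()

    def on_tick(self) -> None:
        self.last_tick = time.monotonic()

    def is_stale(self) -> bool:
        return time.monotonic() - self.last_tick > self.threshold_s

det = StaleFeedDetector(threshold_ms=50.0)
det.on_tick()
print(det.is_stale())        # just ticked -> False
time.sleep(0.06)
print(det.is_stale())        # 60 ms of silence -> True
```

`time.monotonic()` rather than wall-clock time matters here: an NTP step must never mask (or fake) a stale feed.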

Hot config reload and canary deploys

Strategy parameters (γ, k, position limits, fee tiers, venue selection) live in a config file the engine watches. Reload without restart. Never hand-edit live. Promote configs through canary-strategy → 1% capital → full deployment, with explicit rollback if any of the canary's metrics deviate beyond a threshold.

The discipline:

  1. Author writes config change as a PR
  2. CI runs the change against the last 5 days of historical data — does the strategy’s metric distribution shift?
  3. Canary deploy at 1% of normal capital; observe for 24 hours
  4. Promote to full if and only if metrics are within tolerance
  5. Rollback is a single command and is tested weekly
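Step 4's "within tolerance" gate can be as simple as a z-score screen on the canary's metric samples versus the baseline. An illustrative sketch (the function name and the 2-sigma default are placeholders, not a recommendation; heavier distribution tests belong in the research loop, not the deploy gate):

```python
from statistics import mean, stdev

def canary_within_tolerance(baseline, canary, z_limit=2.0):
    """Promote the canary only if its mean metric (per-hour PnL, fill
    ratio, ...) sits within z_limit standard errors of the baseline
    mean. Deliberately crude: the gate should be fast and legible."""
    if len(canary) < 2 or len(baseline) < 2:
        return False                 # not enough observations to decide
    se = stdev(baseline) / (len(canary) ** 0.5)
    if se == 0:
        return mean(canary) == mean(baseline)
    z = abs(mean(canary) - mean(baseline)) / se
    return z <= z_limit
```

Refusing to promote on insufficient data (rather than defaulting to "pass") is the conservative choice the 24-hour observation window exists to serve.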

Monitoring

Prometheus + Grafana is the boring 2026 default. Machine Learning Platform Engineering gives the canonical Helm-based install: “install Helm and use some popular Helm commands that help us install and update applications,” followed by helm repo add prometheus-community ... && helm upgrade -i prometheus prometheus-community/prometheus --namespace prometheus --create-namespace. Grafana installs the same way from its own chart repository, via helm install grafana grafana/grafana.

HFT-specific custom metrics worth adding beyond the Kubernetes basics:

| Metric | Why it matters |
| --- | --- |
| book_update_lag_microseconds{p50, p99, p999} | Detects ingestor-strategy decoupling |
| queue_position_decile{symbol} | Per resting order; mostly p9 means you're at the front of the queue, p1 at the back |
| fill_toxicity_vpin{symbol} | Rolling VPIN; alert at the p95 of its own history |
| cancel_to_fill_ratio{symbol, side} | A spike means your edge has decayed or someone is spoofing you |
| funding_pnl_realised vs expected | Detects funding-rate-arb mis-execution |
| inventory_distance_to_cap{symbol} | Early warning for one-sided inventory drift |
| latency_strategy_decision_microseconds{p50, p99, p999} | The Python branch; your single biggest variance source |

Alerts on tail percentiles, not means. The mean is always fine; the tails are where you die. The Latency book makes this point explicit and the operational discipline follows from it.
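A toy rolling tail tracker makes the point concrete: nearest-rank percentiles over a bounded window, with the alert keyed to p99/p999 rather than the mean. All names here are hypothetical; in production this is what the Prometheus histogram buckets are computing for you.

```python
from collections import deque

class TailTracker:
    """Rolling tail-percentile tracker for latency metrics. The mean
    of book_update_lag can look fine while the p999 quietly climbs
    past your decision budget."""

    def __init__(self, window: int = 10_000):
        self.samples = deque(maxlen=window)

    def observe(self, value_us: float):
        self.samples.append(value_us)

    def percentile(self, q: float) -> float:
        """q in [0, 1]; nearest-rank over the current window. An O(n
        log n) sort per query is fine for an alerting path."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        idx = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[idx]

    def breached(self, q: float, budget_us: float) -> bool:
        return self.percentile(q) > budget_us
```

The test below shows why the mean lies: 0.5 % of samples at 50x the normal latency barely move the average, but p999 flags it immediately.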

Architecture patterns

Mastering Software Architecture covers the patterns that let a trading system scale without becoming a tangled mess. The one most directly relevant to HFT is the staged pipeline.

The pragmatic application: an HFT engine should look like a pipeline of decoupled stages connected by ring buffers, not a monolithic strategy class. Stage decoupling is the difference between “fix one bug” and “rewrite half the engine.”
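The staged-pipeline idea can be sketched in a few lines of Python. This is illustrative only — a real engine runs the stages concurrently, in Rust, on lock-free buffers — but the shape is the point: each stage touches only its input and output buffer, and a full buffer drops the oldest item because, for market data, fresh beats complete.

```python
from collections import deque

class RingBuffer:
    """Bounded buffer between stages. deque(maxlen=...) evicts the
    oldest item on overflow: a strategy quoting on stale ticks is
    worse than one that skipped them."""

    def __init__(self, capacity: int):
        self.buf = deque(maxlen=capacity)

    def push(self, item):
        self.buf.append(item)

    def pop(self):
        return self.buf.popleft() if self.buf else None

def run_pipeline(ticks, stages, capacity=1024):
    """Drive items through decoupled stages (decode -> book -> signal
    -> order encode). Each stage knows only its own two buffers, so
    one stage can be rewritten, or moved to Rust, in isolation."""
    buffers = [RingBuffer(capacity) for _ in range(len(stages) + 1)]
    for t in ticks:
        buffers[0].push(t)
    for i, stage in enumerate(stages):
        while (item := buffers[i].pop()) is not None:
            buffers[i + 1].push(stage(item))
    out = []
    while (item := buffers[-1].pop()) is not None:
        out.append(item)
    return out
```

A batch driver like this is only a teaching device; the decoupling payoff shows up when each stage is its own thread (or process, or Rust task) and the buffers absorb their speed differences.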

Security — the part that nobody plans for until it bites

Blue Team Handbook (the corpus’s nod to defensive security) covers the incident-response checklist an algo-trading firm needs; trading firms are juicy targets.

Disaster recovery

Restart-replay protocol: snapshot orderbook + position state every 100 ms to a durable store; on restart, replay forward from the most recent snapshot. The protocol is straightforward in principle, full of corner cases in practice — the order of snapshot vs trade event, the gap between last snapshot and crash, the resumption of WS subscriptions.
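A sketch of the replay logic, with a hypothetical event shape and an in-memory stand-in for the durable store. The snapshot-vs-trade-event ordering problem the paragraph mentions reduces to a strict sequence-number comparison: an event with a sequence at or below the snapshot's is already reflected in the snapshot and must not be applied twice.

```python
import json

def apply_event(book, position, ev):
    # Hypothetical event shape for illustration:
    # {"type": "fill", "symbol": ..., "qty": ...}
    if ev["type"] == "fill":
        position[ev["symbol"]] = position.get(ev["symbol"], 0) + ev["qty"]

class SnapshotStore:
    """Restart-replay sketch: persist (seq, book, position)
    periodically; on restart, load the latest snapshot and replay
    every journaled event with a higher sequence number."""

    def __init__(self):
        self.latest = None   # stand-in for a durable store (disk, S3)

    def save(self, seq: int, book: dict, position: dict):
        self.latest = json.dumps(
            {"seq": seq, "book": book, "position": position})

    def restore(self, journal):
        """journal: iterable of (seq, event) in order, possibly
        overlapping the snapshot. Returns (book, position)."""
        state = json.loads(self.latest)
        book, position, seq = state["book"], state["position"], state["seq"]
        for ev_seq, ev in journal:
            if ev_seq <= seq:
                continue         # already captured by the snapshot
            apply_event(book, position, ev)
            seq = ev_seq
        return book, position
```

The remaining gap — events between the last journal write and the crash — is why the restore must end by resubscribing to the WS feed and reconciling positions against the venue's own records.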

The discipline: practice the full restart from snapshot once a month, in production, during a low-vol window. The first time you do it should not be when something is on fire.

§10 — What classical chart-reading still teaches the algo trader

Modern HFT literature can leave you with the impression that microstructure mathematics has rendered traditional technical analysis obsolete. It hasn’t. Two ideas from the older chart-reading canon survive intact into the algo era — and an MM or stat-arb operator who forgets them is the operator who gets ambushed by regime change.

Markets are nonlinear dynamical systems, not random walks. Bill Williams’s Trading Chaos: Maximize Profits with Proven Technical Techniques makes the point that “chaos” in the physics sense is not disorder — it is the study of complex nonlinear systems whose behaviour is deterministic but practically unpredictable. The implication for an Avellaneda-Stoikov quoter is direct: a pricing model that assumes Gaussian returns will be ambushed the moment the market enters a nonlinear regime where that assumption fails. Realised vol clusters; correlations break under stress; queue dynamics flip phase. Your model needs a regime detector upstream of the formula, not a fatter-tailed distribution stuffed into the formula.

Pattern size is a proxy for the magnitude of the move that follows. John J. Murphy’s The Visual Investor: How to Spot Market Trends observes that the larger a reversal pattern is on the vertical axis (i.e. the higher its realised volatility during formation), the larger the subsequent price potential tends to be. The 2026 algo translation: regimes with elevated realised volatility tend to produce larger directional moves once they break. The σ² parameter in your AS quoter is not a passive scaling constant. It is the regime detector hiding in plain sight, and it should drive position sizing, not just spread width.

The bridge to ML feature engineering. The patterns the chart-reading authors codified — head-and-shoulders, support-resistance, breakout-on-volume — are the categorical labels an ML feature pipeline naturally learns when handed enough OHLCV data. The classical literature was right about what to look for; modern practice replaces the human eye with a feature pipeline. The two traditions are in agreement, and the working algo trader should read both.

§11 — The 2026 outlook

A few opinions, with the caveat that opinions about 2026 will look silly by 2028.

Rust + Python is winning the mid-tier. It will not displace pure C++ at the very top end. It does not need to. The mid-tier — single-digit-engineer shops running market-making, basis, and stat-arb at retail-accessible scale — is where most new HFT firms in 2026 are being founded, and the polyglot stack is durably ahead there. The hiring data supports this: a Python-and-Rust quant is dramatically easier to find than a fluent low-latency C++ engineer in 2026, and the productivity per engineer is higher.

AI agents are crossing into research, not signal generation. The kill-rate from auto-research pipelines is high enough to make agents net-productive at hypothesis generation and parameter screening. They are not good enough to generate live trading signals end-to-end. The narrow path through which an agent contributes to a live system is via human-reviewed strategy code that the agent helped draft — not through autonomous execution. The Anthropic 2026 MCP roadmap’s emphasis on “task primitives” and “enterprise readiness” points the same way: deeper research workflows, audit trails, gated execution. Not agents running the trading floor.

Crypto fragmentation is permanent. CEX, DEX, perp DEX, intent-based protocols, L2s — the venue list will grow, not shrink. This is good for HFT (more arb opportunities) and bad for capital efficiency (more places to manage inventory). Bet on architecture that accepts fragmentation, not on a single venue. The 2025–2026 consolidation of L2 TVL into Base (46.58%) and Arbitrum (30.86%) suggests the on-chain side may consolidate, but the off-chain CEX list will keep diversifying.

MEV continues to evolve, not vanish. The professional MEV searchers are now capital-rich and software-mature. Hobbyist edges on mainnet are gone. The shift is to L2s and to intent-based protocols where the MEV game has different rules. The January 2026 academic finding that “naive heuristics overstate sandwich activity, with the majority of flagged patterns being false positives and the median net return for these attacks being negative” on private-mempool rollups suggests MEV is being structurally compressed, not eliminated. A 2026 HFT shop that touches DEXes needs to understand MEV both as a risk to mitigate (don’t lose to sandwich attacks on your own swaps) and as an opportunity (back-running legitimate price discovery).

Tokenisation of equities and FX. The slow movement of TradFi onto crypto-native rails (RWA tokens, on-chain Treasuries, eventually on-chain equities) means crypto-native HFT infrastructure increasingly has to handle non-crypto flows. The polyglot stack is well-positioned; the venue selection becomes a TradFi question more than a crypto-native one.

The ASIC / FPGA frontier for crypto specifically. For the very thin margins on cross-venue latency arb, hardware acceleration is becoming relevant. FPGAs running on the orderbook decode + signal compute path are real in 2026, but they are unnecessary at the mid-tier — they are a year-five optimisation for a shop that has saturated the polyglot stack’s potential. Don’t buy hardware until you’ve run out of software wins.

Regulation 2026. MiCA in the EU (now in full enforcement) makes some forms of cross-venue MM-with-rebate a compliance question. The US perp-DEX status is unsettled but stable enough to operate. The Binance post-settlement environment has stabilised; the firm is back to growth, with stricter compliance. None of this changes the architecture — it changes the venue list.

Closing thought. The polyglot stack is durable; the venue list is not. Bet on architecture, not on a name. And invest in your test harness — the strategy will be wrong, the harness will tell you.

§12 — Methodology, sources, reading list

Which sections lean on cited literature, and which draw on industry knowledge

| Section | Grounding |
| --- | --- |
| §1 Intro | mixed, lightly cited |
| §2 Strategy taxonomy | grounded (Aldridge, Practical Guide, Time Series Forecasting Using Foundation Models) |
| §3 Polyglot Python + Rust | grounded (Latency: Reduce Delay, Building GenAI Services with FastAPI); was external in v1 |
| §4 Crypto microstructure | mixed (corpus on principles; internet for 2026 fee schedules + L2 data + intents) |
| §5 Market making | heavily grounded (Aldridge, Practical Guide, Stoikov 2024) |
| §6 Other strategies | grounded (Aldridge, Practical Guide, Time Series Analysis with Python Cookbook) |
| §7 MCP / GenAI connectors | mixed (corpus on FastAPI + LLM orchestration + MCP mention; internet for current MCP spec) |
| §8 Backtesting + AI auto-research | grounded (Zakamulin, Chan, TS Cookbook, Foundation Models; internet for CPCV) |
| §9 Production concerns | grounded (Latency, ML Platform Engineering, Mastering Software Architecture, Blue Team Handbook; Practical Guide on kill switch) |
| §10 Classical chart-reading | grounded (Williams, Murphy) |
| §11 Outlook | opinion, with internet citations on L2 data and MEV research |
| §12 Methodology | this section |

Bibliography

Core HFT theory and microstructure:

  1. Aldridge, Irene. High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems. 1st ed. Wiley, 2010.
  2. Aldridge, Irene. High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems. 2nd ed. Wiley, 2013.
  3. Chan, Ernest P. Quantitative Trading: How to Build Your Own Algorithmic Trading Business. Wiley, 2008.
  4. Chan, Ernest P. Algorithmic Trading: Winning Strategies and Their Rationale. Wiley, 2013.
  5. Zakamulin, Valeriy. Market Timing With Moving Averages: The Anatomy and Performance of Trading Rules. Palgrave Macmillan, 2017.
  6. Grimes, Adam. The Art and Science of Technical Analysis: Market Structure, Price Action, and Trading Strategies. Wiley, 2012.

Engineering and systems:

  1. Latency: Reduce Delay in Software Systems. Manning, 2024.
  2. Building Generative AI Services with FastAPI. O’Reilly, 2024.
  3. Machine Learning Platform Engineering. Manning, 2024.
  4. Mastering Software Architecture. O’Reilly, 2024.
  5. Architecting AI Software Systems. O’Reilly, 2024.
  6. Rust for Blockchain Application Development. Packt, 2024.
  7. Blue Team Handbook: Incident Response Edition. Created Independently, 2014; reissued.

Machine learning and time-series:

  1. Hands-On Machine Learning with scikit-learn and PyTorch. O’Reilly, 2025.
  2. Time Series Analysis with Python Cookbook. Packt, 2024.
  3. Time Series Forecasting Using Foundation Models. Manning, 2025.
  4. Practical Generative AI with ChatGPT. O’Reilly, 2024.

Classical chart-reading:

  1. Williams, Bill. Trading Chaos: Maximize Profits with Proven Technical Techniques. 2nd ed. Wiley, 2004.
  2. Murphy, John J. The Visual Investor: How to Spot Market Trends. 2nd ed. Wiley, 2009.

Internet sources cited (with access dates)

What this corpus and the cited internet sources do not cover

  1. The exact production parameters of any specific operating shop (γ, k, exact symbol selection, exact venue routing) — these are competitive secrets and are not in the literature.
  2. Sub-microsecond C++ tier-1 specifics (FPGA, ASIC) — covered by industry conference talks, not by the corpus.
  3. Real-time regulatory changes after May 2026 — verify with the relevant venue and jurisdiction at write-time of any production decision.

Short follow-up reading list

If the article was useful, the ten books to read next, in order:

  1. Aldridge — High-Frequency Trading: A Practical Guide (2nd ed.) — the spine
  2. Chan — Algorithmic Trading: Winning Strategies and Their Rationale — backtest discipline
  3. Zakamulin — Market Timing With Moving Averages — walk-forward methodology
  4. Grimes — The Art and Science of Technical Analysis — the human side
  5. Latency: Reduce Delay in Software Systems — production latency engineering
  6. Machine Learning Platform Engineering — deploy, monitor, scale
  7. Building Generative AI Services with FastAPI — the agent layer
  8. Hands-On Machine Learning with scikit-learn and PyTorch — the ML toolkit
  9. Time Series Forecasting Using Foundation Models — when transformers matter
  10. Mastering Software Architecture — patterns that hold the system together

Comments and corrections are welcome. The Avellaneda-Stoikov derivation, the VPIN definition, and the JELLYJELLY incident details are reproduced from the cited sources; if you spot an error against the original papers or news reports, please flag it — algo traders die from undetected formula bugs and from outdated incident summaries.

Vlad Benkovskyi, codefather.dev

This article was originally published on Cryptocurrency Tag and is republished here under RSS syndication for informational purposes. All rights and intellectual property remain with the original author. If you are the author and wish to have this article removed, please contact us at [email protected].
