
High-Frequency AI Based Trading on Crypto in 2026

By Vlad Benkovskyi (codefather.dev) · Published May 3, 2026 · 71 min read

A Python + Rust Polyglot, with AI/MCP in the Research Loop


Length: ~15 000 words / 60-minute read. Designed for non-linear consumption — jump to §5 (market making) or §8 (AI auto-research) if those are why you’re here.

Table of contents

  1. Why this article exists
  2. HFT 2026: the strategy taxonomy
  3. The Python + Rust polyglot
  4. Crypto microstructure 2026
  5. Market making, the deep dive
  6. Other HFT strategies on crypto
  7. Trading automation via MCP / GenAI connectors
  8. Backtesting and auto-research with AI
  9. Production concerns
  10. What classical chart-reading still teaches the algo trader
  11. The 2026 outlook
  12. Methodology, sources, reading list

TL;DR — for the reader who needs the bottom line in 60 seconds

The architecture. A small algo shop in 2026 should write the latency-sensitive path (WebSocket decode, orderbook, signal compute, order encode) in Rust and the orchestration / research / backtest / monitoring path in Python, bridged by PyO3. The polyglot stack achieves sub-millisecond tick-to-trade on commodity AWS hardware, which is competitive everywhere except the absolute tier-1 latency-arbitrage frontier where co-located C++ still wins. The mid-tier is where most new HFT firms in 2026 are actually being founded.

The strategies that pay. Cross-exchange market making with Avellaneda-Stoikov skew quoting and a hedging pipe; perpetual-spot basis with funding-rate harvesting; targeted triangular arbitrage on tier-3 venues during liquidation cascades. Pure latency arbitrage at the tier-1 level is structurally hard for a small team. Statistical arbitrage on alt-coin pairs is largely arb’d-out.

The AI part. AI agents wired through the Model Context Protocol have crossed into the research and backtest layer — as hypothesis generators and parameter screens — but not into live execution. Every agent action is a draft for human review; the kill-rate of agent-proposed strategies is a direct productivity metric.

The discipline. Walk-forward backtesting + Monte Carlo perturbation + combinatorial purged cross-validation are the antidote to data-snooping. Tail-latency monitoring at p99/p999, hard inventory caps, kill switches, and a monthly disaster-recovery drill are the antidote to operational disasters. The Hyperliquid HLP / JELLYJELLY incident in March 2025 (a roughly $13.5M unrealised loss) is the canonical 2025 lesson on cornered single-venue MMs.

The bet. The polyglot stack is durable; the venue list is not. Bet on architecture, not on a name.

§1 — Why this article exists

Three structural shifts have made the textbook HFT picture from the late 2010s obsolete for any shop that isn’t sitting on a co-located rack inside a Mahwah or Aurora datacenter.

First, crypto venues are now the primary retail-accessible HFT habitat. Binance, Bybit, OKX, Hyperliquid, dYdX v4, Aevo, Coinbase — fragmented across CEX, DEX, perp DEX, and L2, with native perpetuals, 24/7 markets, and matching engines that run from sub-millisecond on the centralised side down to one-block latency on the on-chain side. The classical equities microstructure literature still applies. The operating environment is alien to it.

Second, polyglot Python + Rust has displaced the C++ monoculture in the mid-tier. Rust handles the hot path — WebSocket decode, orderbook update, signal compute, FIX or binary egress. Python handles orchestration, ML serving, research notebooks, monitoring, and the increasingly important AI-research layer. This is not a compromise. For teams of fewer than a dozen engineers, it is the better architecture. The talent pool is wider, the dev velocity is higher, and the marginal microsecond C++ might buy you is invisible in crypto where the matching engines themselves run at 100 µs to 1 ms tick-to-trade.

Third, AI agents wired through the Model Context Protocol have crossed into the research and backtest layer. Not as executors — never as executors — but as hypothesis generators, parameter-grid screens, and literature ingesters. The kill-rate of agent-proposed strategies is the new productivity metric for a research team, and most of the workflow that used to take a quant analyst a week now takes a properly-scoped MCP agent an afternoon.

This article is a synthesis. The theory sections lean heavily on a personal trading library: Irene Aldridge’s High-Frequency Trading, her revised High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems, Ernest Chan’s Algorithmic Trading: Winning Strategies and Their Rationale and Quantitative Trading, Valeriy Zakamulin’s Market Timing With Moving Averages, Adam Grimes’s The Art and Science of Technical Analysis, Bill Williams’s Trading Chaos, John Murphy’s The Visual Investor, plus a stack of recent engineering books — Latency: Reduce Delay in Software Systems, Building Generative AI Services with FastAPI, Hands-On Machine Learning with scikit-learn and PyTorch, Machine Learning Platform Engineering, Time Series Analysis with Python Cookbook, Time Series Forecasting Using Foundation Models, Mastering Software Architecture, Architecting AI Software Systems, and Rust for Blockchain Application Development. Where the books cover a topic, citations are inline. The 2026-specific items — current fee schedules, the latest MCP spec, recent papers, the Hyperliquid March 2025 incident — come from dated internet sources and are flagged as such.

Who this article is for. Three readers, in order:

What this article is not. It is not a tutorial. It will not teach you Rust, will not walk you through your first backtest, and will not tell you that you can get rich quoting both sides of a thin alt-coin. It is a map of the durable architectural patterns hiding under the venue-of-the-month noise, and a reading list dense enough that you can spend the next year filling in the gaps from the corpus rather than from random Twitter threads.

§2 — HFT 2026: the strategy taxonomy

Strategies in the HFT zoo split into four families. The boundaries blur — a real production system usually runs two or three at once — but each family has a distinct mathematical signature and a distinct latency profile, and conflating them is the most common source of mis-architected first systems.

Latency-driven

These are the strategies the public hates. Latency arbitrage is the canonical example: the same instrument is priced fractionally differently across venues for tens of microseconds, and you race to capture it. Aldridge frames latency arbitrage as the headline example of HFT-as-controversy: in the Practical Guide she writes that “latency arbitrage is often pinpointed by the opponents of HFT as the most direct example of the technological” race being problematic for fair markets, then proceeds to show that without latency arbitrage, prices don’t converge across venues and the market is less efficient.

Three subvariants matter for crypto:

Liquidity-providing

You quote a two-sided market and earn the spread, minus inventory cost, minus adverse selection. This is market making and gets its own deep-dive in §5. The mathematics goes back to Glosten and Milgrom (1985), whose adverse-selection model showed that “in the presence of a large number of informed traders, a market maker will set unreasonably high spreads in order to break even” — a formulation paraphrased in Aldridge HFT. The Avellaneda-Stoikov 2008 paper is its operational descendant; in 2024 Stoikov published Market Making in Crypto (Stoikov et al., SSRN 5066176, accessed May 2026) which adapted the framework to crypto perpetual contracts and built it on top of the open-source Hummingbot platform.

Statistical

Cointegrated pairs, lead-lag relationships, mean-reverting baskets. In Aldridge HFT the relevant chapter defines cointegration as “a popular technique used for optimal portfolio construction, hedging, and risk management” — the “contemporaneous or lagged effect of one variable on another.” The distinction from correlation is subtle and load-bearing: two cointegrated series can have low day-to-day correlation but a stable long-run relationship that mean-reverts. Pairs trading lives or dies on whether the spread is genuinely cointegrated (Engle-Granger or Johansen test) versus merely correlated (which decays whenever volatility regimes shift).

In crypto 2026, the durable stat-arb edge has been basis — long spot, short perp, harvest the funding rate spread — rather than classical pairs trading. Pairs trading on alt-coins has been crowded out by AMM-flow noise and by the speed at which alts come and go from venue lists.
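The cointegration-versus-correlation distinction above has a cheap first-pass check: regress one leg on the other (the Engle-Granger first stage) and estimate the spread's mean-reversion half-life from an AR(1) fit. This is a rough screening sketch, not a substitute for a proper ADF or Johansen test; all names and parameters are illustrative:

```python
import numpy as np

def hedge_ratio(y: np.ndarray, x: np.ndarray) -> float:
    """OLS slope of y on x — the Engle-Granger first-stage regression."""
    beta, _intercept = np.polyfit(x, y, 1)
    return beta

def half_life(spread: np.ndarray) -> float:
    """Mean-reversion half-life from an AR(1) fit:
    Δs_t = a + b·s_{t-1} + ε  →  half-life = −ln(2)/b (needs b < 0)."""
    lag = spread[:-1]
    delta = np.diff(spread)
    b, _a = np.polyfit(lag, delta, 1)
    if b >= 0:
        return float("inf")  # no mean reversion — do not trade the pair
    return -np.log(2.0) / b

# Synthetic demo: an AR(1) spread with coefficient 0.95 → true half-life ≈ 13.9 bars.
rng = np.random.default_rng(42)
s = np.zeros(5000)
for t in range(1, 5000):
    s[t] = 0.95 * s[t - 1] + rng.normal(0, 0.1)
hl = half_life(s)
```

A half-life of a few bars is tradeable at high frequency; a half-life of days means the "pair" is just slow correlation wearing a costume.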

Predictive ML

Regressors trained on order-book microstructure features to predict next-tick mid-price; classifiers tagging market regime; reinforcement-learning policies for queue-position management. Aldridge HFT makes the point bluntly: “‘neural network’ is sometimes perceived to signal advanced […] of a high-frequency system. In reality, neural networks are built […to] simplify algorithms dealing with econometric estimation.” The book is right and the framing has aged well — neural networks in HFT are signal sources, not strategies.

The 2026 update is foundation models for time-series. Time Series Forecasting Using Foundation Models opens by saying the transformer architecture “was proposed [for natural language but] is now applied to forecasting … from a time-series forecasting point of view.” For an HFT desk in 2026, the practical question is not whether to use a transformer — the question is when a 100M-parameter transformer beats a 200-feature LightGBM on tick data. The answer in the corpus and in the recent literature is: rarely, and only with carefully designed input encodings.

A comparative fit table

| Family | Capital req | Latency req | Team-size fit | Crypto-2026 edge |
| --- | --- | --- | --- | --- |
| Latency arb (tier-1) | High | Extreme | 5+ engineers, co-lo | Shrinking |
| Latency arb (tier-3 / cross-venue) | Medium | High | 2–3 engineers | Real |
| Triangular arb | Low–Medium | High | 1–2 engineers | Tier-3 only |
| Market making (Avellaneda-Stoikov) | Medium | Medium | 2–3 engineers | Strong |
| Cross-exchange MM | Medium | High | 2–3 engineers | Strong |
| Statistical arb (basis) | Medium | Low | 1–2 engineers | Crowded but durable |
| Funding-rate arb | Medium | Low | 1 engineer | Steady |
| Liquidation hunting (perps) | High | Medium | 2–3 engineers | Real, data-heavy |
| Predictive ML signals | Medium | Medium | 2 engineers + 1 data scientist | Hard but possible |

A two-person Rust+Python shop that picks two of these families and ignores the rest will outperform a six-person shop that tries all of them. Specialisation is not glamorous but it is profitable.

§3 — The Python + Rust polyglot

The mid-tier has consolidated around a polyglot stack. The reasons are unromantic and well-documented in the corpus.

Why pure C++ lost ground

Pure C++ still wins at the very top end — the sub-microsecond co-located equities desks. It loses in the mid-tier because three things changed in the last decade. The talent pool of 25-year-olds who can ship safe production C++ has shrunk. Dev velocity in C++ is genuinely poor for a small team — what takes a week in C++ takes a day in Rust + Python. And the marginal microsecond C++ might buy you is invisible in crypto, where the matching engines themselves run at 100 µs to 1 ms.

Rust gives you the predictability — no GC pauses, no allocator stalls in the wrong moment, exhaustive pattern matching so the compiler catches the off-by-one before it reaches a fill. Python gives you the velocity — every research notebook, every backtest, every monitoring dashboard, every MCP agent is faster to write in Python than in any compiled language.

Why pure Python isn’t enough

The Python Global Interpreter Lock is the most-cited reason and one of the most-misunderstood. Building Generative AI Services with FastAPI makes the point concisely: “In Python, the CPU time can be allocated to only one task at any moment because of Python’s Global Interpreter Lock (GIL). Python’s GIL allows only one thread” to actively execute Python bytecode at any given instant. For a CRUD web service this is a non-issue — async I/O carries the load. For HFT, two consequences matter:

  1. You cannot get true CPU parallelism in pure Python. No matter how clever your asyncio is, two CPU-bound coroutines cannot run on two cores. The orderbook decoder and the signal computer both want CPU; they cannot share Python.
  2. GC pauses and allocator behaviour are uncontrolled. A dict resize in the wrong moment costs you 200 µs of jitter. The reference-counting GC stalls on cyclic deallocation. Sub-millisecond tail-latency in pure Python is not achievable in 2026.

The polyglot answer is to put everything CPU-bound or jitter-sensitive in Rust, and use Python as the conductor.

Tail-latency theory

Latency: Reduce Delay in Software Systems puts it concisely: “percentiles tail latency [is what we mean] because the percentiles are at the tail of the latency distribution. If you have measured latency, you’ve almost certainly observed the tail latency but dismissed it” as noise. This is the load-bearing intuition for HFT operators. The mean is always fine; you die on the tails. p99 and p999 are the metrics that move P&L. If you only monitor the mean, you will never see the tail latency that adversely-selects you on every trend day.

The corpus’s framework decomposes latency into named segments — network, kernel, user-space, application, wire — and assigns a budget per segment. For a crypto HFT system that wants 1 ms tick-to-trade, the budget looks roughly like this:

| Segment | Budget | Implementation |
| --- | --- | --- |
| NIC arrival → kernel | 0–5 µs | Linux io_uring; kernel-bypass not necessary at 1 ms |
| WebSocket frame decode | 5–15 µs | Rust + simd-json |
| Orderbook update | 1–5 µs | Rust, lock-free, custom |
| Signal compute | 5–30 µs | Rust, SIMD where possible |
| Strategy decision (Python branch) | 100–1000 µs | Python asyncio, called via PyO3 |
| Order encode | 5–15 µs | Rust |
| NIC send | 0–5 µs | Same as arrival |
| Total tick-to-trade | ~150–1100 µs | |

Two observations. First, the Python branch is by far the largest segment but it is not on the signal-to-quote path — quote updates happen in Rust based on the most recent Python-issued parameters. Second, none of these numbers are aspirational; they are achievable on a 2024-vintage AWS c7i instance with no kernel-bypass tricks.
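The p99/p999 monitoring the section argues for needs nothing fancier than nearest-rank percentiles over a rolling sample. A minimal sketch (the latency numbers below are synthetic, chosen to show a tight mean hiding a fat tail):

```python
import random

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile — good enough for a latency dashboard."""
    s = sorted(samples)
    idx = min(len(s) - 1, int(p * len(s)))
    return s[idx]

# Synthetic tick-to-trade latencies in µs: 99.9% gaussian around 200 µs,
# plus a handful of multi-millisecond stalls (GC pause, allocator stall, WS hiccup).
random.seed(7)
lat = [random.gauss(200, 20) for _ in range(9990)] + \
      [random.uniform(2000, 5000) for _ in range(10)]

p50 = percentile(lat, 0.50)
p99 = percentile(lat, 0.99)
p999 = percentile(lat, 0.999)
```

The mean and p50 look healthy; only p999 exposes the stalls — exactly the "dismissed as noise" failure mode the latency book warns about.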

Memory allocator analysis

Latency: Reduce Delay in Software Systems covers memory pools as a fundamental latency-hiding technique: “instead of allocating and deallocating memory to perform some work, you borrow memory from a memory pool. Of course, memory pools have similar challenges to thread pools, where sizing the memory pool” must match the working set. For an HFT engine, three allocator choices matter:

For a Rust HFT engine, the practical answer is: jemalloc as the global allocator, plus per-component arenas for orderbook nodes and fill records, plus bounded ring buffers for IPC. The latency book’s framework — “size the pool to your working set, not your peak” — becomes operational discipline.

PyO3 deep-dive

PyO3 is the Rust ↔ Python bridge. A minimal example:

// src/orderbook.rs
use pyo3::prelude::*;
use pyo3::exceptions::PyValueError;

#[pyclass]
pub struct OrderBook {
    bids: Vec<(f64, f64)>,
    asks: Vec<(f64, f64)>,
}

#[pymethods]
impl OrderBook {
    #[new]
    fn new() -> Self {
        Self {
            bids: Vec::with_capacity(1024),
            asks: Vec::with_capacity(1024),
        }
    }

    fn apply_delta(&mut self, side: &str, price: f64, qty: f64) -> PyResult<()> {
        if !price.is_finite() || price <= 0.0 {
            return Err(PyValueError::new_err("non-finite or non-positive price"));
        }
        let book = if side == "bid" { &mut self.bids } else { &mut self.asks };
        // merge / replace / remove logic; lock-free path elided for brevity
        Ok(())
    }

    fn microprice(&self, depth: usize) -> Option<f64> {
        if self.bids.is_empty() || self.asks.is_empty() {
            return None;
        }
        // size-weighted midprice over the top-N levels
        let sum_bids: f64 = self.bids.iter().take(depth).map(|(p, q)| p * q).sum();
        let sum_asks: f64 = self.asks.iter().take(depth).map(|(p, q)| p * q).sum();
        let qty_bids: f64 = self.bids.iter().take(depth).map(|(_, q)| q).sum();
        let qty_asks: f64 = self.asks.iter().take(depth).map(|(_, q)| q).sum();
        Some((sum_bids + sum_asks) / (qty_bids + qty_asks))
    }
}

#[pymodule]
fn ob_core(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_class::<OrderBook>()?;
    Ok(())
}
# strategy/loop.py
import ob_core

book = ob_core.OrderBook()

async def run(ws_stream):
    async for msg in ws_stream:
        book.apply_delta(msg.side, msg.price, msg.qty)  # ~1 µs in Rust
        mp = book.microprice(5)                         # ~2 µs
        if mp is not None:
            await quote(mp)                             # Python branch

Three traps to know about:

  1. Refcount leaks across the FFI. PyO3 hands out Py<T> wrappers that the Python side reference-counts. If your Rust code stores a Py<T> and never drops it, Python's GC cannot collect — leaks accumulate slowly and surface as memory growth on day three.
  2. Panic propagation. A Rust panic across the FFI becomes a Python exception. This is correct, but it means your Rust hot-path unwrap() becomes a Python RuntimeError at the worst possible moment. Use Result exhaustively in the hot path; never unwrap.
  3. GIL hold time. Anything in #[pymethods] runs while holding the GIL. If you do CPU-heavy work, release the GIL with py.allow_threads(|| { ... }) so other Python threads can progress.

Inter-process communication

Three IPC mechanisms, with measured roundtrip latencies on a single host:

| Mechanism | Roundtrip | Throughput | When |
| --- | --- | --- | --- |
| Shared-memory ring buffer (e.g. iceoryx2, hand-rolled with memmap2) | 5–10 µs | Very high | Ingestor → strategy on same host |
| ZeroMQ (zmq Rust + pyzmq Python) | 10–50 µs | High | Strategy → executor across processes |
| gRPC | 500–2000 µs | Moderate | Control plane only — not the hot path |

The unsexy truth: most teams should start with ZeroMQ. Shared-memory is a year-three optimisation. gRPC has its place — for management endpoints, config reload signals, and metrics — but never on the trade path.
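To make the shared-memory row concrete, here is a toy single-producer/single-consumer ring over POSIX shared memory using only the standard library. It is a sketch of the mechanism, not a production IPC layer — real systems need atomic head/tail indices, cache-line padding, and backpressure; all names here are hypothetical:

```python
import secrets
import struct
from multiprocessing import shared_memory

SLOT = struct.Struct("<ddd")   # one message = (price, qty, timestamp), 24 bytes
N_SLOTS = 1024
HDR = 16                       # head (8 bytes) + tail (8 bytes)

class ShmRing:
    """Toy SPSC ring buffer over shared memory. Producer advances head,
    consumer advances tail; indices are plain ints, NOT atomic."""
    def __init__(self, name: str, create: bool):
        size = HDR + N_SLOTS * SLOT.size
        self.shm = shared_memory.SharedMemory(name=name, create=create, size=size)
        if create:
            self.shm.buf[:HDR] = bytes(HDR)  # head = tail = 0

    def _idx(self, off: int) -> int:
        return struct.unpack_from("<q", self.shm.buf, off)[0]

    def push(self, price: float, qty: float, ts: float) -> None:
        head = self._idx(0)
        SLOT.pack_into(self.shm.buf, HDR + (head % N_SLOTS) * SLOT.size, price, qty, ts)
        struct.pack_into("<q", self.shm.buf, 0, head + 1)  # publish after write

    def pop(self):
        head, tail = self._idx(0), self._idx(8)
        if tail == head:
            return None  # empty
        msg = SLOT.unpack_from(self.shm.buf, HDR + (tail % N_SLOTS) * SLOT.size)
        struct.pack_into("<q", self.shm.buf, 8, tail + 1)
        return msg

# Single-process demo: in practice the producer and consumer are two processes
# that open the same segment by name.
ring = ShmRing(f"ring_{secrets.token_hex(4)}", create=True)
ring.push(60000.5, 0.25, 1.0)
msg = ring.pop()
ring.shm.close()
ring.shm.unlink()
```

The design point: no serialisation, no syscall on the hot path — the cost per message is two struct packs and an index bump, which is why this topology wins on same-host latency.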

Async runtime selection

Tokio is the default. For aggressive thread-per-core architectures, glommio (based on Linux io_uring) gives you better tail latency at the cost of less ecosystem. The 2024 onwards trend is tokio-uring, which gives you io_uring underneath the standard tokio ergonomics. Choose tokio for almost everything; reach for glommio only when p999 matters more than dev-velocity.

Crate picks

Concrete picks with rationale, current as of 2026:

A “learn this much Rust” curriculum for the Python quant

A pragmatic minimum to be productive on a polyglot HFT stack — about three weeks of deliberate practice:

  1. Ownership, borrowing, lifetimes — the language-defining concepts (week 1)
  2. Result, Option, error propagation with ? (week 1)
  3. Tokio basics — async fn, Mutex, Notify, channel (week 2)
  4. PyO3 — exporting types, error mapping, GIL handling (week 2)
  5. Profiling — perf, flamegraphs, allocator instrumentation (week 3)

Skip macros, advanced traits, async-trait gymnastics until you have shipped something. The 80% you need is the boring 20% of the language.

Polyglot-stack failure modes

| Failure | Symptom | Cure |
| --- | --- | --- |
| GIL-induced pause in async loop | Strategy decision latency spikes p99 → p999 ratio | Move CPU-bound work into Rust; release GIL with allow_threads |
| FFI panic propagates as RuntimeError | Strategy crashes on first malformed message | Use Result exhaustively; never unwrap in hot path |
| Refcount leak across PyO3 | Memory grows over days | Audit Py&lt;T&gt; lifetimes; use weak references where possible |
| Allocator stalls | p99 latency drift over hours | Switch to jemalloc; pre-allocate pools |
| WS reconnect bug | Stale orderbook for 200 ms after disconnect | Chaos-monkey the ingestor; force gap-recovery in tests |

§4 — Crypto microstructure 2026

A crypto venue in 2026 is a textbook continuous-double-auction matching engine wrapped in three layers of API. Practical Guide describes the matching engine archetype: “a large market sell order placed earlier by algorithm B arrives at the trading venue’s matching engine. t = 12:13:01:005618: A market sell order placed earlier by algorithm C arrives” — strict price-time priority, deterministic ordering, microsecond-resolution timestamps. The physics is the same as in equities. Everything around it is venue-specific.

Fee schedules — the table that drives strategy economics

Maker-taker fees are the single biggest determinant of which strategies a venue can host. Numbers below are current as of May 2026, drawn from each venue’s published fee page; verify before sizing capital because the schedules change quarterly.

| Venue | Spot maker / taker | Perp maker / taker | Notes |
| --- | --- | --- | --- |
| Binance | 0.075% / 0.075% | 0.02% / 0.05% | BNB discount available; VIP tiers |
| Bybit | 0.10% / 0.10% | 0.02% / 0.055% | BIT discount; aggressive MM rebates at top tiers |
| OKX | 0.08% / 0.10% | 0.02% / 0.05% | VIP 8 maker fee can hit -0.01% (rebate) |
| Hyperliquid | n/a (perps focus) | 0.015% / 0.045% | No maker rebate but lowest taker among major venues; see Hyperliquid Docs — Fees |
| dYdX v4 | n/a | 0% / 0.05% (base) | Designed to attract MMs; epoch-based rebates |
| Coinbase | 0.40% / 0.60% (retail) → near-zero (Advanced) | 0.02% / 0.05% (US perps) | Tiered |

For a market-making strategy quoting a 1 bp spread, the difference between a 0.015% and a 0.10% taker fee is the entire P&L. Strategy economics are venue economics. The decision of which venues to support is a business decision, not an engineering one, and it is the single highest-leverage decision the team makes.

Order types you actually use

The marketing list of order types is ten items long. The list you actually use is six.

Tick size and lot precision

Tick size — the minimum price increment — determines whether MM strategies are viable on a given symbol. A tick of 0.01 USDT on a 60 000-USDT symbol is about 0.0017 bp; a tick of 0.1 is about 0.017 bp. The ratio of bid-ask spread to tick size determines whether you can quote profitably with a one-tick edge or need to skip levels. Most major venues use tighter ticks for higher-volume symbols and looser ticks for lower-volume ones; the practical effect is that the thin venues you’d want to MM are exactly the venues with the worst tick economics.

Lot precision is the symmetric problem on quantity. A precision of 0.001 BTC on a 60 000-USDT symbol is 60 USDT — an MM that wants to quote 30 USDT per side simply cannot.

Sequence-gap recovery

Every WebSocket market-data feed eventually gaps. The book updates arrive with sequence numbers, and when a sequence is missing, the venue specifies a recovery sub-protocol — Binance’s is different from Bybit’s, OKX’s is different again. Each protocol works correctly on the happy path. Each is broken on at least one edge case. Build a chaos-monkey for your own ingestor and force-test every gap-recovery path. The reconnect logic will be the source of your worst-ever P&L disaster if you don’t.

The pragmatic pattern: maintain a snapshot ID and a sequence counter. On gap detection, drop all open quotes within 5 ms (faster than any informed flow can pick you off), then refetch the snapshot via REST, then re-subscribe to the diff stream from the snapshot’s sequence. Anything else is theatre.
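The snapshot-plus-sequence pattern reduces to a small state machine. A sketch (class and action names are hypothetical; in production "cancel_all_quotes" is a direct call into the Rust executor, not a string):

```python
from enum import Enum, auto

class BookState(Enum):
    LIVE = auto()
    RECOVERING = auto()

class GapDetector:
    """Tracks the diff-stream sequence. On a gap: pull all quotes,
    refetch the snapshot, resubscribe from the snapshot's sequence."""
    def __init__(self):
        self.state = BookState.RECOVERING
        self.next_seq = None
        self.actions = []  # stand-in for calls into the executor

    def on_snapshot(self, seq: int) -> None:
        self.next_seq = seq + 1
        self.state = BookState.LIVE

    def on_delta(self, seq: int) -> None:
        if self.state is not BookState.LIVE:
            return  # deltas are discarded until the new snapshot lands
        if seq != self.next_seq:  # gap detected
            self.state = BookState.RECOVERING
            self.actions += ["cancel_all_quotes", "refetch_snapshot"]
            return
        self.next_seq = seq + 1

g = GapDetector()
g.on_snapshot(100)
g.on_delta(101)
g.on_delta(103)  # 102 is missing → quotes pulled, recovery begins
```

The ordering matters: cancel first, recover second. A stale book with live quotes is the disaster; a dark book with no quotes is merely downtime.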

Funding rates as a structural HFT input

Perpetual swaps have no expiry. To keep the perp price tethered to spot, venues implement a funding rate: long holders pay short holders (or vice-versa) at a fixed cadence. The standard formula is approximately:

funding_rate = clamp(premium_index + interest_rate_diff, -0.75%, +0.75%)
where premium_index = (perp_mid - spot_index) / spot_index

The cadence varies — Binance and Bybit settle every 8 hours, Hyperliquid and dYdX v4 every 1 hour. The 1-hour cadence creates more granular carry opportunities and more frequent timing-edge moments around the funding snap.

A market-neutral funding harvest — long spot + short perp when funding is positive — is one of the few reliable carry strategies in crypto. It is also crowded, so your edge has to come from latency on entry/exit and sizing into the cap. The “edge” is timing the snap, not the steady-state carry.
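The formula above is simple enough to sanity-check in a few lines. A sketch — the interest-rate component default is an assumption (0.01% per period, roughly the figure major CEXes publish), and the ±0.75% cap follows the clamp above:

```python
def funding_rate(perp_mid: float, spot_index: float,
                 interest_diff: float = 0.0001, cap: float = 0.0075) -> float:
    """Clamped funding rate per period: premium index plus the venue's
    fixed interest component (0.01% per period assumed here)."""
    premium = (perp_mid - spot_index) / spot_index
    return max(-cap, min(cap, premium + interest_diff))

def annualised_carry(rate_per_period: float, periods_per_day: int) -> float:
    """Steady-state carry of long-spot/short-perp if funding stayed put."""
    return rate_per_period * periods_per_day * 365

r = funding_rate(60030.0, 60000.0)   # perp trades 5 bp rich → positive funding
carry = annualised_carry(r, 3)       # 8-hour cadence → 3 settlements/day
```

With the perp 5 bp rich, one period pays 6 bp (premium plus interest), which annualises to roughly 66% — which is exactly why the steady-state carry never survives: capital floods in until the premium collapses, leaving only the timing edge around the snap.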

CEX-DEX latency profile

A spot trade on Binance settles in microseconds at the matching engine. A swap on Uniswap v3 settles in one Ethereum-block — in 2026 that’s 12 seconds on mainnet, 2 seconds on Base, sub-second on Arbitrum and Optimism. Bridges between chains add 1–10 minutes of finality, depending on bridge.

The implication for HFT: pure on-chain HFT is a different sport with different latency rules. Cross-CEX-DEX strategies live on the bridges. And the rise of intent-based protocols — UniswapX (Dutch-auction off-chain orders, executed by competing fillers) and CoW Swap (batch auctions with coincidence-of-wants matching, settled by competing solvers) — has shifted the on-chain liquidity model from AMM-only to AMM-plus-solver-network. UniswapX in 2026 “handles gasless single-chain swaps, cross-chain trades, and MEV-protected execution for retail wallets” (eco.com — UniswapX guide, accessed May 2026); CoW Swap “collects orders for roughly 30 seconds, bundles them into a batch, and auctions the right to settle the batch off to a network of competing solvers” (cow.fi documentation, accessed May 2026).

MEV taxonomy for the trader’s perspective

MEV is both adversary (your transaction can be sandwich-attacked) and opportunity (you can be the one extracting MEV). The taxonomy as of 2026:

In 2026 most MEV has moved off mainnet to L2s. Base (46.58% of L2 TVL) and Arbitrum (30.86%) dominate (blockeden.xyz — L2 consolidation, accessed May 2026), and on L2s the sequencer holds significant MEV power because it controls transaction ordering — which is currently centralised on most rollups.

Queue position and adverse selection in crypto

Practical Guide explains the queue mental model: “this queue can be thought of as a line for airport check-in. Unlike the airport line, however, the queue often has a finite length or capacity; therefore, any quote arrivals” beyond the cap get rejected or pushed to the next price level. The implication for crypto: you do not race to the front of every level. You race to be first to a level the market is about to revisit.

Aldridge HFT makes the adverse-selection point that goes with it: “Harris and Panchapagesan [2002] show that market makers able to fully observe the information in [the limit] order book can extract abnormal returns, or ‘pick off’ other limit-order traders” who haven’t moved. In crypto the picking-off is more aggressive than in equities because retail flow is louder, more concentrated, and travels in regime-correlated waves. A passive limit order at the front of the queue during a high-vol moment is a target.

§5 — Market making, the deep dive

This is the load-bearing strategy section. The corpus is dense here — Aldridge and the Practical Guide together are essentially a textbook on the topic, and the recent Market Making in Crypto paper by Stoikov and his coauthors provides the 2024–2026 update.

The bid-ask spread as compensation

A market maker quotes a buy and a sell. The fundamental theorem is that the spread compensates the MM for two costs: inventory holding cost and adverse-selection cost. The Glosten-Milgrom 1985 model formalised the second component; in Aldridge HFT the result is paraphrased as: “one outcome of Glosten and Milgrom (1985) is that in the pre[sence of] a large number of informed traders, a market maker will set unrea[sonably] high spreads in order to break even.” If a fraction of incoming flow is informed (i.e. trades on private information about the future price), the MM loses on every informed trade and must widen spreads against uninformed flow to compensate.

The Avellaneda-Stoikov 2008 model is the operational complement, focused on the first component (inventory cost). It assumes a representative MM with finite inventory tolerance who wants to maximize expected utility over a finite horizon. Aldridge HFT describes the result: “for fully rational, ‘risk[-averse]’ traders, the strategy of Avellaneda and Stoikov (2008) also outp[erforms] the ‘symmetric’ bid and ask strategy whereby the trader places […] ask limit orders that are equidistant from the [mid-price].” The asymmetry — quoting around a reservation price rather than the mid — is what makes the model work.

The Practical Guide states the MM’s job description in plain language: “as inventory [accumulates], the market maker begins to manage it, to reduce risk and enhance profitability. The two broad functions of a market maker are therefore: ■ Manage inventory” and ■ adversely-select less than they are adversely-selected against. Inventory management is the topic; the formula is the tool.

The Avellaneda-Stoikov derivation, walked through

The Avellaneda-Stoikov result is widely cited and infrequently derived. Walking through it makes the parameters meaningful.

Step 1 — the MM’s utility. Assume the MM has exponential utility (constant absolute risk aversion):

U(W, q) = -exp(-γ · (W + q · S))

Where W is cash, q is signed inventory (positive = long), S is mid-price, and γ > 0 is the risk-aversion parameter. Higher γ means more aversion to inventory.

Step 2 — the value function. The MM chooses bid and ask quotes δ_b and δ_a (half-spreads from mid) to maximize expected utility from now until horizon T. The value function v(t, S, W, q) satisfies a Hamilton-Jacobi-Bellman (HJB) equation:

v_t + (½σ²) v_SS + max_{δ_b, δ_a} { λ(δ_b)·[v(t, S, W − S + δ_b, q+1) − v] + λ(δ_a)·[v(t, S, W + S + δ_a, q−1) − v] } = 0

Where λ(δ) is the Poisson arrival rate of fills as a function of the half-spread (wider spread → fewer fills, exponentially decaying).

Step 3 — the reservation price. Avellaneda and Stoikov observed that under the exponential-utility ansatz, the value function factorises and reduces to a closed-form for the reservation price — the price at which the MM is indifferent between holding her current inventory and not:

r(s, q, t) = s − q · γ · σ² · (T − t)

Where s is the current mid, q is signed inventory, γ is risk aversion, σ² is mid-price variance, and T − t is time-to-horizon.

Step 4 — the optimal half-spread. The optimal half-spread (around the reservation price, not around the mid) is:

δ* = (γ · σ² · (T − t)) / 2 + (1/γ) · ln(1 + γ/k)

Where k is the order-flow intensity calibration constant.

Step 5 — the actual quotes.

quote_bid = r − δ*
quote_ask = r + δ*

The intuition. When q > 0 (long inventory), r < s — the reservation price drops below mid. Both quotes shift down. The ask becomes more attractive; the bid becomes less. The market hits the ask, your inventory drops back toward zero. When q < 0, the reverse. The model self-corrects inventory by quoting against the inventory, not by hedging it after the fact.

A concrete numerical example. Suppose s = 60 000 USDT, σ = 0.001 per tick (vol of mid in tick units), γ = 0.1, q = +5 BTC (5 BTC long), T − t = 1 hour ≈ 3 600 s, k = 1.5. Then:

σ² · (T − t) = 0.000001 · 3600 = 0.0036
r = 60 000 − 5 · 0.1 · 0.0036 · 60 000 = 60 000 − 10.8 = 59 989.2
δ* = (0.1 · 0.0036) / 2 + (1/0.1) · ln(1 + 0.1/1.5)
= 0.00018 + 10 · ln(1.0667)
= 0.00018 + 0.6453
≈ 0.6455 in tick units, ≈ 38.7 USDT in dollar terms (roughly)

So the MM quotes bid ≈ 59 950 USDT, ask ≈ 60 028 USDT — a meaningful skew in the direction of unwinding the long inventory.

Real systems use very different γ and shorter horizons. The point of the example is the direction of the shift, not the magnitude.
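The five steps above collapse into about ten lines of code. A sketch implementing the reservation-price and half-spread formulas exactly as derived — the parameter values are illustrative, not calibrated, and chosen (unlike the worked example above) to keep all quantities in plain price units:

```python
import math

def as_quotes(s: float, q: float, gamma: float, sigma2: float,
              tau: float, k: float) -> tuple[float, float, float]:
    """Avellaneda-Stoikov quotes.
    s: mid-price; q: signed inventory; gamma: risk aversion;
    sigma2: mid-price variance per unit time; tau: time to horizon T − t;
    k: order-flow intensity decay constant.
    Returns (bid, reservation_price, ask)."""
    r = s - q * gamma * sigma2 * tau                                # step 3
    half = gamma * sigma2 * tau / 2 + (1 / gamma) * math.log(1 + gamma / k)  # step 4
    return r - half, r, r + half                                    # step 5

# 5 units long → reservation price drops below mid, both quotes skew down.
bid, r, ask = as_quotes(s=60_000.0, q=5, gamma=0.001, sigma2=4.0, tau=60.0, k=1.5)
```

Running it confirms the self-correcting intuition: with q > 0 the reservation price sits below the mid, the ask is relatively aggressive, the bid relatively shy, and fills push inventory back toward zero.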

Inventory-based parameter calibration

The model has three parameters to calibrate from data: γ, σ², k. Market Making in Crypto (Stoikov et al., December 2024) walks through a calibration procedure on crypto perpetuals using the open-source Hummingbot platform. The headline contributions: an alpha signal called Bar Portion derived from candlestick data that improves directionality, and a calibration framework that estimates k from observed limit-order arrival rates rather than treating it as a free parameter. The follow-up paper "Logarithmic regret in the ergodic Avellaneda-Stoikov market making model" (arXiv:2409.02025, accessed May 2026) shows that a maximum-likelihood estimator achieves logarithmic regret bounds when learning the price-sensitivity parameter k online — meaning a properly-calibrated MM converges fast.
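The simplest version of the k-calibration idea — not the paper's estimator, just the intuition — is that if fill arrivals decay as λ(δ) = A·exp(−k·δ), then log fill-rate is linear in the quoted half-spread, and k falls out of an ordinary regression on observed (half-spread, fill-rate) pairs. A sketch on exact synthetic data:

```python
import math

def fit_intensity(deltas: list[float], rates: list[float]) -> tuple[float, float]:
    """Fit (A, k) in λ(δ) = A·exp(−k·δ) by OLS on log rates:
    log λ = log A − k·δ, so −slope is k and exp(intercept) is A."""
    n = len(deltas)
    xbar = sum(deltas) / n
    ybar = sum(math.log(r) for r in rates) / n
    sxy = sum((d - xbar) * (math.log(r) - ybar) for d, r in zip(deltas, rates))
    sxx = sum((d - xbar) ** 2 for d in deltas)
    slope = sxy / sxx
    return math.exp(ybar - slope * xbar), -slope  # (A, k)

# Synthetic fill rates generated from true A = 10, k = 1.5 (no noise).
deltas = [0.5, 1.0, 1.5, 2.0, 2.5]
rates = [10 * math.exp(-1.5 * d) for d in deltas]
A, k = fit_intensity(deltas, rates)
```

On real fills the rates are noisy Poisson counts and the online MLE of the arXiv paper is the right tool; the regression version is the whiteboard explanation of what it converges to.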

In practice for a small shop:

Cross-exchange MM — the most-profitable variant

A robust pattern in crypto: quote on the thinner venue, hedge on the deeper venue. You earn the thinner venue’s spread (which is wider, by definition) and pay the deeper venue’s taker fee for the hedge. Net of fees and latency, this is durably profitable on crypto if you pick the venue pair carefully and your hedge latency is sub-100 ms.

Concrete example: quote BTC-USDT on a tier-2 venue at 1 bp spread, hedge each fill via a market order on Binance perp at 2 bp net (taker fee + half-spread). Net edge = 0.5 bp (your half of the quoted spread) − 2 bp (full hedge cost) = −1.5 bp per fill. This loses on every trade in isolation. The trick is the fee structure: at MM tier on the thin venue you receive a −0.005% rebate (+0.5 bp per fill), and at top tier on Binance you can hedge passively at a 0.005% maker fee (≈0.5 bp all-in instead of the 2 bp taker cost). Re-do the arithmetic (0.5 + 0.5 − 0.5) and the trade goes from −1.5 bp to +0.5 bp expected value. That half-basis-point is the entire business.
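The fee arithmetic compresses into a few lines. A sketch: `cross_mm_edge_bp` is a hypothetical helper name, and the two calls reproduce the taker-hedge and rebate-plus-passive-hedge cases described above.

```python
def cross_mm_edge_bp(quoted_spread_bp: float, quote_fee_bp: float,
                     hedge_cost_bp: float) -> float:
    # Expected edge per filled quote, in basis points: you capture half the
    # quoted spread, pay (or receive, if negative) the quote-venue fee, and
    # pay the full hedge cost on the deep venue.
    return quoted_spread_bp / 2 - quote_fee_bp - hedge_cost_bp

print(cross_mm_edge_bp(1.0, 0.0, 2.0))    # taker hedge, no rebate: -1.5 bp
print(cross_mm_edge_bp(1.0, -0.5, 0.5))   # MM-tier rebate + passive hedge: +0.5 bp
```

The function is trivial on purpose: the edge of this strategy lives entirely in the fee tiers, so the calculator belongs in config review, not in the hot path.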

Toxicity filters: VPIN

VPIN (Volume-Synchronised Probability of Informed Trading) is the canonical real-time toxicity metric. Practical Guide gives the formula: “to estimate the incidence of a crash, the authors develop a volume-based probability of informed trading, or VPIN metric: VPIN ≈ (1/(nV)) · Σ_τ |V_S^τ − V_B^τ|”, where V_S^τ and V_B^τ are the sell- and buy-classified volumes within volume bucket τ of size V, summed over the n most recent buckets.

The operational effect: VPIN spikes when buy and sell volume become asymmetric within a volume window. Asymmetric volume implies one-sided informed flow, implies the MM is about to be picked off. When VPIN crosses a threshold (typically the 80th or 90th percentile of historical VPIN for the symbol), the MM should widen quotes or pull quotes entirely.

The known weakness is volume-bucketing bias: the metric depends on how you classify volume into buy/sell (Lee-Ready algorithm, BVC, tick rule), and the classification is itself imperfect. In high-frequency crypto data with frequent quote sweeps, the bucketing introduces noise. The pragmatic fix is to compute VPIN with two classifiers and pull quotes when both trigger.
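A minimal volume-bucketed VPIN, assuming trades are already classified into buy and sell volume (the classifier choice, per the caveat above, is where the noise lives; function and parameter names are mine):

```python
def vpin(buy_vols, sell_vols, bucket_size: float, n_buckets: int = 50) -> float:
    # Fill buckets of `bucket_size` total volume; VPIN is the mean of
    # |V_B - V_S| / bucket_size over the most recent n_buckets.
    imbalances = []
    vb = vs = 0.0
    for b, s in zip(buy_vols, sell_vols):
        vb += b
        vs += s
        if vb + vs >= bucket_size:       # bucket full (overflow ignored in this sketch)
            imbalances.append(abs(vb - vs) / bucket_size)
            vb = vs = 0.0
    if not imbalances:
        return 0.0
    recent = imbalances[-n_buckets:]
    return sum(recent) / len(recent)

print(vpin([10] * 100, [10] * 100, 40.0))  # balanced flow -> 0.0
print(vpin([20] * 100, [0] * 100, 40.0))   # one-sided flow -> 1.0
```

The two-classifier discipline from the text then becomes: run this once with Lee-Ready volumes and once with BVC volumes, and pull quotes only when both cross the percentile threshold.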

Toxicity filters: Kyle’s lambda

Kyle’s lambda comes from Kyle (1985). Aldridge HFT describes it: “Kyle (1985) analyzes how a single informed trader could best take[…] advantage of his information in order to maximize his profits. Kyle (1985) describes how information is incorporated” into prices through trade. The lambda — λ — is the slope of the price-impact function: how much the mid moves per unit of net order flow. Higher lambda means each unit of flow moves the price more, which means the market is being pushed by an informed trader.

In code, a rolling Kyle’s lambda regression looks like:

# rolling 60-second window
import numpy as np

def kyle_lambda(trades_df):
    # signed_volume[t] = volume × (+1 if buy, -1 if sell)
    # delta_mid[t] = mid[t+1] - mid[t]
    X = np.column_stack([
        np.ones(len(trades_df)),            # intercept column for alpha
        trades_df['signed_volume'].values,  # net signed order flow
    ])
    y = trades_df['delta_mid'].values
    # OLS: delta_mid = alpha + lambda · signed_volume
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(coef[1])   # lambda: price impact per unit of signed volume

Production discipline: compute λ on a rolling window matching the MM strategy’s reaction time (60 s for slow quotes, 10 s for aggressive ones); when λ exceeds the historical 90th percentile, the strategy should reduce quote size, not pull entirely — Kyle’s λ is slower-moving than VPIN and false positives are costly.

Queue position economics

First in line ≠ best fill. Toxic flow hits the front of the queue first — informed traders lift the resting orders at the front in the instant before the price leaves the level. This is why “be first to a level the market is about to revisit” beats “race to the front of every level.”

The policy:

A regime classifier upstream of the MM is therefore not optional. The cheapest classifier is a 30-minute ATR ratio: if realised vol over 30 min divided by realised vol over 4 hours exceeds 1.5, you are in a regime shift; widen quotes and reduce size.
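A sketch of that vol-ratio classifier, using realised vol of log returns as a stand-in for ATR (the 1.5 threshold is the one from the text; the function name and window handling are assumptions):

```python
import numpy as np

def regime_shift(prices_fast: np.ndarray, prices_slow: np.ndarray,
                 threshold: float = 1.5) -> bool:
    # Realised vol over the recent window (e.g. 30 min of bars) vs. the long
    # window (e.g. 4 h), at the same bar frequency. Ratio above threshold
    # means a regime shift: widen quotes and reduce size.
    rv_fast = np.std(np.diff(np.log(prices_fast)))
    rv_slow = np.std(np.diff(np.log(prices_slow)))
    return bool(rv_fast / rv_slow > threshold)

rng = np.random.default_rng(42)
calm = 60_000 * np.exp(np.cumsum(rng.normal(0, 1e-4, 480)))   # baseline regime
spiky = 60_000 * np.exp(np.cumsum(rng.normal(0, 1e-3, 30)))   # vol spike
print(regime_shift(spiky, calm))   # -> True
```

The classifier is deliberately dumb: its job is to scale risk off fast, not to be right about why vol moved.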

Production code skeleton

A compact AS-quoter that ties the §3 polyglot stack to the §5 mathematics:

# strategy/avellaneda.py
import math
import asyncio
import ob_core  # the Rust extension from §3


class ASMaker:
    def __init__(self, gamma=0.1, k=1.5, horizon_s=300, max_inv=10.0):
        self.book = ob_core.OrderBook()
        self.q = 0.0
        self.gamma = gamma
        self.k = k
        self.T = horizon_s
        self.t0 = None
        self.max_inv = max_inv
        self.vpin_state = ob_core.VPINWindow(window_volume=1_000_000)
        self.lambda_state = ob_core.KyleLambda(window_seconds=60)

    def _toxic(self) -> bool:
        vpin = self.vpin_state.value()
        lam = self.lambda_state.value()
        return vpin > 0.8 or lam > self.lambda_state.p90_threshold()

    def _quotes(self, s: float, sigma2: float, t_left: float) -> tuple[float, float]:
        r = s - self.q * self.gamma * sigma2 * t_left
        delta = (self.gamma * sigma2 * t_left) / 2 + (1.0 / self.gamma) * math.log(1.0 + self.gamma / self.k)
        return r - delta, r + delta

    async def on_tick(self, msg, t_now: float, sigma2: float):
        self.book.apply_delta(msg.side, msg.price, msg.qty)
        if self.t0 is None:
            self.t0 = t_now
        s = self.book.microprice(5)
        if s is None:
            return
        # hard inventory cap - non-negotiable
        if abs(self.q) >= self.max_inv:
            await self.cancel_all()
            await self.flatten_via_taker()
            return
        # toxicity gate
        if self._toxic():
            await self.cancel_all()
            return
        t_left = max(self.T - (t_now - self.t0), 1.0)
        bid_px, ask_px = self._quotes(s, sigma2, t_left)
        await self.replace_quotes(bid_px=bid_px, ask_px=ask_px, size=self.size_for_inventory())

The hot loop (orderbook, VPIN, lambda) runs in Rust through the ob_core extension; the Python on_tick is called per market-data event but spends most of its time waiting on async I/O — the GIL is released for the duration of the awaits.

Real failure-mode case studies

  1. The MM-in-a-trend disaster. When σ² is mis-estimated (e.g. on a low-vol historical window) and the market enters a trend, the AS quoter accumulates inventory in the wrong direction faster than its skew can unwind it. The cure is twofold: a regime classifier that scales risk off on σ-spikes (not just adjusts the formula), and a hard inventory cap that cancels quotes (the max_inv branch above) rather than just adjusting them.
  2. The Hyperliquid HLP / JELLYJELLY incident, March 2025. A whale opened a short position on JELLYJELLY on Hyperliquid while simultaneously dumping the spot token on a DEX, crashing the on-chain price. Hyperliquid’s HLP (the protocol-owned market-making vault) was forced to take over the short, then the whale bought spot to squeeze the short, driving the token up by ~250%. The HLP ended up with roughly $13.5M in unrealized losses (cryptonews.com, accessed May 2026; The Block, accessed May 2026). The validators voted to delist JELLY perps; users (apart from flagged addresses) were made whole from the Hyper Foundation. The architectural lesson: a single-venue MM that cannot move inventory off-venue is a single-venue MM that can be cornered. Cross-venue MM with a hedging pipe is structurally safer.
  3. Latency-out adverse selection. Your WS feed lags by 200 ms during a vol event; you quote stale prices for 200 ms; informed flow lifts you on every quote. By the time the WS recovers your inventory is blown. Cure: stale-feed detector with millisecond resolution and an auto-cancel on staleness — the strategy should never wait for the WS to recover before pulling quotes.
  4. Regime-shift adverse selection. You calibrated γ, k, σ² on a low-vol regime; vol triples; the formula's parameters are wrong; you lose your shirt before recalibration. Cure: continuous online recalibration of σ² and k; never assume yesterday's calibration applies today.

§6 — Other HFT strategies on crypto

Statistical arbitrage

Pairs (BTC-ETH, BTC-SOL), baskets (alts versus BTC.D), or basis (perp-spot). The classical approach:

  1. Cointegration test (Engle-Granger or Johansen) to confirm a stable long-run relationship between two series. Aldridge HFT references Engle’s foundational work (Engle 1982, 2000) on time-series econometrics that underpins these tests.
  2. Estimate the spread as spread = price_A - β · price_B where β is the regression coefficient.
  3. Estimate the half-life of mean reversion using an Ornstein-Uhlenbeck fit: dS = θ(μ - S)dt + σ dW; half-life = ln(2)/θ.
  4. Trade signal: enter when |spread − μ| > k·σ_spread; exit when |spread − μ| < ε.
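Steps 2–3 compress into a short estimator. The AR(1) regression below is the standard discrete-time route to θ (a sketch; `spread` is assumed to be the β-hedged spread series at your bar frequency):

```python
import numpy as np

def ou_half_life(spread: np.ndarray) -> float:
    # Discretise dS = theta*(mu - S)dt + sigma dW as the AR(1) regression
    #   S[t+1] - S[t] = a + b*S[t] + eps, with theta = -b (dt = 1 bar),
    # then half-life = ln(2) / theta, in bars.
    ds = np.diff(spread)
    s_lag = spread[:-1]
    X = np.column_stack([np.ones_like(s_lag), s_lag])
    (_, b), *_ = np.linalg.lstsq(X, ds, rcond=None)
    return float(np.log(2) / -b)

# synthetic OU series with theta = 0.05: true half-life = ln(2)/0.05 ~= 13.9 bars
rng = np.random.default_rng(1)
s = np.zeros(5000)
for t in range(1, 5000):
    s[t] = s[t - 1] + 0.05 * (0.0 - s[t - 1]) + 0.1 * rng.standard_normal()
print(round(ou_half_life(s), 1))
```

If the estimated half-life is longer than your risk horizon, the pair is not tradeable at your frequency regardless of how pretty the cointegration test looks.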

The 2026 crypto specifics:

Triangular arbitrage

The cleanest pedagogical HFT strategy: pure math, no forecasting, three legs of IOC taker orders fired in parallel. Practical Guide defines the canonical case: “triangular arbitrage exploits temporary deviations from fair prices in three foreign exchange” pairs. In crypto the natural triangle is BTC/USDT × ETH/BTC × ETH/USDT on a single venue.

The closure condition:

edge = (BTC/USDT) × (ETH/BTC) − (ETH/USDT) ≠ 0   (after fees and slippage)

In practice on tier-1 venues (Binance, Bybit, OKX) the triangular spread is closed within milliseconds — pure latency-arb territory. Where it still pays in 2026: tier-3 venues with retail-grade matching, and on tier-1 during the chaos of a large liquidation cascade when the engine momentarily lags between the three pairs.

A compact closure code:

# strategy/triangular.py
import asyncio

async def close_triangle(venue, lots: float, fee_pct: float = 0.0002):
    # lots = BTC quantity on the first leg; the cycle is USDT -> BTC -> ETH -> USDT
    px = await venue.snapshot_top_of_book(["BTC/USDT", "ETH/BTC", "ETH/USDT"])
    btc_usdt = px["BTC/USDT"].ask   # pay the ask to buy BTC with USDT
    eth_btc = px["ETH/BTC"].ask     # pay the ask to buy ETH with BTC
    eth_usdt = px["ETH/USDT"].bid   # hit the bid to sell ETH for USDT

    implied_eth_usdt = btc_usdt * eth_btc   # cost of 1 ETH via the BTC leg
    eth_qty = lots / eth_btc                # ETH obtainable with `lots` BTC
    edge_per_eth = (eth_usdt - implied_eth_usdt) - 3 * fee_pct * eth_usdt
    if edge_per_eth <= 0:
        return None
    orders = await asyncio.gather(
        venue.send_ioc("BUY", "BTC/USDT", lots, btc_usdt),
        venue.send_ioc("BUY", "ETH/BTC", eth_qty, eth_btc),
        venue.send_ioc("SELL", "ETH/USDT", eth_qty, eth_usdt),
    )
    return orders, edge_per_eth * eth_qty

The send-three-IOCs-in-parallel pattern is the right one. Any sequenced execution gives the market 5–10 ms to close the edge against you.

Latency arbitrage

Three subvariants with concrete venue pairs:

Funding-rate arbitrage

Long spot + short perp; harvest the funding rate spread when funding is positive. The carry is funding_rate × position_size × cycles_per_day minus borrow cost on the spot leg minus basis drift.

Capacity analysis: the trade is small per pair (you’re limited by the spot venue’s lending market depth or by your own borrow capacity), but it stacks across pairs. A diversified funding-rate book across 20–30 perp symbols can be a meaningful portion of a small shop’s P&L.

The “edge” beyond the steady-state carry is timing the entry and exit around the funding snap. Funding settles at fixed times; right after settlement is the cheapest entry (premium just paid out, basis collapsed). Right before settlement is the safest exit (locked-in funding accruing, basis stable).
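The steady-state carry arithmetic fits in one function (a sketch; the rates are illustrative and 3 cycles per day is the common 8-hour funding convention):

```python
def daily_carry_usd(notional: float, funding_rate_8h: float,
                    spot_borrow_apr: float, cycles_per_day: int = 3) -> float:
    # Long spot + short perp: collect funding when positive, pay borrow
    # on the spot leg. Basis drift is P&L noise on top, not modelled here.
    funding = funding_rate_8h * notional * cycles_per_day
    borrow = spot_borrow_apr / 365 * notional
    return funding - borrow

# $100k notional, +0.01% per 8h funding, 3% APR spot borrow
print(round(daily_carry_usd(100_000, 0.0001, 0.03), 2))  # -> 21.78
```

Roughly 22 USD per day per 100k of notional: small per pair, which is exactly why the strategy only matters stacked across 20–30 symbols.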

Liquidation hunting on perps

Predictive: model the cluster of liquidations near a price level, position to capture the cascade. This is one of the few crypto-native strategies with no obvious equities analogue.

The model needs three inputs:

  1. Open interest by leverage decile — public on Hyperliquid (transparent on-chain), estimable on CEXes from forced-liquidation tape.
  2. The cascade-trigger function — what fraction of OI gets liquidated as price crosses each level.
  3. The reflexivity multiplier — how many liquidations chain into more liquidations.

The strategy: identify a “thick” liquidation cluster (say, $50M of cumulative liquidation between 58 000 and 59 000 USDT on BTC), front-run by going long below the cluster, ride the cascade up, exit before reflexivity exhausts.

This is data-heavy and capital-heavy; it does not fit a 2-engineer team. It does fit a team with on-chain data infrastructure and access to historical liquidation tape across all major venues.

Defensive: spoofing and iceberg detection

You don’t run these strategies — you defend against them. Practical Guide describes spoofing: “in spoofing, the trader intentionally distorts the order book without execution; in the process” influencing other participants’ decisions. It also describes icebergs: “iceberg orders […] allow limit-order traders to display only a portion of their order in the limit order book, and keep the” rest hidden.

For an MM, spoofing manifests as phantom liquidity that vanishes the moment you’d interact with it — quote-cancel-fill ratios spike for orders that flicker on and off. Icebergs manifest as one-sided pressure from invisible orders — fills happen at sizes larger than the displayed liquidity should allow.

Detection statistics, in production:

When detection triggers, the response is to widen quotes and reduce size — not to engage. The opposite mistake (chasing the visible price thinking it’s real liquidity) is exactly what spoofers want.
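One way to operationalise the cancel-to-fill signal (a sketch; the threshold, window, and the "short-lived order" event feed are assumptions to tune per symbol):

```python
from collections import deque

class FlickerDetector:
    # Flags likely spoofing when short-lived orders at a level cancel far
    # more often than they fill ("phantom liquidity").
    def __init__(self, threshold: float = 20.0, window: int = 200):
        self.events = deque(maxlen=window)   # "cancel" or "fill" per short-lived order
        self.threshold = threshold

    def record(self, event: str) -> None:
        self.events.append(event)

    def spoofy(self) -> bool:
        cancels = sum(1 for e in self.events if e == "cancel")
        fills = sum(1 for e in self.events if e == "fill")
        return cancels / max(fills, 1) > self.threshold

d = FlickerDetector()
for _ in range(100):
    d.record("cancel")
d.record("fill")
print(d.spoofy())   # 100 cancels to 1 fill -> True
```

When `spoofy()` trips, the response from the text applies: widen and shrink, never chase.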

ML signals as a strategy multiplier

Feature engineering for HFT is its own craft. Time Series Analysis with Python Cookbook introduces feature engineering recipes, including “detecting contextual outliers with feature engineering” as a chapter-level focus, and walks through sktime-based pipelines that combine exogenous variables and ensemble learning. The HFT-specific feature catalog is short and well-known:

Supervised learning on these features predicts next-tick mid-price changes with marginal AUC over 50%. RL approaches treat the feature set as state and learn a quoting policy — dramatically harder to train, occasionally better in production.
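A few entries from that catalog, computed at top-of-book (an illustrative sketch; the feature names are mine, and the microprice here is the standard queue-weighted one):

```python
def book_features(bid_px: float, bid_qty: float, ask_px: float, ask_qty: float) -> dict:
    # A small illustrative subset of the HFT feature catalog.
    total = bid_qty + ask_qty
    mid = (bid_px + ask_px) / 2
    microprice = (bid_px * ask_qty + ask_px * bid_qty) / total  # queue-weighted
    return {
        "spread": ask_px - bid_px,
        "imbalance": (bid_qty - ask_qty) / total,    # +1 = all bid, -1 = all ask
        "micro_minus_mid": microprice - mid,         # short-horizon drift proxy
    }

f = book_features(59_999.5, 12.0, 60_000.5, 4.0)
print(f["imbalance"], f["micro_minus_mid"])   # 0.5 0.25
```

A heavy bid queue pulls the microprice above mid, which is the marginal-AUC signal the supervised models are mostly learning.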

Foundation models for time-series are the 2024–2026 frontier. Time Series Forecasting Using Foundation Models opens by saying: “the transformer architecture was proposed [for natural language but we now] study the transformer architecture from a time-series forecasting point of view.” The pragmatic answer for HFT: foundation models are useful for regime classification and macro-feature generation, not for tick-by-tick prediction — the time-scales mismatch. A 100M-parameter transformer is not faster than LightGBM at 1 ms inference, and at HFT inference latency is half of the value.

§7 — Trading automation via MCP / GenAI connectors

Anthropic introduced the Model Context Protocol in November 2024 as an open standard for connecting LLM agents to tools and data sources. By May 2026 the spec is at version 2025-11-25 (modelcontextprotocol.io/specification/2025-11-25, accessed May 2026). The 2026 roadmap focuses on three areas: streamable HTTP transport (so MCP servers can run as remote services rather than local processes), task primitives for long-running asynchronous work, and enterprise readiness — audit trails, SSO-integrated auth, gateway behaviour, configuration portability (blog.modelcontextprotocol.io — 2026 roadmap, accessed May 2026). Official SDKs exist for Python, TypeScript, C#, Java, Kotlin, and PHP; community SDKs cover Rust and Go.

The corpus’s Hands-On Machine Learning with scikit-learn and PyTorch references the protocol directly in its agent-orchestration chapter — and names the load-bearing rule that any production deployment must adopt: “LLMs are often unreliable, so let’s keep humans in the loop for important matters, shall we?” That human-in-the-loop discipline is the single most important rule of using LLM agents in trading.

Agents are researchers, not executors

Live order placement requires deterministic policy gates and human approval. Period.

Every agent-action has to be logged with input + chain-of-thought + tool calls — every. single. one. If you can’t reproduce why an agent did something, you can’t operate it. The pattern from Machine Learning Platform Engineering of routing queries through an LLM (result = self.router_llm.invoke(routing_prompt)) is a research-time pattern; it never flows to a live order endpoint.

Reference architecture

┌──────────────────────────────────────────────────────────────┐
│ Agent (Claude Opus 4.7 / GPT-5 / Llama-X) │
│ │ │
│ │ MCP protocol (stdio or streamable HTTP) │
│ │ │
│ ├──→ exchange-data MCP (read: L2 books, trades) │
│ ├──→ on-chain MCP (read: mempool, defillama) │
│ ├──→ news/social MCP (read: filtered firehose) │
│ ├──→ knowledge-base MCP (read: corpus search) │
│ ├──→ backtest-runner MCP (read+exec: your framework) │
│ └──→ research-notes MCP (write: append-only log) │
└──────────────────────────────────────────────────────────────┘


┌────────────────────┐
│ Policy gate + │
│ human review │
└────┬───────────────┘
│ (manual approval)

┌────────────────────┐
│ Strategy engine │
│ (Rust + Python) │
└────────────────────┘

Connector inventory in detail

For each connector, the schema and side-effects matter more than the prose.

Exchange-data MCP — one server per venue. Read-only.

On-chain MCP — read-only.

News/social MCP — filtered firehose, read-only.

Knowledge-base MCP — local document corpus.

Backtest-runner MCP — read + execute on your own infrastructure.

Research-notes MCP — write-only, append-only.

Reference Python skeleton

A minimal MCP server for the exchange-data connector, using FastAPI patterns from the corpus:

# servers/exchange_data.py
from contextlib import asynccontextmanager
from collections.abc import AsyncIterator
from mcp.server import Server
from mcp.types import Tool, TextContent
from fastapi_cache import FastAPICache
from fastapi_cache.backends.redis import RedisBackend
import redis.asyncio as redis
import asyncio


server = Server("exchange-data")


@asynccontextmanager
async def lifespan(_server) -> AsyncIterator[None]:
    redis_client = redis.from_url("redis://localhost:6379")
    FastAPICache.init(RedisBackend(redis_client), prefix="exchdata-cache:")
    try:
        yield
    finally:
        await redis_client.close()


@server.list_tools()
async def tools():
    return [
        Tool(
            name="get_book",
            description="Snapshot of L2 orderbook for a symbol on a venue. Read-only.",
            inputSchema={
                "type": "object",
                "properties": {
                    "venue": {"type": "string", "enum": ["binance", "bybit", "okx", "hyperliquid"]},
                    "symbol": {"type": "string"},
                    "depth": {"type": "integer", "default": 20, "maximum": 100},
                },
                "required": ["venue", "symbol"],
            },
        ),
    ]


@server.call_tool()
async def call(name: str, args: dict):
    if name == "get_book":
        # book_cache: the Redis-backed snapshot cache initialised via lifespan()
        snapshot = await book_cache.get(args["venue"], args["symbol"], args.get("depth", 20))
        return [TextContent(type="text", text=snapshot.to_json())]
    raise ValueError(f"unknown tool: {name}")


if __name__ == "__main__":
    asyncio.run(server.run_stdio(lifespan=lifespan))

The cache pattern is from Building Generative AI Services with FastAPI, which gives the canonical install-and-configure recipe: “you can install FastAPI cache using the following command: pip install "fastapi-cache2[redis]" … configuring FastAPI cache lifespan …” The book continues with a Redis-backed lifespan manager exactly matching the snippet above. Caching is non-optional for the exchange-data MCP; agents query it ten times per second and the venue API costs would otherwise dominate the budget.

Token-budget circuit breaker

Bounded-cost agent execution is itself a discipline. The pattern:

class BoundedAgentRun:
    def __init__(self, agent, max_input_tokens=200_000, max_output_tokens=20_000):
        self.agent = agent
        self.max_in = max_input_tokens
        self.max_out = max_output_tokens
        self.consumed_in = 0
        self.consumed_out = 0

    async def step(self, prompt: str):
        # pre-flight check with a rough ~4-chars-per-token estimate, so a
        # budget-busting prompt fails before the API call, not after it
        if self.consumed_in + len(prompt) // 4 > self.max_in:
            raise RuntimeError("agent input budget exhausted")
        response = await self.agent.respond(prompt)
        self.consumed_in += response.input_tokens
        self.consumed_out += response.output_tokens
        if self.consumed_out > self.max_out:
            raise RuntimeError("agent output budget exhausted")
        return response

The error conditions are intentional: an agent that loops on a bad query burns budget. Cap per-task token spend; raise visibly when the cap is hit; never silently truncate.

Provenance manifest

Every artifact the agent produces carries a manifest that lets a human reproduce it:

{
  "hypothesis_id": "h-2026-04-12-0008",
  "agent": "claude-opus-4-7",
  "agent_model_version": "20260301",
  "temperature": 0.2,
  "ingest_sources": [
    "Aldridge, HFT 2nd ed., ch. 7 (Avellaneda-Stoikov)",
    "Aldridge, HFT 2nd ed., ch. 12 (VPIN)",
    "Stoikov et al. 2024, SSRN 5066176"
  ],
  "tool_calls": [
    {"tool": "get_klines", "args": {"venue": "binance", "symbol": "BTCUSDT"}, "ts_iso": "2026-04-12T08:14:23Z"},
    {"tool": "run_backtest", "args": {"config_hash": "sha256:abc123..."}, "ts_iso": "2026-04-12T08:17:01Z"}
  ],
  "result_summary": {"median_sharpe": 1.21, "p05_sharpe": 0.34, "kill": false},
  "human_review_status": "pending",
  "human_reviewer": null,
  "human_review_decision": null
}

Without this, an agent’s “great new strategy” is a black box. With it, you can audit, reproduce, and reject on principle.

Hard rules with reasoning

  1. No write-tools to production trading systems. Ever. Research notes only. Reason: the failure mode of an agent placing a live order is unbounded; the failure mode of an agent writing a bad note is bounded.
  2. Token-budget circuit breakers per task. Reason: an agent on a bad query will loop; without a cap, you wake up to a four-figure API bill.
  3. Provenance manifest on every artifact. Reason: the agent’s output is a draft for human review; the manifest is what makes human review possible.
  4. Sandboxing — MCP servers in separate processes/containers. Reason: the agent talks to servers over stdio or HTTP, never by sharing memory; a compromised connector cannot exfiltrate from another connector.
  5. No agent decision is ever final. Reason: this is the human-in-the-loop principle that Hands-On Machine Learning with scikit-learn and PyTorch names directly.

Anti-patterns

§8 — Backtesting and auto-research with AI

Backtesting is the most failure-prone part of any trading system. HFT-specific compounders make it worse.

Why HFT backtests are uniquely hard

A complete enumeration:

Walk-forward methodology

Market Timing With Moving Averages by Zakamulin gives the canonical recipe: “in an out-of-sample testing procedure, in-sample segment of data can be either rolling or expanding.” The two variants:

For HFT, rolling is almost always right. The market regimes that mattered in 2018 don’t apply in 2026. The window size W is the tunable parameter; pick a window long enough for parameter stability (≥ 30 days for daily-cadence strategies, ≥ 5 days for tick-cadence) and short enough to discard regime-stale data.

Monte Carlo perturbation

Once you have a walk-forward backtest, run it 1 000 times with perturbed inputs to get a distribution of Sharpe ratios rather than a point estimate. Chan’s Algorithmic Trading: Winning Strategies and Their Rationale is direct: “unlike Monte Carlo optimization, the historical returns offer insufficient data to determine an optimal leverage that works well for many realizations. Despite these caveats, brute force optimization over the backtest” remains the practical baseline once Monte Carlo perturbation is added on top.

What to perturb:

A strategy that survives 1 000 Monte Carlo paths with median Sharpe > 1 and 5th-percentile Sharpe > 0 is a strategy worth paper-trading. A strategy with median Sharpe 1.5 but 5th-percentile -0.5 is a strategy that will work on average and ruin you on the bad runs.
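A sketch of the perturbation loop itself: block bootstrap plus per-period cost jitter. The block length, jitter size, and the daily annualisation factor are assumptions, not prescriptions.

```python
import numpy as np

def mc_sharpe(returns: np.ndarray, n_paths: int = 1000,
              fee_jitter_bp: float = 0.5, block: int = 50, seed: int = 0):
    # Resample the backtest's per-period returns in blocks (preserving
    # short-range autocorrelation) and jitter per-period costs; return the
    # median and 5th-percentile annualised Sharpe across paths.
    rng = np.random.default_rng(seed)
    n = len(returns)
    sharpes = np.empty(n_paths)
    for i in range(n_paths):
        starts = rng.integers(0, n - block, size=n // block + 1)
        path = np.concatenate([returns[s:s + block] for s in starts])[:n]
        path = path - rng.normal(0.0, fee_jitter_bp * 1e-4, size=n)
        sharpes[i] = path.mean() / path.std() * np.sqrt(365)
    return float(np.median(sharpes)), float(np.percentile(sharpes, 5))

rng = np.random.default_rng(7)
daily = rng.normal(0.001, 0.01, 2000)   # toy edge: 10 bp/day, 1% daily vol
med, p05 = mc_sharpe(daily)
```

The pass/fail rule from the text maps directly onto the two returned numbers: median > 1 and 5th percentile > 0, or the strategy never leaves research.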

Combinatorial purged cross-validation

López de Prado’s combinatorial purged cross-validation (CPCV) is the state-of-the-art antidote to the bias compounders for ML-driven HFT strategies. The method “systematically constructs multiple train-test splits, purges overlapping samples, and enforces an embargo period to prevent information leakage” (towardsai.net — CPCV, accessed May 2026; foundational paper SSRN 4778909, accessed May 2026).

The mechanics: divide a time-series dataset into N sequential, non-overlapping groups that preserve temporal order. Then choose all combinations of k groups (k < N) as test sets, with the remaining N − k groups used for training. Purging removes training samples that overlap in time with test samples; embargoing enforces a no-information-flow gap immediately after test windows. The result is a distribution of performance metrics across many backtest paths, enabling the Deflated Sharpe Ratio as a rigorous test statistic.

For HFT specifically, CPCV’s main advantage over walk-forward alone is that you get many paths instead of one, so a single bad regime doesn’t doom or vindicate the entire strategy.
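The group-combinatorics part is easy to sketch. Real purging needs per-sample event horizons, so this version approximates purging with the embargo alone — an assumption worth stating:

```python
from itertools import combinations

def cpcv_splits(n_samples: int, n_groups: int = 6, k_test: int = 2, embargo: int = 0):
    # N sequential groups; every C(N, k) combination of groups becomes a test
    # set; training drops the test samples plus an embargo window after each
    # test group (approximating purging, which also needs event-overlap info).
    bounds = [(g * n_samples // n_groups, (g + 1) * n_samples // n_groups)
              for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), k_test):
        test_idx, banned = set(), set()
        for g in test_groups:
            lo, hi = bounds[g]
            test_idx.update(range(lo, hi))
            banned.update(range(lo, min(hi + embargo, n_samples)))
        train_idx = [i for i in range(n_samples) if i not in banned]
        yield train_idx, sorted(test_idx)

splits = list(cpcv_splits(600, n_groups=6, k_test=2, embargo=10))
print(len(splits))   # C(6,2) = 15 paths
```

Fifteen paths from six groups is already enough to compute a distribution of Sharpes, which is the input the Deflated Sharpe Ratio needs.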

Time-series-specific backtest pitfalls

Time Series Analysis with Python Cookbook covers the pitfalls of train-test splits with autocorrelated data — pure k-fold cross-validation on time-series is wrong because it implicitly leaks future-into-past. Time Series Forecasting Using Foundation Models extends the discussion to transformer-based forecasters, where the typical sequence-modelling tricks (random shuffling, batch construction) violate temporal causality unless explicitly handled.

The pragmatic discipline: for any time-series backtest, the only safe split is sequential. Train on the past; test on the future. Never the reverse. Never random.

The auto-research workflow

The AI part of the title. The workflow as a state machine:

┌─────────────────────────────────────────────────────────────┐
│ 1. Agent reads new paper / blog post / corpus chapter │
│ (via knowledge-base MCP) │
│ ↓ │
│ 2. Drafts a falsifiable hypothesis with provenance │
│ ("X feature on Y venue should predict Z over τ") │
│ ↓ │
│ 3. Calls backtest-runner MCP with parameter grid │
│ (single token-budget; bounded by circuit breaker) │
│ ↓ │
│ 4. Receives top-K results + sanity metrics │
│ (turnover, max consecutive losses, regime split) │
│ ↓ │
│ 5. Writes draft research note to research-notes MCP │
│ ↓ │
│ 6. Human reviews. Kill-rate is tracked. │
└─────────────────────────────────────────────────────────────┘

The kill-rate metric

The fraction of agent-proposed strategies that don’t survive walk-forward + Monte Carlo + paper-trading is the single best productivity metric for an auto-research pipeline. (“Kill-rate” is the author’s term, not a standard one.)

A healthy auto-research pipeline has a kill-rate of 90–95%. A pipeline with a kill-rate of 30% is overfitting at the agent-layer; you’re going to lose money in production. A pipeline with a kill-rate of 99% is wasting compute; tighten the agent’s hypothesis-generation prompt to filter low-quality ideas before they hit backtest.

Anti-patterns at the AI-research layer

In addition to the patterns in §7:

Foundation models for time-series — when it makes sense

A foundation transformer for time series is a useful tool but not a universal hammer. In Time Series Forecasting Using Foundation Models the framing is that “the transformer architecture was proposed [originally for natural language and] is now applied to forecasting.” The pragmatic decision rule:

A real workflow example end-to-end

A hypothetical agent run on the hypothesis “BTC perp basis × VIX is a regime-conditioned predictor of perp-spot mean reversion”:

  1. Agent ingests via kb_search("perp basis VIX") → finds chunks from Aldridge HFT on basis trades, plus a 2024 paper on cross-asset volatility-conditioned strategies.
  2. Agent drafts hypothesis: “When VIX is in its 80th percentile (high macro vol), BTC perp-spot basis mean-reverts faster than in the bottom 20th percentile.”
  3. Agent calls run_backtest(config={signal: 'basis', conditioner: 'vix_decile', mc_seeds: 1000}).
  4. Result: median Sharpe 0.8, 5th percentile -0.3, but inner quintile (40–60th) Sharpe of 1.4. Conditioning works in moderate-vol regimes; breaks in extremes.
  5. Agent writes note: “Hypothesis partially confirmed — restrict to 40–60th-percentile VIX. Recommend manual review for production scoping.”
  6. Human reviewer: accepts the conditioned version, adds a separate kill-switch on VIX > 90th percentile, deploys for paper-trading.

Without the agent, this workflow takes a quant analyst three days. With the agent, it takes 90 minutes including human review. The agent does not get to skip the human review.

§9 — Production concerns

The mathematics gets you to a paper-tradeable strategy. Production gets you to a P&L. The corpus has solid grounding here — the Latency book gives the framework, Machine Learning Platform Engineering covers the deploy and monitor stack, Mastering Software Architecture covers the patterns that hold a trading system together, and Blue Team Handbook covers the security side that nobody talks about until they’ve been compromised.

Latency budget per segment

Repeated for completeness; the per-segment table from §3 is the operational target:

Segment | Budget | Implementation
--- | --- | ---
WS frame arrival → kernel | 0–5 µs | Linux io_uring
WS frame decode | 5–15 µs | Rust + simd-json
Orderbook update | 1–5 µs | Rust, lock-free
Signal compute | 5–30 µs | Rust, SIMD where possible
Strategy decision | 100–1000 µs | Python branch via PyO3
Order encode | 5–15 µs | Rust
NIC send | 0–5 µs | Same as arrival
Total tick-to-trade | ~150–1100 µs |

For market-making strategies on crypto in 2026, sub-1-millisecond tick-to-trade is competitive. For latency arbitrage at the tier-1 level, you need < 100 µs and a co-located rack — the polyglot stack alone won’t get you there.

Co-location and cloud

The cheap version of co-location is “the same AWS region as the matching engine.” The Latency book covers the principle in its co-location chapter — same-region latency to the venue matching engine is typically 1–5 ms; cross-region is 50–200 ms. Cross-region is fatal for any latency-sensitive strategy.

The 2026 venue map:

If you trade more than one venue, you cannot be in the right region for all of them. Pick one — the dominant venue for your strategy — and accept asymmetric latency for the others.

Risk gates — the non-negotiables

Every strategy ships behind every one of these.

Kill switch at the firm level. Practical Guide references the canonical case: “a kill switch allows termination of all flow from a broker-dealer whose algorithms are determined to be corrupt. In the Knight Capital case, an execution-firm-level kill” switch was missing — and the firm lost ~$440M in 45 minutes. Implement as: a single privileged process holds the cancel-all + halt-all authority; any operator can hit it; the engine respects it within 100 ms.

Position limits per symbol and aggregate. Pre-trade-checked. Every order goes through a check that knows current position and would-be position; reject if the post-fill state exceeds the cap. Easy to implement, easy to forget the corner cases (multi-leg orders, partial fills, race conditions on simultaneous fills).
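A pre-trade check in its simplest form (a sketch; it assumes a single-threaded risk path, which is exactly the corner case the text warns about — concurrent fills need atomicity here):

```python
class PositionLimiter:
    # Pre-trade position check: reject any order whose post-fill position
    # would breach the per-symbol or aggregate cap.
    def __init__(self, per_symbol_cap: float, aggregate_cap: float):
        self.per_symbol_cap = per_symbol_cap
        self.aggregate_cap = aggregate_cap
        self.positions: dict = {}

    def check(self, symbol: str, signed_qty: float) -> bool:
        would_be = self.positions.get(symbol, 0.0) + signed_qty
        agg_others = sum(abs(p) for s, p in self.positions.items() if s != symbol)
        return (abs(would_be) <= self.per_symbol_cap
                and agg_others + abs(would_be) <= self.aggregate_cap)

    def on_fill(self, symbol: str, signed_qty: float) -> None:
        self.positions[symbol] = self.positions.get(symbol, 0.0) + signed_qty

lim = PositionLimiter(per_symbol_cap=10.0, aggregate_cap=15.0)
lim.on_fill("BTC", 8.0)
print(lim.check("BTC", 3.0))   # 8 + 3 = 11 > 10 per-symbol cap -> False
print(lim.check("ETH", 5.0))   # aggregate 8 + 5 = 13 <= 15 -> True
```

The check is on the would-be position, not the current one — the difference is exactly the partial-fill race the text calls out.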

Drawdown circuit breaker. Chan’s Quantitative Trading covers the pattern: maintain a running high-watermark of cumulative compounded returns, define drawdown at each step as the percentage shortfall from that watermark, and track drawdown duration alongside it. In production: max intra-day loss → halt strategy automatically; max 7-day drawdown → halt and require manual restart with a written incident note.
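Chan's watermark pattern in a few lines (a sketch over an equity curve in account-currency terms; the 5% trip level is illustrative):

```python
def drawdown_monitor(equity_curve, max_dd: float = 0.05):
    # Running high-watermark drawdown. Returns (drawdown, duration, halt):
    # drawdown = fractional shortfall from the watermark, duration = bars
    # spent below it, halt = True once drawdown exceeds max_dd.
    hwm = float("-inf")
    dd, duration = 0.0, 0
    for eq in equity_curve:
        if eq >= hwm:
            hwm, dd, duration = eq, 0.0, 0
        else:
            dd = 1.0 - eq / hwm
            duration += 1
    return dd, duration, dd > max_dd

dd, dur, halt = drawdown_monitor([100.0, 110.0, 99.0, 98.0])
print(round(dd, 3), dur, halt)   # 0.109 2 True
```

In production the `halt` flag feeds the automatic intra-day stop; the duration counter feeds the slower 7-day rule that requires a human restart.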

Self-trade prevention. Most CEXes implement it for you (a maker order from your account against a taker order from the same account is rejected). Verify in your own logs nonetheless.

Stale-feed detector with millisecond resolution. If WS hasn’t ticked in 200 ms, cancel all open quotes. The threshold is symbol-dependent — 200 ms is right for BTC-perp; 1 s might be right for an alt with 10 trades per minute.
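The detector itself is tiny; the hard part is wiring its trip to a cancel-all with the same urgency as the kill switch. A sketch using a monotonic clock:

```python
import time

class StaleFeedDetector:
    # Trip when the market-data feed has not ticked within threshold_ms.
    def __init__(self, threshold_ms: float = 200.0):
        self.threshold_s = threshold_ms / 1000.0
        self.last_tick = time.monotonic()

    def on_tick(self) -> None:
        self.last_tick = time.monotonic()

    def is_stale(self) -> bool:
        return time.monotonic() - self.last_tick > self.threshold_s

det = StaleFeedDetector(threshold_ms=50.0)
det.on_tick()
print(det.is_stale())        # just ticked -> False
time.sleep(0.06)
print(det.is_stale())        # 60 ms of silence -> True
```

`time.monotonic()` rather than wall-clock time matters here: an NTP step must never mask (or fake) a stale feed.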

Hot config reload and canary deploys

Strategy parameters (γ, k, position limits, fee tiers, venue selection) live in a config file the engine watches. Reload without restart. Never hand-edit live. Promote configs through canary-strategy → 1% capital → full deployment, with explicit rollback if any of the canary's metrics deviate beyond a threshold.

The discipline:

  1. Author writes config change as a PR
  2. CI runs the change against the last 5 days of historical data — does the strategy’s metric distribution shift?
  3. Canary deploy at 1% of normal capital; observe for 24 hours
  4. Promote to full if and only if metrics are within tolerance
  5. Rollback is a single command and is tested weekly
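Step 4's "within tolerance" gate can be as simple as a z-score screen on the canary's metric samples versus the baseline. An illustrative sketch (the function name and the 2-sigma default are placeholders, not a recommendation; heavier distribution tests belong in the research loop, not the deploy gate):

```python
from statistics import mean, stdev

def canary_within_tolerance(baseline, canary, z_limit=2.0):
    """Promote the canary only if its mean metric (per-hour PnL, fill
    ratio, ...) sits within z_limit standard errors of the baseline
    mean. Deliberately crude: the gate should be fast and legible."""
    if len(canary) < 2 or len(baseline) < 2:
        return False                 # not enough observations to decide
    se = stdev(baseline) / (len(canary) ** 0.5)
    if se == 0:
        return mean(canary) == mean(baseline)
    z = abs(mean(canary) - mean(baseline)) / se
    return z <= z_limit
```

Refusing to promote on insufficient data (rather than defaulting to "pass") is the conservative choice the 24-hour observation window exists to serve.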

Monitoring

Prometheus + Grafana is the boring 2026 default. Machine Learning Platform Engineering gives the canonical Helm-based install: “install Helm and use some popular Helm commands that help us install and update applications,” followed by helm repo add prometheus-community ... && helm upgrade -i prometheus prometheus-community/prometheus --namespace prometheus --create-namespace. Grafana installs the same way from its own chart repository, via helm install grafana grafana/grafana.

HFT-specific custom metrics worth adding beyond the Kubernetes basics:

| Metric | Why it matters |
| --- | --- |
| book_update_lag_microseconds{p50, p99, p999} | Detects ingestor-strategy decoupling |
| queue_position_decile{symbol} | Per resting order; mostly p9 means you're at the front of the queue, p1 at the back |
| fill_toxicity_vpin{symbol} | Rolling VPIN; alert at the p95 of its own history |
| cancel_to_fill_ratio{symbol, side} | A spike means your edge has decayed or someone is spoofing you |
| funding_pnl_realised vs expected | Detects funding-rate-arb mis-execution |
| inventory_distance_to_cap{symbol} | Early warning for one-sided inventory drift |
| latency_strategy_decision_microseconds{p50, p99, p999} | The Python branch; your single biggest variance source |

Alerts on tail percentiles, not means. The mean is always fine; the tails are where you die. The Latency book makes this point explicit and the operational discipline follows from it.
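A toy rolling tail tracker makes the point concrete: nearest-rank percentiles over a bounded window, with the alert keyed to p99/p999 rather than the mean. All names here are hypothetical; in production this is what the Prometheus histogram buckets are computing for you.

```python
from collections import deque

class TailTracker:
    """Rolling tail-percentile tracker for latency metrics. The mean
    of book_update_lag can look fine while the p999 quietly climbs
    past your decision budget."""

    def __init__(self, window: int = 10_000):
        self.samples = deque(maxlen=window)

    def observe(self, value_us: float):
        self.samples.append(value_us)

    def percentile(self, q: float) -> float:
        """q in [0, 1]; nearest-rank over the current window. An O(n
        log n) sort per query is fine for an alerting path."""
        if not self.samples:
            return 0.0
        ordered = sorted(self.samples)
        idx = min(int(q * len(ordered)), len(ordered) - 1)
        return ordered[idx]

    def breached(self, q: float, budget_us: float) -> bool:
        return self.percentile(q) > budget_us
```

The test below shows why the mean lies: 0.5 % of samples at 50x the normal latency barely move the average, but p999 flags it immediately.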

Architecture patterns

Mastering Software Architecture covers the patterns that let a trading system scale without becoming a tangled mess. The one most directly relevant to HFT is the staged pipeline.

The pragmatic application: an HFT engine should look like a pipeline of decoupled stages connected by ring buffers, not a monolithic strategy class. Stage decoupling is the difference between “fix one bug” and “rewrite half the engine.”
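The staged-pipeline idea can be sketched in a few lines of Python. This is illustrative only — a real engine runs the stages concurrently, in Rust, on lock-free buffers — but the shape is the point: each stage touches only its input and output buffer, and a full buffer drops the oldest item because, for market data, fresh beats complete.

```python
from collections import deque

class RingBuffer:
    """Bounded buffer between stages. deque(maxlen=...) evicts the
    oldest item on overflow: a strategy quoting on stale ticks is
    worse than one that skipped them."""

    def __init__(self, capacity: int):
        self.buf = deque(maxlen=capacity)

    def push(self, item):
        self.buf.append(item)

    def pop(self):
        return self.buf.popleft() if self.buf else None

def run_pipeline(ticks, stages, capacity=1024):
    """Drive items through decoupled stages (decode -> book -> signal
    -> order encode). Each stage knows only its own two buffers, so
    one stage can be rewritten, or moved to Rust, in isolation."""
    buffers = [RingBuffer(capacity) for _ in range(len(stages) + 1)]
    for t in ticks:
        buffers[0].push(t)
    for i, stage in enumerate(stages):
        while (item := buffers[i].pop()) is not None:
            buffers[i + 1].push(stage(item))
    out = []
    while (item := buffers[-1].pop()) is not None:
        out.append(item)
    return out
```

A batch driver like this is only a teaching device; the decoupling payoff shows up when each stage is its own thread (or process, or Rust task) and the buffers absorb their speed differences.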

Security — the part that nobody plans for until it bites

Blue Team Handbook (the corpus’s nod to defensive security) covers the incident-response checklist an algo-trading firm needs; trading firms are juicy targets.

Disaster recovery

Restart-replay protocol: snapshot orderbook + position state every 100 ms to a durable store; on restart, replay forward from the most recent snapshot. The protocol is straightforward in principle, full of corner cases in practice — the order of snapshot vs trade event, the gap between last snapshot and crash, the resumption of WS subscriptions.
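A sketch of the replay logic, with a hypothetical event shape and an in-memory stand-in for the durable store. The snapshot-vs-trade-event ordering problem the paragraph mentions reduces to a strict sequence-number comparison: an event with a sequence at or below the snapshot's is already reflected in the snapshot and must not be applied twice.

```python
import json

def apply_event(book, position, ev):
    # Hypothetical event shape for illustration:
    # {"type": "fill", "symbol": ..., "qty": ...}
    if ev["type"] == "fill":
        position[ev["symbol"]] = position.get(ev["symbol"], 0) + ev["qty"]

class SnapshotStore:
    """Restart-replay sketch: persist (seq, book, position)
    periodically; on restart, load the latest snapshot and replay
    every journaled event with a higher sequence number."""

    def __init__(self):
        self.latest = None   # stand-in for a durable store (disk, S3)

    def save(self, seq: int, book: dict, position: dict):
        self.latest = json.dumps(
            {"seq": seq, "book": book, "position": position})

    def restore(self, journal):
        """journal: iterable of (seq, event) in order, possibly
        overlapping the snapshot. Returns (book, position)."""
        state = json.loads(self.latest)
        book, position, seq = state["book"], state["position"], state["seq"]
        for ev_seq, ev in journal:
            if ev_seq <= seq:
                continue         # already captured by the snapshot
            apply_event(book, position, ev)
            seq = ev_seq
        return book, position
```

The remaining gap — events between the last journal write and the crash — is why the restore must end by resubscribing to the WS feed and reconciling positions against the venue's own records.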

The discipline: practice the full restart from snapshot once a month, in production, during a low-vol window. The first time you do it should not be when something is on fire.

§10 — What classical chart-reading still teaches the algo trader

Modern HFT literature can leave you with the impression that microstructure mathematics has rendered traditional technical analysis obsolete. It hasn’t. Two ideas from the older chart-reading canon survive intact into the algo era — and an MM or stat-arb operator who forgets them is the operator who gets ambushed by regime change.

Markets are nonlinear dynamical systems, not random walks. Bill Williams’s Trading Chaos: Maximize Profits with Proven Technical Techniques makes the point that “chaos” in the physics sense is not disorder — it is the study of complex nonlinear systems whose behaviour is deterministic but practically unpredictable. The implication for an Avellaneda-Stoikov quoter is direct: a pricing model that assumes Gaussian returns will be ambushed the moment the market enters a nonlinear regime where that assumption fails. Realised vol clusters; correlations break under stress; queue dynamics flip phase. Your model needs a regime detector upstream of the formula, not a fatter-tailed distribution stuffed into the formula.

Pattern size is a proxy for the magnitude of the move that follows. John J. Murphy’s The Visual Investor: How to Spot Market Trends observes that the larger a reversal pattern is on the vertical axis (i.e. the higher its realised volatility during formation), the larger the subsequent price potential tends to be. The 2026 algo translation: regimes with elevated realised volatility tend to produce larger directional moves once they break. The σ² parameter in your AS quoter is not a passive scaling constant. It is the regime detector hiding in plain sight, and it should drive position sizing, not just spread width.

The bridge to ML feature engineering. The patterns the chart-reading authors codified — head-and-shoulders, support-resistance, breakout-on-volume — are the categorical labels an ML feature pipeline naturally learns when handed enough OHLCV data. The classical literature was right about what to look for; modern practice replaces the human eye with a feature pipeline. The two traditions are in agreement, and the working algo trader should read both.

§11 — The 2026 outlook

A few opinions, with the caveat that opinions about 2026 will look silly by 2028.

Rust + Python is winning the mid-tier. It will not displace pure C++ at the very top end. It does not need to. The mid-tier — single-digit-engineer shops running market-making, basis, and stat-arb at retail-accessible scale — is where most new HFT firms in 2026 are being founded, and the polyglot stack is durably ahead there. The hiring data supports this: a Python-and-Rust quant is dramatically easier to find than a fluent low-latency C++ engineer in 2026, and the productivity per engineer is higher.

AI agents are crossing into research, not signal generation. The kill-rate from auto-research pipelines is high enough to make agents net-productive at hypothesis generation and parameter screening. They are not good enough to generate live trading signals end-to-end. The narrow path through which an agent contributes to a live system is via human-reviewed strategy code that the agent helped draft — not through autonomous execution. The Anthropic 2026 MCP roadmap’s emphasis on “task primitives” and “enterprise readiness” points the same way: deeper research workflows, audit trails, gated execution. Not agents running the trading floor.

Crypto fragmentation is permanent. CEX, DEX, perp DEX, intent-based protocols, L2s — the venue list will grow, not shrink. This is good for HFT (more arb opportunities) and bad for capital efficiency (more places to manage inventory). Bet on architecture that accepts fragmentation, not on a single venue. The 2025–2026 consolidation of L2 TVL into Base (46.58%) and Arbitrum (30.86%) suggests the on-chain side may consolidate, but the off-chain CEX list will keep diversifying.

MEV continues to evolve, not vanish. The professional MEV searchers are now capital-rich and software-mature. Hobbyist edges on mainnet are gone. The shift is to L2s and to intent-based protocols where the MEV game has different rules. The January 2026 academic finding that “naive heuristics overstate sandwich activity, with the majority of flagged patterns being false positives and the median net return for these attacks being negative” on private-mempool rollups suggests MEV is being structurally compressed, not eliminated. A 2026 HFT shop that touches DEXes needs to understand MEV both as a risk to mitigate (don’t lose to sandwich attacks on your own swaps) and as an opportunity (back-running legitimate price discovery).

Tokenisation of equities and FX. The slow movement of TradFi onto crypto-native rails (RWA tokens, on-chain Treasuries, eventually on-chain equities) means crypto-native HFT infrastructure increasingly has to handle non-crypto flows. The polyglot stack is well-positioned; the venue selection becomes a TradFi question more than a crypto-native one.

The ASIC / FPGA frontier for crypto specifically. For the very thin margins on cross-venue latency arb, hardware acceleration is becoming relevant. FPGAs running on the orderbook decode + signal compute path are real in 2026, but they are unnecessary at the mid-tier — they are a year-five optimisation for a shop that has saturated the polyglot stack’s potential. Don’t buy hardware until you’ve run out of software wins.

Regulation 2026. MiCA in the EU (now in full enforcement) makes some forms of cross-venue MM-with-rebate a compliance question. The US perp-DEX status is unsettled but stable enough to operate. The Binance post-settlement environment has stabilised; the firm is back to growth, with stricter compliance. None of this changes the architecture — it changes the venue list.

Closing thought. The polyglot stack is durable; the venue list is not. Bet on architecture, not on a name. And invest in your test harness — the strategy will be wrong, the harness will tell you.

§12 — Methodology, sources, reading list

Which sections lean on cited literature, and which draw on industry knowledge

| Section | Grounding |
| --- | --- |
| §1 Intro | mixed, lightly cited |
| §2 Strategy taxonomy | grounded (Aldridge, Practical Guide, Time Series Forecasting Using Foundation Models) |
| §3 Polyglot Python + Rust | grounded (Latency: Reduce Delay, Building GenAI Services with FastAPI); was external in v1 |
| §4 Crypto microstructure | mixed (corpus on principles; internet for 2026 fee schedules + L2 data + intents) |
| §5 Market making | heavily grounded (Aldridge, Practical Guide, Stoikov 2024) |
| §6 Other strategies | grounded (Aldridge, Practical Guide, Time Series Analysis with Python Cookbook) |
| §7 MCP / GenAI connectors | mixed (corpus on FastAPI + LLM orchestration + MCP mention; internet for current MCP spec) |
| §8 Backtesting + AI auto-research | grounded (Zakamulin, Chan, TS Cookbook, Foundation Models; internet for CPCV) |
| §9 Production concerns | grounded (Latency, ML Platform Engineering, Mastering Software Architecture, Blue Team Handbook; Practical Guide on kill switch) |
| §10 Classical chart-reading | grounded (Williams, Murphy) |
| §11 Outlook | opinion, with internet citations on L2 data and MEV research |
| §12 Methodology | this section |

Bibliography

Core HFT theory and microstructure:

  1. Aldridge, Irene. High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems. 1st ed. Wiley, 2010.
  2. Aldridge, Irene. High-Frequency Trading: A Practical Guide to Algorithmic Strategies and Trading Systems. 2nd ed. Wiley, 2013.
  3. Chan, Ernest P. Quantitative Trading: How to Build Your Own Algorithmic Trading Business. Wiley, 2008.
  4. Chan, Ernest P. Algorithmic Trading: Winning Strategies and Their Rationale. Wiley, 2013.
  5. Zakamulin, Valeriy. Market Timing With Moving Averages: The Anatomy and Performance of Trading Rules. Palgrave Macmillan, 2017.
  6. Grimes, Adam. The Art and Science of Technical Analysis: Market Structure, Price Action, and Trading Strategies. Wiley, 2012.

Engineering and systems:

  1. Latency: Reduce Delay in Software Systems. Manning, 2024.
  2. Building Generative AI Services with FastAPI. O’Reilly, 2024.
  3. Machine Learning Platform Engineering. Manning, 2024.
  4. Mastering Software Architecture. O’Reilly, 2024.
  5. Architecting AI Software Systems. O’Reilly, 2024.
  6. Rust for Blockchain Application Development. Packt, 2024.
  7. Blue Team Handbook: Incident Response Edition. Created Independently, 2014; reissued.

Machine learning and time-series:

  1. Hands-On Machine Learning with scikit-learn and PyTorch. O’Reilly, 2025.
  2. Time Series Analysis with Python Cookbook. Packt, 2024.
  3. Time Series Forecasting Using Foundation Models. Manning, 2025.
  4. Practical Generative AI with ChatGPT. O’Reilly, 2024.

Classical chart-reading:

  1. Williams, Bill. Trading Chaos: Maximize Profits with Proven Technical Techniques. 2nd ed. Wiley, 2004.
  2. Murphy, John J. The Visual Investor: How to Spot Market Trends. 2nd ed. Wiley, 2009.

Internet sources cited (with access dates)

What this corpus and the cited internet sources do not cover

  1. The exact production parameters of any specific operating shop (γ, k, exact symbol selection, exact venue routing) — these are competitive secrets and are not in the literature.
  2. Sub-microsecond C++ tier-1 specifics (FPGA, ASIC) — covered by industry conference talks, not by the corpus.
  3. Real-time regulatory changes after May 2026 — verify with the relevant venue and jurisdiction at write-time of any production decision.

Short follow-up reading list

If the article was useful, the ten books to read next, in order:

  1. Aldridge — High-Frequency Trading: A Practical Guide (2nd ed.) — the spine
  2. Chan — Algorithmic Trading: Winning Strategies and Their Rationale — backtest discipline
  3. Zakamulin — Market Timing With Moving Averages — walk-forward methodology
  4. Grimes — The Art and Science of Technical Analysis — the human side
  5. Latency: Reduce Delay in Software Systems — production latency engineering
  6. Machine Learning Platform Engineering — deploy, monitor, scale
  7. Building Generative AI Services with FastAPI — the agent layer
  8. Hands-On Machine Learning with scikit-learn and PyTorch — the ML toolkit
  9. Time Series Forecasting Using Foundation Models — when transformers matter
  10. Mastering Software Architecture — patterns that hold the system together

Comments and corrections are welcome. The Avellaneda-Stoikov derivation, the VPIN definition, and the JELLYJELLY incident details are reproduced from the cited sources; if you spot an error against the original papers or news reports, please flag it — algo traders die from undetected formula bugs and from outdated incident summaries.

Vlad Benkovskyi, codefather.dev

This article was originally published on Cryptocurrency Tag and is republished here under RSS syndication for informational purposes. All rights and intellectual property remain with the original author. If you are the author and wish to have this article removed, please contact us at [email protected].
