I Built an ML Trading System That Learned to Detect Its Own Lies

By Jose Pablo Garcia Meza · Published April 20, 2026 · 11 min read · Source: Trading Tag

A complete pipeline for predicting SPY movements, and the validation framework that proved it had no real edge

Most ML trading projects end one of two ways: either the author declares victory with impressive-looking backtests, or the project quietly disappears. This one ends differently.

ML Financial Predictor v3.0 is a complete, production-grade machine learning pipeline for predicting whether SPY (S&P 500 ETF) will close higher 10 trading days from now. It uses institutional-grade methodology: chronological splits with purging gaps, Average Uniqueness sampling, Monte Carlo significance testing, and a no-overlap backtesting engine.

The models produced a Sharpe Ratio of 1.82, a 74% win rate, and an equity curve that looked genuinely compelling.

They also have zero real predictive edge.

This is the story of how the system proved it, and why that’s the actual result worth publishing.

The Problem Setup

The task is binary classification on a financial time series: the label for day t is 1 if SPY's close 10 trading days ahead, P(t+10), exceeds the close P(t), and 0 otherwise.

Why this formulation? A 10-day horizon sits in a useful zone: long enough to avoid microstructure noise, short enough to be actionable, and tractable enough that technical indicators carry some evidence of predictive value in the academic literature. The output is a probability estimate P̂(y=1 | Xₜ) that gets thresholded at 0.55 to generate discrete signals.
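The label construction and the thresholding step can be sketched in a few lines of pandas. The function names and the BUY/HOLD mapping below are illustrative, not the project's actual code:

```python
import pandas as pd

def make_labels(close: pd.Series, horizon: int = 10) -> pd.Series:
    """Binary target: 1 if the close `horizon` rows ahead is higher."""
    future = close.shift(-horizon)
    y = (future > close).astype(float)
    y[future.isna()] = float("nan")  # the last `horizon` rows have no label
    return y

def to_signal(p_up: float, threshold: float = 0.55) -> str:
    """Threshold the probability estimate P(y=1 | X_t) into a discrete signal."""
    return "BUY" if p_up >= threshold else "HOLD"
```

In live use, `horizon` would be counted in market days, matching the 10-day target described above.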

The model was trained on panel data from five tickers: AAPL, MSFT, GOOGL, JPM, and SPY, representing roughly 21% of the S&P 500 by weight.

The Pipeline

The system is structured as six independent, reproducible modules:


yfinance (OHLCV)
└─► Feature Engineering # 26 technical indicators + shift(1)
└─► Temporal Split # Purging (10 market days)
└─► Models # Random Forest + XGBoost
└─► Signals # Threshold 0.55 → BUY / HOLD
└─► Backtest # No-overlap (fixed horizon)
└─► Statistical Evaluation (Monte Carlo)

The Temporal Split

This is the most critical design decision in the system. Standard train_test_split is not just suboptimal for time series; it actively introduces data leakage. To mitigate this, I implemented the structure detailed in Table 1, which utilizes strict market-day gaps.

Table 1: Temporal Split Schema with Strict Purging Gaps (h=10).

The purging gap exists because with a 10-day target horizon, the last training label uses price P(t+10), which overlaps with prices already in the validation window. Without the gap, the model is technically being evaluated on data it partially trained on.

The Test set is never touched until final evaluation. Not for debugging. Not for hyperparameter intuition. One use, ever.
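A purged chronological split of this kind is straightforward to implement. The sketch below uses illustrative fractions (70/15/15) and row indices; the article only specifies the 10-market-day gap, so the rest is an assumption:

```python
import numpy as np

def purged_split(n: int, train_frac: float = 0.70, val_frac: float = 0.15,
                 gap: int = 10):
    """Chronological train/val/test index arrays with `gap` rows dropped
    between blocks, so labels that span the boundary cannot leak."""
    train_end = int(n * train_frac)
    val_start = train_end + gap            # purge `gap` rows after train
    val_end = val_start + int(n * val_frac)
    test_start = val_end + gap             # purge `gap` rows after validation
    train = np.arange(0, train_end)
    val = np.arange(val_start, val_end)
    test = np.arange(test_start, n)
    return train, val, test
```

The gap must be at least the label horizon h; here both are 10, so the last training label's P(t+10) never overlaps the validation window.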

Feature Engineering: 26 Indicators With a Causality Guarantee

All features are technical: momentum, volatility, moving averages, RSI, MACD, Bollinger Bands, ATR, relative volume, and 52-week positioning. The implementation detail that matters most is the global shift(1) applied after all features are computed:


# After all 26 features are calculated:
feature_df = feature_df.shift(1)

On day t, the model only sees features computed through close of t-1. It’s a blunt instrument, but it’s ironclad. There’s no per-indicator shift that someone can forget to apply to one column six months later.

Average Uniqueness: The Overlap Problem

Here’s a subtlety most practitioners skip. With a 10-day prediction horizon, labels overlap: the target at day t uses P(t+10), and the target at t+1 uses P(t+11). They share 9 out of 10 underlying data points. The genuinely independent information content in a dataset of n observations is approximately n/10.

This is the Average Uniqueness concept from López de Prado: when every label spans h periods, the expected uniqueness of each observation is roughly 1/h, or about 0.1 at h=10.

The practical implication: each tree in the ensemble samples 10% of observations, not the standard 80%. The same fraction is applied identically to both models.

Both models also use max_depth=2, the smallest meaningful tree structure, deliberately biased toward simplicity.

The Results That Looked Good

Equity Curve — Bull Market Test Period (2023–2026)

Figure 1: Comparative Equity Curve: Random Forest Strategy vs. SPY Benchmark (2023–2026).

The backtest simulates real execution under one hard constraint: only one position open at a time. A BUY signal on day t opens a position that closes exactly 10 trading days later. Any new signals during those 10 days are ignored. Without this constraint, overlapping positions create synthetic leverage and the illusion of a strategy that doesn’t exist.
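The no-overlap rule reduces to a single scan with a "busy until" pointer. A minimal sketch, with illustrative names (the real engine presumably also handles costs and sizing):

```python
import numpy as np

def no_overlap_backtest(signals, prices, horizon: int = 10):
    """One position at a time: a BUY on day t opens a trade closed exactly
    `horizon` bars later; signals fired while a position is open are ignored."""
    returns = []
    busy_until = -1
    for t, sig in enumerate(signals):
        if sig == "BUY" and t > busy_until and t + horizon < len(prices):
            returns.append(prices[t + horizon] / prices[t] - 1.0)
            busy_until = t + horizon  # block new entries for `horizon` bars
    return np.array(returns)
```

Dropping the `busy_until` check is exactly what creates the synthetic leverage described above: overlapping 10-day positions stack returns that no single-position account could earn.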

Under these conditions, the strategy produced the results shown in Figure 1. At first glance, the metrics are outstanding: a Sharpe of 1.82 and a 74.36% win rate over 39 trades.

Monte Carlo Significance Test: The Only Fair Comparison

Before looking at the results, a note on baselines that often gets skipped.

The standard comparison for a trading strategy is buy-and-hold: “did the model beat SPY?” But that comparison is structurally unfair here. A buy-and-hold strategy is invested every single day of the period. This model fires roughly 39 trades over 3 years, each lasting 10 days. They’re operating at completely different frequencies. Comparing their Sharpe Ratios is like comparing the sprint time of a marathoner and a 100m runner.

The correct baseline is a random agent that operates under the same constraints: fires signals with uniform probability, applies the same no-overlap rule, and holds positions for exactly 10 market days. To build the empirical null distribution, 1,000 random agents were simulated. The resulting distributions for both models are visualized in Figure 2, while the specific statistical breakdowns are summarized in Table 2.
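Building that empirical null is mechanical. The sketch below is one way to do it, under stated assumptions: a 10% per-day firing probability (the article does not specify the agents' rate) and per-trade Sharpe annualized by trade frequency:

```python
import numpy as np

def random_agent_sharpe(prices, horizon=10, p_buy=0.1, rng=None):
    """One random agent: fires BUY with uniform probability, obeys the same
    no-overlap rule, and holds for exactly `horizon` bars."""
    rng = rng if rng is not None else np.random.default_rng()
    rets, busy_until = [], -1
    for t in range(len(prices) - horizon):
        if t > busy_until and rng.random() < p_buy:
            rets.append(prices[t + horizon] / prices[t] - 1.0)
            busy_until = t + horizon
    rets = np.asarray(rets)
    if len(rets) < 2 or rets.std() == 0:
        return 0.0
    # Assumption: annualize per-trade Sharpe by trades-per-year at this horizon.
    return float(rets.mean() / rets.std() * np.sqrt(252 / horizon))

def null_distribution(prices, n_agents=1000, **kwargs):
    """Empirical null: Sharpe Ratios of `n_agents` random agents."""
    rng = np.random.default_rng(0)
    return np.array([random_agent_sharpe(prices, rng=rng, **kwargs)
                     for _ in range(n_agents)])
```

The model's Sharpe is then scored against this distribution: the empirical p-value is simply the fraction of random agents that did at least as well.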

Figure 2: Left: Random Forest Monte Carlo Results. Right: XGBoost Monte Carlo Results.

Bull Market (2023–2026), Test Set

Table 2: Monte Carlo Significance Test Results (Z-Score and P-Value Analysis).

This is the central statistical finding of the project.

Random Forest’s Sharpe of 1.82 looks impressive in isolation. But the random agents operating every 10 days in the same bull market achieve a mean Sharpe of 1.31. The gap between 1.82 and 1.31 sounds meaningful, until you compute the Z-score of 0.58 and p-value of 0.28. That difference is well within the noise floor of what a purely random 10-day trading agent would produce in this environment.
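The arithmetic is easy to sanity-check with a normal approximation. Note two assumptions: the article's p-value comes from the empirical simulation, not a normal fit, and the null standard deviation of ~0.88 below is back-solved from the reported Z-score rather than stated in the text:

```python
from math import erf, sqrt

sharpe_model = 1.82      # Random Forest on the test period
sharpe_null_mean = 1.31  # mean Sharpe of the 1,000 random agents
null_std = 0.88          # assumption: implied by the reported Z, not stated directly

z = (sharpe_model - sharpe_null_mean) / null_std
p = 0.5 * (1 - erf(z / sqrt(2)))  # one-sided p-value under a normal null

print(round(z, 2), round(p, 2))  # prints 0.58 0.28, matching the reported figures
```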

This is the result the equity curve was hiding. The model’s returns are not evidence of signal. They’re evidence that any strategy trading long positions every 10 days during 2023–2026 would have posted strong Sharpe Ratios, because the underlying asset appreciated ~70% of the time.

Where the Monte Carlo Verdict Leads

The Monte Carlo test closes the loop on the main statistical question. Two additional diagnostics complete the picture.

The Conviction Stress Test

The Monte Carlo operates on Sharpe Ratios, a distributional metric. To check the model’s classification signal specifically, I built a direct precision comparison against a random agent, as shown in Table 3, but now measuring hit rate at threshold >0.55 rather than Sharpe.

Table 3: Conviction Stress Test: Trained Model Precision vs. Equivalent Random Agent.

Random Forest lands 1.17 percentage points above the equivalent random agent: a positive edge, but a razor-thin one. At this margin, with only 39 trades over the test period, the difference is well within sampling noise. The Monte Carlo result (p=0.28) already told us the return distribution isn’t statistically distinguishable from randomness; the signal precision is consistent with that verdict.
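The "within sampling noise" claim follows from the binomial standard error at this sample size. A quick back-of-the-envelope check:

```python
from math import sqrt

n_trades = 39
edge_pp = 1.17  # model precision minus random-agent precision, in percentage points

# Standard error of a hit rate near 50% with 39 trades, in percentage points:
se_pp = sqrt(0.5 * 0.5 / n_trades) * 100
print(round(se_pp, 1))  # prints 8.0: one standard error is ~8 pp
```

A 1.17 pp edge against an ~8 pp standard error is roughly 0.15 sigma, nowhere near distinguishable from chance.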

What makes this particularly interesting is the contrast with XGBoost, which underperformed the random Sharpe mean in the Monte Carlo. That inversion between the two models deserves a hypothesis.

Why did the “worse” model do better?

The instinctive read is that Random Forest outperforming XGBoost here is anomalous, because XGBoost is the more powerful algorithm. But in a low signal-to-noise environment like financial time series, model capacity can work against you.

XGBoost is a sequential boosting algorithm: each tree is built to correct the residual errors of the previous one. In a domain with genuine signal, this error-correction mechanism is its greatest strength. In a domain that is largely noise, those “residual errors” are themselves noise, and the model dutifully learns to fit them. The result is a model that has chased patterns that don’t exist and, in the test period, pays for it with below-random Sharpe performance.

Random Forest, by contrast, builds trees independently and averages their outputs. It has no error-correction mechanism across trees, which is precisely why it can’t overfit to sequential noise patterns the way boosting can. Its higher structural bias turns out to be a form of protection. The model is too simple to learn the specific noise fingerprint of the training period, so it degrades more gracefully into approximate randomness rather than active anti-performance.

This is speculative: 39 trades is not enough to be certain. But it is consistent with what the bias-variance tradeoff predicts in near-zero-signal environments. Lower capacity models fail more quietly. XGBoost doesn’t just fail to find signal; it finds anti-signal. Random Forest fails to find signal and stops there.

The Year-by-Year Decomposition

Aggregating accuracy over 2023–2026 obscures important regime shifts. When broken into calendar years in Table 4, the 70% precision in 2024 is revealed not as a success, but as evidence of overfitting.

Table 4: Annual Precision Breakdown: Identifying Performance Decay and Regime Dependency.

A model that hits 70% out-of-sample and then collapses to 33.3% when the regime shifts didn’t learn rules about markets. It memorized the idiosyncratic conditions of the 2024 AI rally.

33.3% is not “below benchmark.” A random binary classifier has an expected accuracy of 50%. The model is actively wrong more often than chance in Q1 2026. That’s the temporal signature of severe overfitting.

The aggregate precision masks a swing from 69.7% to 33.3% across three years of supposedly out-of-sample data. It technically summarizes the period, but it tells you nothing about whether the model would survive a regime change, which is the only question that matters before deployment.
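The decomposition itself is a one-liner in pandas. A sketch, assuming a hypothetical trade log with a datetime column and a binary win flag:

```python
import pandas as pd

def precision_by_year(trades: pd.DataFrame) -> pd.Series:
    """Yearly hit rate of BUY trades. Expects columns 'date' (datetime)
    and 'win' (1 if the trade closed higher, else 0)."""
    return trades.groupby(trades["date"].dt.year)["win"].mean()
```

The point is that this costs one line and would have exposed the 2024-to-2026 collapse long before any aggregate metric did.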

Why Standard Metrics Created a False Narrative

The pattern is consistent: every metric that looked good was being compared against the wrong baseline. Table 5 provides a final audit of these metrics, contrasting conventional interpretations with the reality revealed by our validation framework.

Table 5: Metric Audit: Why Standard Performance Indicators Created a False Narrative.

What the Project Is Actually Worth

The prediction system has no real edge. That’s the honest conclusion.

But the validation pipeline does.

A system capable of detecting its own false positives before they become real capital losses has genuine methodological value, regardless of whether the underlying model works.

The trifecta of validation that made this visible:

  1. Rigorous backtesting with realistic constraints (no overlap, honest position sizing)
  2. Conviction Stress Test against an equivalent random agent, not just a Monte Carlo Sharpe comparison
  3. Year-by-year precision decomposition to expose regime dependency hidden by aggregation

None of these individually would have caught everything. The combination catches it all. This triad should be the minimum viable validation stack for any ML financial system before production, and arguably a useful template for any ML evaluation where you care about the real world, not just the held-out set.

The Broader Lesson

The transferable insight from this project isn’t “be careful with data leakage.” That’s too specific.

It’s this: the comparison baseline you choose determines what your metric means.

A Sharpe of 1.82 is a Sharpe of 1.82 relative to the risk-free rate and your own volatility. Compared to a random agent operating every 10 days in the same market, it’s 0.51 points above noise and not statistically distinguishable from it. A model edge of +1.17 pp on precision sounds positive until you remember there are only 39 trades behind it.

Every impressive number is impressive compared to something. The work is making sure that something is actually the right reference. This applies to A/B testing, model evaluation, business KPIs, and most empirical claims you’ll encounter outside of finance too.

What’s Next

The roadmap is clear and hierarchically prioritized:

First, the glass ceiling: fractional differentiation to make the price series stationary while preserving long-memory properties, and feature clustering to eliminate the multicollinearity between 26 indicators that are largely transformations of the same close price.
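For reference, the fractional-difference operator (1 − B)^d is defined by a simple weight recursion (this is the standard construction from López de Prado, not code from this project):

```python
import numpy as np

def fracdiff_weights(d: float, size: int) -> np.ndarray:
    """Weights of (1 - B)^d: w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k.
    d=1 recovers ordinary differencing; 0 < d < 1 keeps long memory."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)
```

At d=1 the weights are [1, −1, 0, …] (plain returns); fractional d in between trades off stationarity against memory, which is the whole appeal for price series.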

Then regime-awareness. Once the data is clean and features are genuinely diverse, hidden Markov models or clustering on VIX + momentum to activate different models per regime.

Then scale. Expand from 5 tickers to 50–100 S&P 500 components for robust Monte Carlo power.

References

Source code & MLflow experiments:

https://github.com/pablogarcia1/ml-financial-predictor

