I built a Bitcoin price forecasting model that achieves 0.55% error: here is everything I learned
Arber Zylyftari
A full walkthrough of training a stacked LSTM on 2 million minutes of Coinbase data, the mistakes that cost me hours of debugging, and how I fixed them.
When I started this project I thought the hard part would be the neural network. It turned out the hard part was the data, and every mistake I made in preprocessing showed up later as a completely wrong model.
This is the full story of how I built a BTC price forecasting system from scratch, what broke along the way, and what I would do differently.
The dataset
I used the Coinbase BTC/USD dataset spanning December 2014 to January 2019: one row per 60-second trading window, just over 2 million rows total. On the surface it looks clean. Open, High, Low, Close, Volume, Weighted_Price, and a timestamp. Simple enough.
The first thing I did was audit the missing data. What I found stopped me immediately.
109,069 rows had every single column as NaN at the same time. Not just one or two columns; all seven. Zero partial-NaN rows. This told me these were not corrupted records: they were minutes where nobody traded on Coinbase at all. The exchange was just quiet.
The second thing I found was more subtle. When I reindexed the DataFrame against a complete 60-second DatetimeIndex, I discovered a further 58,354 timestamps that were absent from the CSV entirely: not NaN rows, just missing rows that were invisible until I looked for them. If I had forward-filled without reindexing first, those gaps would have stayed hidden.
The longest consecutive gap in the data was 38.4 hours. Forward-filling across that produces 2,303 identical rows in a row, a completely flat line that the model would learn as a genuine price signal. I trimmed everything before 2017-01-01 to get rid of these artifacts entirely.
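A minimal sketch of that audit, assuming the standard Kaggle export with a unix-seconds Timestamp column (the filename is a placeholder):

```python
import pandas as pd

df = pd.read_csv("coinbaseUSD_1-min_data.csv")  # filename is a placeholder
df["Timestamp"] = pd.to_datetime(df["Timestamp"], unit="s")
df = df.set_index("Timestamp")

# Rows where EVERY column is NaN: minutes where nobody traded at all.
all_nan_rows = df.isna().all(axis=1).sum()

# Reindex against a complete 60-second grid to expose the timestamps
# that are missing from the CSV entirely (invisible until you look).
full_index = pd.date_range(df.index.min(), df.index.max(), freq="60s")
df = df.reindex(full_index)

# Trim away the sparse early years with multi-hour dead zones,
# then forward-fill only the short gaps that remain.
df = df.loc["2017-01-01":].ffill()
```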
Feature selection
The correlation matrix made the feature decision obvious. Open, High, Low, and Weighted_Price are all correlated at essentially 1.0 with Close. There is no independent information in any of them; they are all measuring the same thing in the same minute. I dropped all four.
The only genuinely independent feature in the dataset is Volume_(BTC), which has a correlation of just 0.15 with price. I kept it but applied a log1p transform first, because the raw distribution is extremely right-skewed with massive outlier spikes.
For time I used cyclic sin/cos encoding rather than raw integers. The reason is simple: hour 23 and hour 0 are one minute apart, but if you encode them as the integers 23 and 0 the model thinks they are as far apart as possible. Sin/cos wraps correctly.
Final feature set: Close, log1p(Volume), hour_sin, hour_cos, dow_sin, dow_cos. Six features total.
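A sketch of the feature construction, using the Volume_(BTC) column name from the dataset; the DatetimeIndex comes from the audit step above:

```python
import numpy as np
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame(index=df.index)
    out["close"] = df["Close"]
    # log1p tames the extreme right skew and the outlier volume spikes.
    out["log_volume"] = np.log1p(df["Volume_(BTC)"])
    hour = df.index.hour
    dow = df.index.dayofweek
    # Cyclic encoding: hour 23 and hour 0 land next to each other
    # on the unit circle instead of 23 integer units apart.
    out["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    out["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    out["dow_sin"] = np.sin(2 * np.pi * dow / 7)
    out["dow_cos"] = np.cos(2 * np.pi * dow / 7)
    return out
```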
The data pipeline
One problem I ran into early was memory. A window size of 1,440 steps means each training sample is a (1440, 6) array. Pre-computing all possible windows from 698,000 rows would require roughly 27 GB of RAM, far more than Colab can handle.
The solution was a tf.data.Dataset with a sliding window and shift=5. This streams one window every 5 minutes directly from the scaled arrays without materialising anything in memory. Peak memory during training stayed under 2 GB.
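A minimal sketch of that streaming pipeline; the forecast HORIZON and the batch size are my assumptions, not values stated above:

```python
import numpy as np
import tensorflow as tf

WINDOW_SIZE = 1440  # 24 hours of 1-minute history per sample
HORIZON = 1         # forecast horizon in steps (assumed)

def make_dataset(scaled: np.ndarray, shuffle: bool = False) -> tf.data.Dataset:
    """Stream (WINDOW_SIZE, 6) windows, one every 5 minutes, without
    materialising the ~27 GB of pre-computed windows in RAM."""
    total_length = WINDOW_SIZE + HORIZON
    ds = tf.data.Dataset.from_tensor_slices(scaled.astype("float32"))
    ds = ds.window(total_length, shift=5, drop_remainder=True)
    ds = ds.flat_map(lambda w: w.batch(total_length))
    # Inputs are the first 1,440 steps; the target is the Close column
    # (index 0 in the feature order above) at the final step.
    ds = ds.map(lambda w: (w[:WINDOW_SIZE], w[-1, 0]))
    if shuffle:  # kept False throughout, see the splits section below
        ds = ds.shuffle(1_000)
    return ds.batch(64).prefetch(tf.data.AUTOTUNE)  # batch size assumed
```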
The splits
I made a serious mistake on my first training run. I set the test period to run all the way to January 2019, which includes the full BTC crash from $6,500 down to $3,500. The model had never seen prices below $6,000 during training, so every prediction came out around $6,200 regardless of what was actually happening. MAE was $608.
The fix was setting a TEST_END cutoff at August 2018, keeping all three splits within the 2017-2018 bull market regime where price behaviour was consistent. The model is not being asked to extrapolate into a regime it has never seen.
No shuffling at any stage. Shuffling a time series leaks future price information into the training set.
The scaler is fitted on training data only. Fitting on the full dataset would leak the price range of the validation and test periods: the model would indirectly know future price levels before ever encountering them.
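A sketch of the split-then-scale order, assuming a scikit-learn MinMaxScaler and illustrative split boundaries (only the August 2018 TEST_END comes from the article):

```python
from sklearn.preprocessing import MinMaxScaler

# Chronological splits, no shuffling. Dates are illustrative except
# TEST_END, which the article places at August 2018.
train = features.loc[:"2018-03-31"]
val = features.loc["2018-04-01":"2018-05-31"]
test = features.loc["2018-06-01":"2018-08-01"]  # TEST_END

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # fit on training data ONLY
val_scaled = scaler.transform(val)          # reuse the training min/max
test_scaled = scaler.transform(test)        # no peek at future ranges
```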
The model
Input → (1440, 6)
LSTM(128, return_sequences=True)
Dropout(0.2)
LSTM(64, return_sequences=True)
Dropout(0.2)
LSTM(32, return_sequences=False)
Dropout(0.1)
Dense(1)
Three LSTM layers with Dropout between each one. The first two return the full sequence so each layer can learn at a different level of temporal abstraction. The third collapses to a single vector before the Dense output.
Optimizer: AdamW with learning_rate=0.0005 and weight_decay=1e-4. I added ReduceLROnPlateau, which halves the learning rate whenever validation loss stops improving, and EarlyStopping with patience=7, which restores the best weights automatically.
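Putting the architecture and training setup together, a minimal Keras sketch; the loss function and the ReduceLROnPlateau patience are assumptions, not stated above:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(1440, 6)),
    keras.layers.LSTM(128, return_sequences=True),
    keras.layers.Dropout(0.2),
    keras.layers.LSTM(64, return_sequences=True),
    keras.layers.Dropout(0.2),
    keras.layers.LSTM(32, return_sequences=False),
    keras.layers.Dropout(0.1),
    keras.layers.Dense(1),
])

model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=0.0005, weight_decay=1e-4),
    loss="mse",       # assumed; the article only reports MAE in dollars
    metrics=["mae"],
)

callbacks = [
    # factor=0.5 matches "halves the learning rate"; patience=3 is assumed.
    keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3),
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=7,
                                  restore_best_weights=True),
]
```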
Results
The predicted and actual price lines track each other very closely across the entire test period. The prediction error distribution is centred just above zero with a mean of $31.16, meaning the model has a small positive bias: it slightly over-predicts on average. MAE landed at $38.44 on an asset with a mean price of $6,933. That is 0.55% error.
The scatter plot tells the same story: the predictions track the diagonal tightly across the full price range tested, with no obvious systematic bias at any particular price level.
The progression that got me here
My first run gave $608 MAE: distribution shift from evaluating on the 2018 crash. My second run gave $91 after fixing the splits. My third run dropped to $32 after moving to a 3-layer architecture with AdamW. The final model, after a necessary retrain, sits at $38.44, consistent with run 3.
What I would do differently
The biggest thing I would change is predicting percentage returns rather than absolute price. A model trained on $6,000-$15,000 prices simply cannot generalize to $3,500 because it has never seen those values. Returns are stationary: a 0.1% move looks the same at any price level. This is the standard approach in production financial time series models.
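As a sketch, the target transformation is a one-liner (df and the Close column as above):

```python
import numpy as np

close = df["Close"].to_numpy()
# Percentage return: r_t = (p_t - p_{t-1}) / p_{t-1}
returns = np.diff(close) / close[:-1]
# A model trained on returns predicts r_hat; the price forecast is then
# recovered as p_hat = last_price * (1 + r_hat), at any price level.
```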
I would also experiment with attention mechanisms. LSTMs process sequences step by step and can struggle to connect signals that are far apart in time. A transformer-based architecture would handle the 1,440-step window differently and might find patterns that the LSTM misses.
Live demo
The full project is deployed at btc-price-forecasting-arberzylyftari.vercel.app.
The site runs the actual trained model in the browser using TensorFlow.js, loads held-out 2018 test windows, and lets you replay predictions at any date and time within the test period.
All three notebooks and the full source are on GitHub: https://github.com/arberzylyftari/btc-price-forecasting
Thank You For Reading