The test that tests the test
You built a strategy, optimized the parameters, and it looks great on 20 years of data. But here's the uncomfortable truth: you already knew those 20 years when you chose the parameters. The performance is partly real edge and partly hindsight. Walk-forward analysis is the method that separates the two.
The scientific method in trading starts with a partition. Kaufman describes the standard split:
The scientific method begins with in-sample data for testing an idea. Normally, it is about 60% of the available data. Another 20% can be used for validation after settling on the rules and parameters. The remaining 20% will be used for out-of-sample testing, the final step. — Kaufman, Trading Systems and Methods
Three slices: 60% in-sample (develop your idea), 20% validation (fix oversights), 20% out-of-sample (the real exam). The in-sample data is your playground. The validation set is your second chance. The final out-of-sample set is sacred — once you've used it, there's no going back.
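The split itself is mechanical. A minimal sketch, assuming chronologically ordered data (the function and variable names here are illustrative, not from Kaufman):

```python
def partition(data, in_frac=0.6, val_frac=0.2):
    """Split ordered data into in-sample, validation, out-of-sample slices."""
    n = len(data)
    i = int(n * in_frac)                # end of in-sample block
    j = int(n * (in_frac + val_frac))   # end of validation block
    return data[:i], data[i:j], data[j:]

prices = list(range(100))               # stand-in for 100 daily closes
in_sample, validation, out_of_sample = partition(prices)
print(len(in_sample), len(validation), len(out_of_sample))  # 60 20 20
```

The slicing never shuffles: with time series, the splits must stay in chronological order so the out-of-sample block is genuinely in the future relative to the in-sample block.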
Why out-of-sample performance degrades
The first time you run your strategy on unseen data, expect disappointment. Kaufman is blunt about this:
The results from either of the out-of-sample periods will rarely have the same successful profile as the in-sample test. More often, it will post about the same returns but with much higher risk. A drop in the information ratio from 2.0 to 1.0 for the out-of-sample data is not bad, but a negative ratio (an outright loss) is a failure. — Kaufman, Trading Systems and Methods
A 2.0 information ratio in-sample dropping to 1.0 out-of-sample is normal. The strategy still works — its risk-adjusted return simply halves on data it has never seen. An outright loss, though, means the in-sample performance was curve-fit noise.
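The information ratio Kaufman tracks is active return divided by tracking error. A sketch of the calculation, assuming daily returns against a benchmark and a 252-trading-day annualization factor (both assumptions, not specified in the text):

```python
import statistics

def information_ratio(returns, benchmark, periods_per_year=252):
    """Annualized IR: mean active return divided by its volatility."""
    active = [r - b for r, b in zip(returns, benchmark)]
    mean_active = statistics.mean(active)
    tracking_error = statistics.stdev(active)
    return (mean_active / tracking_error) * periods_per_year ** 0.5

# Toy daily strategy returns versus a flat benchmark
strat = [0.010, 0.020, 0.015, 0.005]
bench = [0.005, 0.005, 0.005, 0.005]
print(information_ratio(strat, bench))
```

Computing the same number separately for the in-sample and out-of-sample slices is what lets you quantify the degradation.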
Why the degradation? Because real markets contain patterns your backtest never encountered. A volatility spike in 2008, a zero-rate environment in 2020, a meme-stock squeeze in 2021 — each introduces conditions your parameters weren't tuned for. The bigger the gap between in-sample and out-of-sample performance, the more overfit the strategy.
Walk-forward testing: the rolling version
Static in-sample / out-of-sample splits have a weakness: they only give you one out-of-sample test. Walk-forward testing (also called step-forward testing) solves this by rolling the window forward through time and accumulating out-of-sample results.
Kaufman's procedure:
- Select the total test period — e.g., 30 years of daily data.
- Select the in-sample window size — e.g., 4 years.
- Optimize on the first 4-year block (1988–1991).
- Apply the best parameters to the next year (1992) — this is the out-of-sample period.
- Slide the window forward by one year and repeat (next optimization: 1989–1992, next out-of-sample: 1993).
- Accumulate all out-of-sample periods into a single equity curve.
The final walk-forward equity curve is entirely out-of-sample — every point on it was traded with parameters chosen before that data was seen. If the accumulated curve is profitable after costs, you have evidence (not proof) that the strategy adapts.
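The procedure above can be sketched as a loop. Here `optimize` and `run` are stubs standing in for your own fitting and trading logic; only the window mechanics are the point:

```python
def optimize(in_block):
    """Stub optimizer: pretend the 'best parameter' is the block mean."""
    return sum(in_block) / len(in_block)

def run(oos_block, params):
    """Stub trader: one (year, params) record per out-of-sample year."""
    return [(year, params) for year in oos_block]

def walk_forward(years, in_window=4, step=1):
    """Slide a fixed in-sample window through `years`, accumulating
    only the out-of-sample results."""
    oos_results = []
    start = 0
    while start + in_window + step <= len(years):
        in_block = years[start:start + in_window]
        oos_block = years[start + in_window:start + in_window + step]
        params = optimize(in_block)                  # fit on in-sample only
        oos_results.extend(run(oos_block, params))   # trade with frozen params
        start += step                                # slide forward one step
    return oos_results

results = walk_forward(list(range(1988, 2018)))  # 30 years, as in the example
print(len(results), results[0][0], results[-1][0])  # 26 1992 2017
```

Note the key invariant: parameters used on any out-of-sample year were fitted only on years strictly before it.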
Window sizing matters
Kaufman provides a practical example: in-sample windows of 120 days with out-of-sample periods of 20 days. But he warns about short-term bias:
Step-forward testing can introduce a bias that favors faster trading models. Because the in-sample test used only 4 years of data, a long-term trend model may only post a few trades. Without enough data, testing cannot find which calculation period is best. — Kaufman, Trading Systems and Methods
The fix: make the in-sample window long enough that even slow strategies generate meaningful trade counts. If your system trades monthly, a 2-year in-sample window is too short — you'd only have ~24 trades to evaluate. Increase to 5+ years.
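A quick back-of-envelope check for window sizing, assuming you want at least 30 trades for a meaningful evaluation (that threshold is an assumption for illustration, not Kaufman's number):

```python
import math

def min_window_years(trades_per_year, min_trades=30):
    """Smallest in-sample window (in years) yielding at least min_trades."""
    return math.ceil(min_trades / trades_per_year)

print(min_window_years(12))   # monthly system -> 3 years minimum
print(min_window_years(250))  # daily system -> 1 year is enough
```

In practice you would size the window from the slowest parameter set you intend to test, since that is the one that generates the fewest trades.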
The feedback trap
The most serious concern about step-forward testing is feedback. Each application uses all of the data, a combination of in-sample and out-of-sample. If the test is performed more than once, there is no out-of-sample data, which violates our concept of scientific testing. — Kaufman, Trading Systems and Methods
Walk-forward testing done once is valid. Walk-forward testing done repeatedly — tweaking rules between runs — contaminates the out-of-sample data. Every re-run leaks information from the "unseen" data back into your design choices. The solution: reserve a final holdout set that the walk-forward process never touches.
What to do when out-of-sample disappoints
The validation set (that middle 20%) is your fix-it window. Kaufman's rule:
If the fix is logical, such as a volatility filter that applies to many situations, that's good. If you are changing a stop-loss to avoid a devastating loss in 2008, that's bad. — Kaufman, Trading Systems and Methods
Good fixes are general — adding a volatility filter, capping position size during regime changes, widening stops in high-ATR environments. Bad fixes are specific — carving out a rule to avoid one particular drawdown. General fixes improve robustness; specific fixes are overfitting with extra steps.
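The distinction can be made concrete. A sketch contrasting the two kinds of fix (the ATR cap and the date range are illustrative values, not from the text):

```python
def general_fix(signal, atr, atr_cap=2.5):
    """General: stand aside whenever volatility exceeds a cap —
    applies to any high-volatility regime, past or future."""
    return 0 if atr > atr_cap else signal

def specific_fix(signal, date):
    """Specific (overfit): carve out one bad historical period —
    contributes nothing to future trades."""
    return 0 if "2008-09" <= date <= "2008-12" else signal

print(general_fix(1, atr=3.0))            # flat: volatility too high
print(specific_fix(1, date="2015-01"))    # rule is inert outside its window
```

The first rule would have been triggered in 2008 and will trigger again in the next crisis; the second only ever "fixes" the backtest.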
Integrity checklist
Kaufman's testing integrity principles, distilled:
- More data is always better. Long test periods include bull, bear, and sideways markets plus price shocks of various sizes.
- Cross-market validation. A trend strategy profitable on the S&P should also work (less perfectly) on the NASDAQ. If it profits on one and loses on the other, something is wrong.
- Never touch the final out-of-sample data twice. Once used, it's contaminated.
- Watch for structural changes. A company that evolved from manufacturing to finance (like GE) may have data that no longer represents its current behavior.
Quick check
You split 20 years of data into 60/20/20 in-sample/validation/out-of-sample. Your strategy has a 2.1 information ratio in-sample and 0.9 out-of-sample. What's the verdict? (Roughly the normal degradation Kaufman describes — the ratio halved but stayed well positive, so the strategy passes.)
What you now know
- 60/20/20 is the standard data partition — in-sample, validation, out-of-sample.
- Expect out-of-sample performance to degrade — an information ratio drop from 2.0 to 1.0 is normal; negative is failure.
- Walk-forward testing rolls the optimization window forward, building an equity curve entirely from out-of-sample segments.
- The in-sample window must be long enough that slow strategies generate enough trades — too short introduces short-term bias.
- Never re-run walk-forward tests after peeking at results — that's feedback contamination.
- Good fixes are general (volatility filter); bad fixes are specific (avoiding one date).
Next: Gap Trading — the four gap types measured by Murphy and Edwards & Magee, with Bulkowski's statistical fill rates to separate the actionable from the noise.