walk-forward-validation

Walk-forward validation framework for trading strategies and ML models with time-series-aware splits, overfit detection, and regime-aware validation

by agiprolabs

Best use case

walk-forward-validation is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using walk-forward-validation should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/walk-forward-validation/SKILL.md --create-dirs "https://raw.githubusercontent.com/agiprolabs/claude-trading-skills/main/skills/walk-forward-validation/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/walk-forward-validation/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How walk-forward-validation Compares

| Feature / Agent | walk-forward-validation | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Walk-forward validation framework for trading strategies and ML models with time-series-aware splits, overfit detection, and regime-aware validation

Where can I find the source code?

The source code is on GitHub in the agiprolabs/claude-trading-skills repository, under skills/walk-forward-validation.

SKILL.md Source

# Walk-Forward Validation

Walk-forward validation framework for trading strategies and ML models. Standard cross-validation (k-fold, random splits) fails catastrophically for financial time series because it introduces lookahead bias and ignores autocorrelation. This skill covers proper time-series validation techniques including rolling and expanding windows, purged cross-validation, combinatorial purged cross-validation (CPCV), and overfit detection metrics.

## Why Standard Cross-Validation Fails

Standard k-fold CV assumes data points are independent and identically distributed (IID). Financial time series violate both assumptions:

1. **Lookahead bias** — Random splits let the model train on future data and predict past data, artificially inflating performance.
2. **Autocorrelation** — Adjacent observations are correlated. A random split that puts Monday in test and Tuesday in train leaks information.
3. **Regime dependence** — Markets shift between regimes. A model trained on a bull market and tested on a bull market tells you nothing about bear market performance.
4. **Label overlap** — If labels are computed over windows (e.g., 24h forward return), adjacent train/test samples share label computation periods, leaking information.
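
The lookahead problem in point 1 can be made concrete with a small sketch. It uses only the standard library; the helper name `count_future_train_samples` and the 80/20 split are illustrative assumptions, not part of the skill's scripts.

```python
import random


def count_future_train_samples(train_idx: list[int], test_idx: list[int]) -> int:
    """Count training samples that occur after the earliest test sample."""
    first_test = min(test_idx)
    return sum(1 for i in train_idx if i > first_test)


n = 100
indices = list(range(n))  # index order is assumed to equal time order

# Shuffled 80/20 split, as a random (k-fold style) split would produce.
rng = random.Random(0)
shuffled = indices[:]
rng.shuffle(shuffled)
random_train, random_test = shuffled[:80], shuffled[80:]

# Chronological 80/20 split: every training sample precedes the test set.
chrono_train, chrono_test = indices[:80], indices[80:]

future_in_random = count_future_train_samples(random_train, random_test)
future_in_chrono = count_future_train_samples(chrono_train, chrono_test)
print(f"future samples in train: random={future_in_random}, chronological={future_in_chrono}")
```

Any non-zero count for the random split means the model was fit on observations that postdate part of its test data.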

## Walk-Forward Framework

### Rolling Window (Fixed Train Size)

The train window has a fixed size and slides forward in time. This is preferred when you believe older data is less relevant (common in crypto).

```
Window 1: [===TRAIN===][=TEST=]
Window 2:    [===TRAIN===][=TEST=]
Window 3:       [===TRAIN===][=TEST=]
```

**Parameters:**
- `train_size`: Number of bars/days in the training window
- `test_size`: Number of bars/days in the test window
- `step_size`: How far to advance between folds (often equals `test_size`)
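
A minimal sketch of how these three parameters could generate folds, assuming samples are indexed in time order. The function name `rolling_splits` is hypothetical and is not claimed to match `scripts/walk_forward.py`.

```python
from typing import Iterator


def rolling_splits(
    n_samples: int, train_size: int, test_size: int, step_size: int
) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges for a fixed-size sliding train window."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    start = 0
    # Stop once a full train window plus a full test window no longer fits.
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step_size


folds = list(rolling_splits(n_samples=20, train_size=10, test_size=4, step_size=4))
for train, test in folds:
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

With `step_size == test_size`, the test windows tile the data without overlapping, so every out-of-sample bar is scored exactly once.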

### Expanding Window (Growing Train)

The train window starts at the beginning and expands forward. This uses all available historical data, which helps when data is scarce.

```
Window 1: [==TRAIN==][=TEST=]
Window 2: [====TRAIN====][=TEST=]
Window 3: [======TRAIN======][=TEST=]
```

**Parameters:**
- `min_train_size`: Minimum training samples before first fold
- `test_size`: Fixed test window size
- `step_size`: How far to advance between folds
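
The expanding variant can be sketched the same way, again assuming time-ordered sample indices. The function name `expanding_splits` is hypothetical.

```python
from typing import Iterator


def expanding_splits(
    n_samples: int, min_train_size: int, test_size: int, step_size: int
) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges where train always starts at index 0."""
    if step_size <= 0:
        raise ValueError("step_size must be positive")
    train_end = min_train_size
    # The train window grows by step_size each fold; the test size stays fixed.
    while train_end + test_size <= n_samples:
        yield range(0, train_end), range(train_end, train_end + test_size)
        train_end += step_size


folds = list(expanding_splits(n_samples=20, min_train_size=8, test_size=4, step_size=4))
for train, test in folds:
    print(f"train 0-{train.stop - 1} ({len(train)} samples), test {test.start}-{test.stop - 1}")
```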

### Choosing Between Them

| Factor | Rolling | Expanding |
|---|---|---|
| Data recency | Prioritizes recent data | Uses all history |
| Regime changes | Better adapts to new regimes | May dilute recent regime |
| Sample size | Fixed, may be small | Grows over time |
| Crypto preference | Preferred for < 6mo horizons | Better for regime-stable models |

## Purging and Embargo

### Purging

Remove training samples whose labels overlap with the test set's time range. If a label is computed as the 24h forward return starting at time `t`, any training sample where `t + 24h` extends into the test period must be purged.

```python
def purge_train_indices(
    train_idx: list[int],
    test_start: int,
    label_horizon: int,
    timestamps: list[int],
) -> list[int]:
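    """Remove train samples whose label windows overlap the test period.

    `label_horizon` must be in the same units as `timestamps` (e.g., seconds).
    Assumes every training sample precedes the test set, as in walk-forward splits.
    """
    test_start_time = timestamps[test_start]
    return [
        i for i in train_idx
        # Keep a sample only if its label window ends before the test starts.
        if timestamps[i] + label_horizon < test_start_time
    ]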
```

### Embargo

Add a buffer gap between the end of training and start of testing to account for serial correlation that purging alone does not eliminate.

```
[===TRAIN===][--EMBARGO--][=TEST=]
```

Typical embargo sizes:
- **1-minute bars**: 60–240 bars (1–4 hours)
- **5-minute bars**: 12–48 bars (1–4 hours)
- **Hourly bars**: 6–24 bars (6–24 hours)
- **Daily bars**: 2–5 bars (2–5 days)
- **Crypto rule of thumb**: Embargo >= 2x the label computation horizon
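
A minimal sketch of the embargo shown in the diagram above, assuming bar indices in time order and a training set that precedes the test set. The function name `apply_embargo` is hypothetical.

```python
def apply_embargo(train_idx: list[int], test_start: int, embargo_size: int) -> list[int]:
    """Drop training samples within `embargo_size` bars before the test start."""
    cutoff = test_start - embargo_size
    return [i for i in train_idx if i < cutoff]


train_idx = list(range(90))  # bars 0..89, test window starts at bar 90
kept = apply_embargo(train_idx, test_start=90, embargo_size=3)
print(len(kept), kept[-1])  # 87 86
```

The embargo is applied in addition to purging: purging removes samples whose labels overlap the test period, and the embargo then removes a further fixed buffer.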

## Combinatorial Purged Cross-Validation (CPCV)

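CPCV (Lopez de Prado, 2018) partitions the data into `N` contiguous groups and evaluates every choice of `k` of them as the test set. Training groups may fall on either side of a test group, so temporal order is preserved within each group rather than between train and test, and purging and embargo handle leakage at each boundary. Recombining the test groups yields several backtest paths where standard walk-forward yields one, which enables statistical tests for overfitting.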

**Key properties:**
- Splits data into `N` contiguous groups
- For each combination of `k` test groups, the remaining `N-k` groups form the training set
- Applies purging and embargo at each train/test boundary
- Produces `C(N, k)` train/test splits, which recombine into `k/N * C(N, k)` backtest paths (e.g., N=6, k=2 gives 15 splits and 5 paths)

See `references/methodology.md` for the full CPCV algorithm and formulas.
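
The group-level enumeration can be sketched with the standard library. This is an illustration of the combinatorics only: purging and embargo at the boundaries are omitted, and the function name `cpcv_splits` is hypothetical.

```python
from itertools import combinations
from math import comb


def cpcv_splits(
    n_groups: int, k_test: int
) -> list[tuple[tuple[int, ...], tuple[int, ...]]]:
    """Enumerate (train_groups, test_groups) for every choice of k test groups."""
    groups = range(n_groups)
    splits = []
    for test_groups in combinations(groups, k_test):
        train_groups = tuple(g for g in groups if g not in test_groups)
        splits.append((train_groups, test_groups))
    return splits


splits = cpcv_splits(n_groups=6, k_test=2)
n_splits = len(splits)          # C(6, 2) = 15 train/test splits
n_paths = n_splits * 2 // 6     # k/N * C(N, k) = 5 backtest paths
print(n_splits, n_paths)        # 15 5
print(splits[0])                # ((2, 3, 4, 5), (0, 1))
```

Each group appears as a test group in `C(N-1, k-1)` splits (5 here), which is why the 15 splits recombine into 5 complete paths.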

## Overfit Detection

### Deflated Sharpe Ratio (DSR)

The observed Sharpe ratio must be adjusted for:
- Number of strategies tested (multiple testing)
- Non-normality of returns (skewness, kurtosis)
- Length of the backtest

```python
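from typing import Optional

import numpy as np
from scipy.stats import norm

def deflated_sharpe_ratio(
    observed_sr: float,
    num_trials: int,
    backtest_length: int,
    skewness: float = 0.0,
    kurtosis: float = 3.0,
    trials_sr_std: Optional[float] = None,
) -> float:
    """Compute the deflated Sharpe ratio: the confidence that the true SR is
    positive after correcting for selection across `num_trials`.

    Args:
        observed_sr: Per-period (non-annualized) Sharpe ratio of the selected
            strategy, at the same frequency as the return observations.
        num_trials: Number of strategies tested (including discarded ones).
        backtest_length: Number of return observations.
        skewness: Skewness of returns.
        kurtosis: Kurtosis of returns (non-excess; 3.0 for a normal distribution).
        trials_sr_std: Standard deviation of per-period SRs across the trials.
            If None, falls back to sqrt(1 / (T - 1)), the null-hypothesis value
            (an assumption; pass the measured value when available).

    Returns:
        DSR in [0, 1] (higher means stronger evidence that SR is genuinely > 0).
    """
    # Standard error of the SR estimate, adjusted for non-normal returns.
    sr_std = np.sqrt(
        (1 - skewness * observed_sr + (kurtosis - 1) / 4 * observed_sr**2)
        / (backtest_length - 1)
    )
    if trials_sr_std is None:
        trials_sr_std = np.sqrt(1 / (backtest_length - 1))
    if num_trials > 1:
        # Expected max SR under the null (Euler-Mascheroni approximation),
        # scaled by the dispersion of SRs across trials.
        euler_mascheroni = 0.5772156649
        expected_max_sr = trials_sr_std * (
            (1 - euler_mascheroni) * norm.ppf(1 - 1 / num_trials)
            + euler_mascheroni * norm.ppf(1 - 1 / (num_trials * np.e))
        )
    else:
        # A single trial needs no deflation; norm.ppf(0) would be -inf.
        expected_max_sr = 0.0
    dsr = norm.cdf((observed_sr - expected_max_sr) / sr_std)
    return float(dsr)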
```

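A DSR below 0.95 means the observed Sharpe ratio cannot be distinguished, at the 5% significance level, from the best result expected by chance across the trials tested, so overfitting cannot be ruled out.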
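
The function above takes the Sharpe ratio, skewness, and kurtosis as inputs. A minimal sketch of deriving them from a return series follows, using only the standard library. It returns the per-period Sharpe ratio and the non-excess kurtosis (3.0 for a normal distribution), which is the convention the `(kurtosis - 1) / 4` term and the 3.0 default assume. The helper name `sharpe_inputs`, the zero risk-free rate, and the use of population moments are assumptions of this sketch.

```python
from statistics import fmean


def sharpe_inputs(returns: list[float]) -> tuple[float, float, float]:
    """Return (per-period SR, skewness, non-excess kurtosis) of a return series.

    Uses population (biased) moments and a zero risk-free rate.
    """
    n = len(returns)
    mean = fmean(returns)
    centered = [r - mean for r in returns]
    m2 = sum(c**2 for c in centered) / n  # variance
    m3 = sum(c**3 for c in centered) / n  # third central moment
    m4 = sum(c**4 for c in centered) / n  # fourth central moment
    std = m2**0.5
    return mean / std, m3 / std**3, m4 / std**4


daily_returns = [0.01, -0.02, 0.015, 0.03, -0.005, 0.02, -0.01, 0.005]
sr, skew, kurt = sharpe_inputs(daily_returns)
print(f"per-period SR={sr:.3f}, skewness={skew:.3f}, kurtosis={kurt:.3f}")
```

Pass the per-period Sharpe ratio together with the number of return observations rather than an annualized figure, so the standard error is computed on a consistent scale.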

### Probability of Backtest Overfitting (PBO)

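PBO uses CPCV to measure the fraction of combinatorial splits in which the strategy that ranks best in-sample falls below the median of all candidate strategies out-of-sample. A PBO above 0.50 means the in-sample winner is more likely than not to underperform the median out-of-sample, which indicates overfitting.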

See `references/overfit_detection.md` for complete derivations and implementation details.
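
A simplified sketch of the PBO estimate, assuming you already have an in-sample and an out-of-sample score for every candidate strategy on every split. It counts below-median outcomes directly and does not compute the logit of the relative rank used in the full method. The function name and the example scores are hypothetical.

```python
from statistics import median


def probability_of_backtest_overfitting(
    is_scores: list[list[float]], oos_scores: list[list[float]]
) -> float:
    """Fraction of splits where the in-sample best strategy is below the OOS median.

    is_scores[s][j] and oos_scores[s][j] hold the score of strategy j on split s.
    """
    below_median = 0
    for is_row, oos_row in zip(is_scores, oos_scores):
        # Index of the strategy that would have been selected in-sample.
        best = max(range(len(is_row)), key=is_row.__getitem__)
        if oos_row[best] < median(oos_row):
            below_median += 1
    return below_median / len(is_scores)


# Four splits, three candidate strategies (illustrative numbers only).
is_scores = [[1.0, 0.5, 0.2], [0.3, 0.9, 0.1], [0.2, 0.4, 0.8], [1.1, 0.2, 0.3]]
oos_scores = [[0.1, 0.4, 0.6], [0.5, 0.7, 0.2], [0.3, 0.2, -0.1], [0.6, 0.1, 0.2]]
pbo = probability_of_backtest_overfitting(is_scores, oos_scores)
print(pbo)  # 0.5
```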

## Crypto-Specific Considerations

1. **Shorter windows**: Crypto regimes change faster than equities. A 90-day rolling window may be more appropriate than 252 days.
2. **24/7 markets**: No weekends or holidays to account for, but funding rate resets (typically every 8h on perps) create microstructure effects.
3. **Survivorship bias**: Many tokens delist. Validation must include delisted tokens or at minimum acknowledge this limitation.
4. **Liquidity regime shifts**: A token's liquidity profile can change dramatically (new CEX listing, liquidity mining end). Train/test splits should ideally not straddle major liquidity events.
5. **Data availability**: Many tokens have < 1 year of data. Expanding windows with small `min_train_size` may be necessary.

## Practical Window Sizes for Crypto

| Strategy Timeframe | Train Window | Test Window | Embargo |
|---|---|---|---|
| Scalping (1-5min) | 3-7 days | 1 day | 2-4 hours |
| Intraday (15min-1h) | 14-30 days | 3-7 days | 12-24 hours |
| Swing (4h-daily) | 30-90 days | 7-14 days | 2-5 days |
| Position (daily-weekly) | 90-180 days | 30 days | 5-10 days |
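
Because crypto trades around the clock, the durations in this table convert to bar counts with simple arithmetic. The sketch below works through the intraday row for 15-minute bars; the helper name `days_to_bars` is hypothetical.

```python
def days_to_bars(days: float, bar_minutes: int) -> int:
    """Convert a window length in days to a bar count for a 24/7 market."""
    return round(days * 24 * 60 / bar_minutes)


# Intraday row, lower bounds: 14-day train, 3-day test, 12-hour embargo.
train_bars = days_to_bars(14, bar_minutes=15)
test_bars = days_to_bars(3, bar_minutes=15)
embargo_bars = days_to_bars(0.5, bar_minutes=15)
print(train_bars, test_bars, embargo_bars)  # 1344 288 48
```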

## Quick Start

```python
from walk_forward import WalkForwardValidator, WalkForwardConfig

config = WalkForwardConfig(
    train_size=90,
    test_size=14,
    step_size=14,
    window_type="rolling",
    embargo_size=3,
    purge_horizon=1,
)

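# `price_data` (the bar-level dataset) and `model` (any estimator with
# fit/predict) are supplied by the caller; this snippet does not define them.
validator = WalkForwardValidator(config)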
for fold in validator.split(price_data):
    model.fit(fold.train_X, fold.train_y)
    predictions = model.predict(fold.test_X)
    fold.record_performance(predictions, fold.test_y)

results = validator.aggregate_results()
print(f"OOS Sharpe: {results.oos_sharpe:.3f}")
print(f"Train/Test Sharpe ratio: {results.sharpe_ratio_ratio:.2f}")
```

## Files

### References
- `references/methodology.md` — Walk-forward theory, window types, purging, embargo, CPCV algorithm with formulas
- `references/overfit_detection.md` — Deflated Sharpe ratio, probability of backtest overfitting, multiple testing corrections
- `references/practical_guide.md` — Window size selection for crypto, regime considerations, common validation mistakes

### Scripts
- `scripts/walk_forward.py` — Walk-forward validation engine with rolling and expanding windows; `--demo` mode with synthetic data
- `scripts/overfit_detector.py` — Deflated Sharpe ratio and PBO computation; `--demo` mode with synthetic backtest results

Related Skills

yield-analysis

from agiprolabs/claude-trading-skills

DeFi yield evaluation including fee APR, real vs nominal yield, net APY after costs, and yield sustainability analysis

yellowstone-grpc

from agiprolabs/claude-trading-skills

Real-time Solana transaction and account streaming via Yellowstone gRPC (Geyser plugin)

whale-tracking

from agiprolabs/claude-trading-skills

Large wallet monitoring, accumulation and distribution detection, and smart money signal generation for Solana tokens

wash-sale-detection

from agiprolabs/claude-trading-skills

Wash sale detection under 2025 US crypto rules with 61-day window monitoring, disallowed loss tracking, and safe re-entry countdown

wallet-profiling

from agiprolabs/claude-trading-skills

Behavioral classification, performance analysis, and trading style detection for Solana wallets

volatility-modeling

from agiprolabs/claude-trading-skills

Volatility estimation, forecasting, and regime classification using GARCH, EWMA, realized volatility, and volatility cones

vectorbt

from agiprolabs/claude-trading-skills

High-performance vectorized backtesting with parameter optimization, portfolio simulation, and rich performance metrics

trading-visualization

from agiprolabs/claude-trading-skills

Professional trading charts including candlesticks, equity curves, drawdowns, correlation heatmaps, and return distributions

trade-journal

from agiprolabs/claude-trading-skills

Structured trade logging, performance review, behavioral pattern detection, and strategy attribution for systematic improvement

trade-accounting

from agiprolabs/claude-trading-skills

Double-entry bookkeeping for trading operations with ledger management, P&L statements, balance sheets, and cash flow reporting

token-holder-analysis

from agiprolabs/claude-trading-skills

Token holder distribution, concentration metrics, insider detection, and supply analysis for Solana tokens

token-economics

from agiprolabs/claude-trading-skills

Token supply dynamics, vesting analysis, inflation modeling, and valuation frameworks for crypto tokens