numerai-experiment-design

Design and manage Numerai experiments in this repo for any model idea.

1,123 stars

Best use case

numerai-experiment-design is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using numerai-experiment-design should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/numerai-experiment-design/SKILL.md --create-dirs "https://raw.githubusercontent.com/numerai/example-scripts/main/numerai/agents/skills/numerai-experiment-design/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/numerai-experiment-design/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How numerai-experiment-design Compares

| Feature / Agent | numerai-experiment-design | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Design and manage Numerai experiments in this repo for any model idea.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Numerai Experiment Design
Use this workflow to plan, run, and report Numerai experiments for any model idea.

Note: run commands from `numerai/` (so `agents` is importable), or from repo root with `PYTHONPATH=numerai`.

## Persistence expectation (required)

This skill is *not* complete after a single promising run. You must run experiments in **rounds** (typically **4–5 configs per round**), synthesize results, and decide what to try next. Only finalize when you reach a plateau and additional rounds stop improving the primary metric.

## Planning checklist (answer before running)
- State the model idea and novelty.
- Choose the initial baseline and feature set. Default to `deep_lgbm_ender20_baseline` (feature_set=all) unless the user explicitly requests the small baseline; keep experiments' feature_set aligned with the chosen baseline.
- Confirm the primary metrics (`bmc_mean` and `bmc_last_200_eras`), where BMC is Benchmark Model Contribution measured against the official `v52_lgbm_ender20` benchmark.
- Decide which parameter dimensions to explore based on the core idea (targets, model hyperparameters, ensemble weights, data settings).
- Or decide that only a minimal round is needed because the change is tiny — but still run multiple variants unless the user explicitly requested exactly one run.

## Handling ambiguity (fast disambiguation)
If the user's request is unclear or underspecified:
1) List 2–4 plausible interpretations (keep them meaningfully different).
2) Implement **quick scout runs** for each interpretation (downsampled data, conservative compute).
3) Compare `bmc_mean` and `bmc_last_200_eras`.
4) Use the best-BMC interpretation going forward, and document the choice + rationale in `experiment.md`.

## Workflow 
Core loop (repeat for each experiment round):
1) If the model type is new, implement it with the numerai-model-implementation skill.
2) Create/update **4–5 configs** for the current round (one base + single-variable variants).
3) Run training for each config via `PYTHONPATH=numerai python3 -m agents.code.modeling --config <config> --output-dir <experiment_dir>`, which calls `pipeline.py` for CV/OOF + results.
4) Wait for the whole round to finish, then **synthesize** results:
   - pick the current best by `bmc_last_200_eras.mean` (primary), with `bmc_mean` as a tie-breaker
   - sanity-check `corr_mean` and `avg_corr_with_benchmark` (avoid “high corr, low BMC” traps)
   - check stability (drawdown/sharpe) and whether the improvement is consistent across eras
5) Update `experiment.md` with: what changed this round, the metrics table, and the next-round decision.
6) Repeat rounds until a plateau is reached (see “When to stop” below), then scale the winner.
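
The core loop above can be sketched in Python. This is a minimal illustration, not the repo's own tooling: `build_train_command` only assembles the command from step 3, and `pick_best` applies the selection rule from step 4. The metrics layout (a nested `bmc_last_200_eras` dict with a `mean` key, plus a flat `bmc_mean`), the config file name, the experiment folder name, and all numbers are assumptions made for the example.

```python
import os


def build_train_command(config_path, experiment_dir):
    """Return (argv, env) mirroring the training command from step 3."""
    argv = [
        "python3", "-m", "agents.code.modeling",
        "--config", config_path,
        "--output-dir", experiment_dir,
    ]
    # PYTHONPATH=numerai makes the `agents` package importable from repo root.
    env = dict(os.environ, PYTHONPATH="numerai")
    return argv, env


def pick_best(results):
    """Pick the config with the highest bmc_last_200_eras.mean.

    bmc_mean breaks ties. ASSUMPTION: `results` maps a config name to
    {"bmc_last_200_eras": {"mean": float}, "bmc_mean": float}; the real
    results files may be laid out differently.
    """
    def sort_key(name):
        metrics = results[name]
        return (metrics["bmc_last_200_eras"]["mean"], metrics["bmc_mean"])

    return max(results, key=sort_key)


# Made-up numbers, for illustration only.
round_results = {
    "base": {"bmc_last_200_eras": {"mean": 0.0030}, "bmc_mean": 0.0041},
    "lr_0p005": {"bmc_last_200_eras": {"mean": 0.0034}, "bmc_mean": 0.0040},
    "depth_8": {"bmc_last_200_eras": {"mean": 0.0034}, "bmc_mean": 0.0043},
}

# Hypothetical config path and experiment folder.
argv, env = build_train_command("configs/base.json", "agents/experiments/demo")
# To launch for real: subprocess.run(argv, env=env, check=True)
print(" ".join(argv))
print(pick_best(round_results))
```

The sketch covers only the ranking part of step 4; the `corr_mean`, `avg_corr_with_benchmark`, and stability checks still need a human or agent review of the full metrics.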

## Scout -> Scale
1) **Use downsampled**: Use `v5.2/downsampled_full.parquet` + `v5.2/downsampled_full_benchmark_models.parquet` to save memory and time when experimenting.
2) **Pick the sweep dimension that matches the core idea**: Run a focused sweep only when it serves the research question; otherwise run a single experiment config and evaluate.
3) **Iterate until improvements stop**: Keep sweeping on that dimension while a round produces a new best metric. If a round does not improve, reassess or pivot.
4) **Focus when a parameter dominates**: If one parameter clearly drives results, dedicate a full round to mapping its range (including extremes) while holding others fixed.
5) **Scale only winners**: Once a best option is determined in the small-baseline phase, move to phase 2: switch to the deep baseline with `feature_set=all`, and scale the more expensive parameters such as `n_estimators` and network size, if applicable.
6) **Full data final**: Run the top config on full data and record the final metrics and final bmc when you stop finding improvements.

## When to stop (plateau criteria)

Stop iterating only when **at least two consecutive rounds** fail to beat the current best `bmc_last_200_eras.mean` by a meaningful margin (rule of thumb: ~`1e-4`–`3e-4`), *and* the remaining untried knobs are either redundant with what you already swept or likely to increase overfit/benchmark-correlation.

If you plateau on downsampled data, do *one* confirmatory scale step (bigger feature set and/or more data) before concluding the idea is maxed out.
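
The quantitative half of the plateau rule can be written as a small check. This is a sketch under stated assumptions: `round_bests` is a hypothetical list holding the best `bmc_last_200_eras.mean` from each round in order, `min_gain` defaults to the low end of the rule-of-thumb margin, and the example scores are invented.

```python
def reached_plateau(round_bests, min_gain=1e-4, patience=2):
    """True when the last `patience` rounds each failed to beat the
    current best score by at least `min_gain`."""
    if len(round_bests) <= patience:
        return False  # not enough rounds to observe `patience` failures
    best = round_bests[0]
    stale = 0
    for score in round_bests[1:]:
        if score > best + min_gain:
            stale = 0  # meaningful improvement resets the counter
        else:
            stale += 1
        best = max(best, score)
    return stale >= patience


# Made-up round-by-round best scores.
print(reached_plateau([0.0030, 0.0034, 0.00341, 0.00338]))
print(reached_plateau([0.0030, 0.0034, 0.00341]))
```

A `True` result satisfies only the first condition. Whether the remaining untried knobs are redundant or likely to increase overfit is still a judgement call, as is the confirmatory scale step.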

## Sweep selection by research type
Note that these are examples only. Each idea will call for different sweeps, or none. Treat them as guidelines and use your judgement to choose the experiments that best answer the core question: "does/can this core idea produce a model that has high bmc_mean?"
- **New target/label/feature engineering**: Sweep target variants or preprocessing settings; skip hyperparameter sweeps unless performance is unstable.
- **New model architecture**: Run a hyperparameter sweep (depth/width, learning rate, regularization, epochs).
- **Ensemble/blend/stacking**: Sweep combination weights, blend rules, or stacker settings.
- **Training-procedure change**: Sweep procedure-specific params (loss weights, neutralization strength, sampling).
- **Data change**: Sweep universe, era sampling, or feature-set choices.

## Sweep design guidance
- Use one-variable-at-a-time changes for each run in the chosen sweep dimension.
- Build a base config per round, then create variants that change a single parameter or variant.
- Take time to design each round based on last-round results, model type, and known sensitivities.
- If scaling depth, width, `n_estimators`, or a related parameter, consider lowering the learning rate and/or increasing epochs in conjunction.
- Track and compare per-round results; keep the best model and document why it won.
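
One way to build a round of one-variable-at-a-time configs is sketched below. It assumes a config can be treated as a flat dict, which may not match the repo's actual config format; the parameter names, values, and the `<param>_<value>` naming scheme are illustrative only.

```python
import copy


def make_variants(base, sweep):
    """Return {config_name: config}; each variant differs from `base`
    in exactly one parameter.

    base:  flat config dict (ASSUMPTION: real configs may be nested).
    sweep: {param_name: [candidate values]}.
    """
    configs = {"base": copy.deepcopy(base)}
    for param, values in sweep.items():
        for value in values:
            if base.get(param) == value:
                continue  # identical to base, not a real variant
            variant = copy.deepcopy(base)
            variant[param] = value
            # Name the config after the single change it makes.
            configs[f"{param}_{value}"] = variant
    return configs


# Hypothetical base config and sweep values.
base_config = {"feature_set": "all", "learning_rate": 0.01, "max_depth": 6}
round_configs = make_variants(
    base_config, {"learning_rate": [0.005, 0.02], "max_depth": [5, 8]}
)
for name in round_configs:
    print(name)
```

With two values on each of two parameters, this yields one base plus four variants, which fits the 4–5 configs per round used in this workflow.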

## Baseline alignment
- Declare which baseline the model is aiming to improve on.
- Keep `feature_set` aligned with the baseline for comparisons.
- Default to ender20 (`v52_lgbm_ender20`) as the benchmark reference and plot baseline, even when sweeping; only use the small baseline when explicitly requested.

## Experiment organization
- Keep related runs under a single, well-named folder in `agents/experiments/`.
- One experiment folder = one line of inquiry.
  - `configs/` for configs
  - `logs/` for run logs
  - `predictions/` + `results/` from OOF CV
  - `experiment.md` for summary and decisions. Declare the baseline in the experiment.md. Update the experiment.md as you progress.
- Include a **baseline row** in result tables for comparisons.
- Name configs to reflect the single variable change.
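
The folder layout above could be scaffolded as follows. This is a sketch, not a script that ships with the repo: the function name, the experiment name, and the starter text written to `experiment.md` are assumptions. The example runs in a temporary directory; a real call would pass `agents/experiments` as the root.

```python
import tempfile
from pathlib import Path

SUBDIRS = ("configs", "logs", "predictions", "results")


def init_experiment(root, name, baseline):
    """Create one experiment folder with the standard subfolders."""
    exp_dir = Path(root) / name
    for sub in SUBDIRS:
        (exp_dir / sub).mkdir(parents=True, exist_ok=True)
    notes = exp_dir / "experiment.md"
    if not notes.exists():  # never overwrite existing notes
        # Declare the baseline up front, as the checklist requires.
        notes.write_text(f"# {name}\n\nBaseline: {baseline}\n", encoding="utf-8")
    return exp_dir


# Hypothetical experiment name, created in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    exp = init_experiment(tmp, "target_blend_sweep", "deep_lgbm_ender20_baseline")
    print(sorted(p.name for p in exp.iterdir()))
```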

## Reporting expectations
- Run experiments in **rounds** and wait for each round to finish before reporting, so you don't report prematurely.
- Once you complete your research and stop finding improvements, write a report for the user. It should describe learnings (what worked and what did not) and include the final stats table. Then run `PYTHONPATH=numerai python3 -m agents.code.analysis.show_experiment benchmark <best_model> --base-benchmark-model v52_lgbm_ender20 --benchmark-data-path numerai/v5.2/full_benchmark_models.parquet --start-era 575 --dark --output-dir <experiment_dir> --baselines-dir numerai/agents/baselines` to generate the cumulative corr + BMC plot, and share the output path.
- Use `python -m agents.code.analysis.plot_benchmark_corrs` only when comparing official benchmark model columns, not for experiment BMC curves.
- Always report:
  - `bmc` (full) and `bmc_last_200_eras`
  - `corr_mean` and `avg_corr_with_benchmark` (corr vs the official benchmark predictions)
- Use consistent markdown tables and update `experiment.md` after each run.
- Include a cohesive plan and story, finishing with a final result that combines learnings from all experiments. Think of yourself as a scientist writing a paper that walks the reader through your discoveries and thought process so that they understand why you finished with the result you did.
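
A consistent results table with a baseline row could be rendered with a helper like the one below. It is a sketch only: the function name is hypothetical, the column keys reuse the metric names listed above but may not match the keys in the actual results files, and every number is invented.

```python
COLUMNS = ("bmc", "bmc_last_200_eras", "corr_mean", "avg_corr_with_benchmark")


def results_table(baseline, runs, columns=COLUMNS):
    """Render a markdown table; `baseline` is always the first data row.

    baseline: (name, metrics dict); runs: list of (name, metrics dict).
    """
    lines = [
        "| model | " + " | ".join(columns) + " |",
        "|" + " --- |" * (len(columns) + 1),
    ]
    for name, metrics in [baseline, *runs]:
        cells = " | ".join(f"{metrics[col]:.4f}" for col in columns)
        lines.append(f"| {name} | {cells} |")
    return "\n".join(lines)


# Made-up metrics, for illustration only.
baseline_row = ("deep_lgbm_ender20_baseline", {
    "bmc": 0.0021, "bmc_last_200_eras": 0.0018,
    "corr_mean": 0.0290, "avg_corr_with_benchmark": 0.9100,
})
run_rows = [("depth_8", {
    "bmc": 0.0043, "bmc_last_200_eras": 0.0034,
    "corr_mean": 0.0301, "avg_corr_with_benchmark": 0.8200,
})]
print(results_table(baseline_row, run_rows))
```

Keeping the column order fixed across rounds makes the tables in `experiment.md` directly comparable.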

## Dataset handling
- Build datasets with `python -m agents.code.data.build_full_datasets`.
  - Full: `numerai/v5.2/full.parquet`, `numerai/v5.2/full_benchmark_models.parquet`
  - Downsampled (every 4 eras): `numerai/v5.2/downsampled_full.parquet`, `numerai/v5.2/downsampled_full_benchmark_models.parquet`
- Prefer downsampled for quick iteration; only scale after a clear signal for the final model.

## Useful entry points
- `PYTHONPATH=numerai python3 -m agents.code.modeling` (training + metrics)
- `agents/code/metrics/numerai_metrics.py` (BMC/corr summaries)
- `PYTHONPATH=numerai python3 -m agents.code.analysis.show_experiment` (compare runs)
- `PYTHONPATH=numerai python3 -m agents.code.data.build_full_datasets` (full + downsampled datasets)

## Deployment (after experiments complete)
Once you have finalized your best model and created a pkl file using the `numerai-model-upload` skill:

1. **Offer deployment**: Ask the user if they want to deploy the pkl to Numerai for automated submissions.

2. **Deployment options** (via the Numerai MCP server):
   - **Create a new model**: Use `create_model` to create a new model slot, then upload the pkl
   - **Upload to existing model**: List the user's existing models and upload to one they choose

3. **Follow the `numerai-model-upload` skill** for the complete deployment workflow using the MCP server tools (`create_model`, `upload_model`, `graphql_query`).

This allows the full research-to-deployment workflow to happen in a single session.

Related Skills

numerai-research

from numerai/example-scripts

End-to-end Numerai research workflow for trying a new idea: design experiments, implement new model types if needed, run scout→scale experiments, write a full experiment.md report with standard plots, and optionally package/upload a Numerai pickle. Use when a user asks to “try/test a new idea”, “run an experiment”, “sweep configs”, “compare model variants”, or otherwise do new Numerai research.

numerai-model-upload

from numerai/example-scripts

Create Numerai Tournament model upload pickles (.pkl) with a self-contained predict() function. Use when preparing upload artifacts, debugging numerai_predict import errors, or documenting model-upload requirements and testing steps.

numerai-model-implementation

from numerai/example-scripts

Add a new Numerai model type to the agents training pipeline. Use when you need to register a model in `agents/code/modeling/utils/model_factory.py`, handle fit/predict quirks in `agents/code/modeling/utils/numerai_cv.py`, and update configs so the model can run via `python -m agents.code.modeling`.

report-research

from numerai/example-scripts

Write a complete Numerai experiment report in experiment.md (abstract, methods, results tables, decisions, next steps) and generate/link the standard show_experiment plot(s). Use after running any Numerai research experiments, or when a user asks for a “full report”, “write up”, “experiment.md update”, or “generate the standard plot”.
