atft-training
Run and monitor ATFT-GAT-FAN training loops, hyper-parameter sweeps, and safety modes on A100 GPUs.
Best use case
atft-training is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using atft-training should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/atft-training/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How atft-training Compares
| Feature / Agent | atft-training | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Run and monitor ATFT-GAT-FAN training loops, hyper-parameter sweeps, and safety modes on A100 GPUs.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# ATFT Training Skill

## Mission
- Launch production-grade training for the Graph Attention Network forecaster with correct dataset/version parity.
- Tune hyper-parameters (LR, batch size, horizons, latent dims) exploiting 80GB GPU headroom.
- Safely resume, stop, or monitor long-running jobs and record experiment metadata.

## Engagement Triggers
- Requests to “train”, “fine-tune”, “HP optimize”, “resume training”, or “monitor training logs”.
- Need to validate new dataset compatibility with model code.
- Investigations into training stalls, divergence, or GPU under-utilization.

## Preflight Safety Checks
1. Dataset freshness: `ls -lh output/ml_dataset_latest_full.parquet` then `python scripts/utils/dataset_guard.py --assert-recency 72`.
2. Environment health: `tools/project-health-check.sh --section training`.
3. GPU allocation: `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv` (target >60% util, <76GB used baseline).
4. Git hygiene: `git status --short`; ensure working tree state is understood (avoid accidental overrides during long runs).

## Training Playbooks

### 1. Production Optimized Training (default 120 epochs)
1. `make train-optimized DATASET=output/ml_dataset_latest_full.parquet`: compiles TorchInductor + FlashAttention2.
2. `make train-monitor`: tails `_logs/training/train-optimized.log`.
3. `make train-status`: polls background process; ensure ETA < 7h.
4. Post-run validation:
   - `python scripts/eval/aggregate_metrics.py runs/latest`: compute Sharpe, RankIC, hit ratios.
   - Update `results/latest_training_summary.md`.

### 2. Quick Validation / Smoke
1. `make train-quick EPOCHS=3`: run in foreground.
2. `python scripts/smoke_test.py --max-epochs 1 --subset 512` for additional regression guard.
3. `pytest tests/integration/test_training_loop.py::test_forward_backward` if suspicious gradients.

### 3. Safe Mode / Debug
1. `make train-safe`: disables compile, single-worker dataloading.
2. `make train-stop` if hung jobs detected (consult `_logs/training/pids/`).
3. `python scripts/integrated_ml_training_pipeline.py --profile --epochs 2 --no-compile`: capture flamegraph to `benchmark_output/`.

### 4. Hyper-Parameter Exploration
1. Ensure `mlflow` backend running if required (`make mlflow-up`).
2. `make hpo-run HPO_TRIALS=24 HPO_STUDY=atft_prod_lr_sched`: uses Optuna integration.
3. `make hpo-status`: track trial completions.
4. Promote winning config → `configs/training/atft_prod.yaml` and document in `EXPERIMENT_STATUS.md`.

## Monitoring & Telemetry
- Training logs: `_logs/training/*.log` (includes gradient norms, learning rate schedule, GPU temp).
- Metrics JSONL: `runs/<timestamp>/metrics.jsonl`.
- Checkpoint artifacts: `models/checkpoints/<timestamp>/epoch_###.pt`.
- GPU telemetry: `watch -n 30 nvidia-smi` or `python tools/gpu_monitor.py --pid $(cat _logs/training/pids/train.pid)`.

## Failure Handling
- **NaN loss** → run `make train-safe` with `FP32=1`, inspect `runs/<ts>/nan_batches.json`.
- **Slow dataloading** → regenerate dataset with `make dataset-gpu GRAPH_WINDOW=90` or enable PyTorch compile caching.
- **OOM** → set `GRADIENT_ACCUMULATION_STEPS=2` or reduce `BATCH_SIZE`; confirm memory fragments via `python tools/gpu_memory_report.py`.
- **Divergent metrics** → verify `configs/training/schedule.yaml`; run `pytest tests/unit/test_loss_functions.py`.

## Codex Collaboration
- Invoke `./tools/codex.sh --max "Design a new learning rate policy for ATFT-GAT-FAN"` when novel optimizer or architecture strategy is required.
- Use `codex exec --model gpt-5-codex "Analyze runs/<timestamp>/metrics.jsonl and suggest fixes"` for automated postmortems.
- Share Codex-discovered tuning insights in `results/training_runs/` and update config files/documents accordingly.

## Post-Training Handoff
- Persist summary in `results/training_runs/<timestamp>.md` noting dataset hash and commit SHA.
- Push model weights to `models/artifacts/` with naming `gatfan_<date>_Sharpe<score>.pt`.
- Notify research team via `docs/research/changelog.md`.
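
## Example: Checking the GPU Preflight Targets

The GPU allocation step in the preflight checks compares `nvidia-smi` CSV output against two targets (>60% utilization, <76GB used). The following is a minimal sketch of that comparison, not part of the repository. The function names `parse_gpu_csv` and `check_gpus` are hypothetical, the sample text is illustrative rather than real telemetry, and reading "76GB" as 76 × 1024 MiB is an assumption.

```python
import csv
import io

# Targets read literally from the preflight checklist. Treating "76GB" as
# 76 * 1024 MiB is an assumption; nvidia-smi reports memory.used in MiB.
MIN_UTIL_PCT = 60.0
MAX_MEM_USED_MIB = 76 * 1024


def parse_gpu_csv(text):
    """Parse output of
    `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv`.
    """
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    next(reader)  # skip header: "utilization.gpu [%], memory.used [MiB]"
    rows = []
    for fields in reader:
        if not fields:  # tolerate blank lines
            continue
        util, mem = fields
        rows.append({
            "gpu": len(rows),                        # position in the output
            "util_pct": float(util.split()[0]),      # "83 %" -> 83.0
            "mem_used_mib": float(mem.split()[0]),   # "41250 MiB" -> 41250.0
        })
    return rows


def check_gpus(rows):
    """Return warning strings; an empty list means every target is met."""
    warnings = []
    for row in rows:
        if row["util_pct"] <= MIN_UTIL_PCT:
            warnings.append(
                f"GPU {row['gpu']}: utilization {row['util_pct']:.0f}% "
                f"is not above {MIN_UTIL_PCT:.0f}%"
            )
        if row["mem_used_mib"] >= MAX_MEM_USED_MIB:
            warnings.append(
                f"GPU {row['gpu']}: {row['mem_used_mib']:.0f} MiB used "
                f"is not below {MAX_MEM_USED_MIB} MiB"
            )
    return warnings


# Illustrative sample in the nvidia-smi CSV layout, not real telemetry.
SAMPLE = """utilization.gpu [%], memory.used [MiB]
83 %, 41250 MiB
12 %, 79000 MiB
"""

if __name__ == "__main__":
    for line in check_gpus(parse_gpu_csv(SAMPLE)) or ["all GPUs within targets"]:
        print(line)
```

In practice the sample string would be replaced by the captured output of the `nvidia-smi` command from the checklist, for example via `subprocess.run(..., capture_output=True, text=True)`.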