atft-training

Run and monitor ATFT-GAT-FAN training loops, hyper-parameter sweeps, and safety modes on A100 GPUs.

by diegosouzapw

Best use case

atft-training is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using atft-training should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/atft-training/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/machine-learning/atft-training/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/atft-training/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How atft-training Compares

Feature / Agent	atft-training	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Run and monitor ATFT-GAT-FAN training loops, hyper-parameter sweeps, and safety modes on A100 GPUs.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# ATFT Training Skill

## Mission
- Launch production-grade training for the Graph Attention Network forecaster with correct dataset/version parity.
- Tune hyper-parameters (LR, batch size, horizons, latent dims) exploiting 80GB GPU headroom.
- Safely resume, stop, or monitor long-running jobs and record experiment metadata.

## Engagement Triggers
- Requests to “train”, “fine-tune”, “HP optimize”, “resume training”, or “monitor training logs”.
- Need to validate new dataset compatibility with model code.
- Investigations into training stalls, divergence, or GPU under-utilization.

## Preflight Safety Checks
1. Dataset freshness: `ls -lh output/ml_dataset_latest_full.parquet` then `python scripts/utils/dataset_guard.py --assert-recency 72`.
2. Environment health: `tools/project-health-check.sh --section training`.
3. GPU allocation: `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv` (target >60% util, <76GB used baseline).
4. Git hygiene: run `git status --short` and make sure the working tree state is understood (avoid accidental overrides during long runs).
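
The GPU allocation check in step 3 can be automated. The following is a minimal sketch, not part of the repository: it assumes `nvidia-smi` prints a header line followed by rows such as `45 %, 1024 MiB`, and the helper names (`parse_gpu_csv`, `memory_headroom_ok`, `low_utilization`) are hypothetical. It also treats the "<76GB" baseline as 76 GiB, which is an assumption.

```python
import subprocess

# Same query as in the preflight checklist.
QUERY = ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used", "--format=csv"]

def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output into (util_percent, mem_used_mib) tuples."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    rows = []
    for line in lines[1:]:  # the first line is the CSV header
        util_field, mem_field = [f.strip() for f in line.split(",")]
        util = int(util_field.split()[0])  # "45 %" -> 45
        mem = int(mem_field.split()[0])    # "1024 MiB" -> 1024
        rows.append((util, mem))
    return rows

def memory_headroom_ok(rows, max_used_gib=76):
    """True if every GPU is below the memory baseline (assumed to be GiB)."""
    return all(mem < max_used_gib * 1024 for _, mem in rows)

def low_utilization(rows, min_util=60):
    """Indices of GPUs at or below the utilization target."""
    return [i for i, (util, _) in enumerate(rows) if util <= min_util]

def query_gpus():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `memory_headroom_ok(query_gpus())` before launching a run gives a quick pass/fail on memory; `low_utilization` is more useful while a job is already training. Fields reported as `[N/A]` are not handled in this sketch.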

## Training Playbooks

### 1. Production Optimized Training (default 120 epochs)
1. `make train-optimized DATASET=output/ml_dataset_latest_full.parquet` — compiles TorchInductor + FlashAttention2.
2. `make train-monitor` — tails `_logs/training/train-optimized.log`.
3. `make train-status` — polls background process; ensure ETA < 7h.
4. Post-run validation:
   - `python scripts/eval/aggregate_metrics.py runs/latest` — compute Sharpe, RankIC, hit ratios.
   - Update `results/latest_training_summary.md`.

### 2. Quick Validation / Smoke
1. `make train-quick EPOCHS=3` — run in foreground.
2. `python scripts/smoke_test.py --max-epochs 1 --subset 512` as an additional regression guard.
3. `pytest tests/integration/test_training_loop.py::test_forward_backward` if gradients look suspicious.

### 3. Safe Mode / Debug
1. `make train-safe` — disables compile, single-worker dataloading.
2. `make train-stop` if hung jobs are detected (consult `_logs/training/pids/`).
3. `python scripts/integrated_ml_training_pipeline.py --profile --epochs 2 --no-compile` — capture flamegraph to `benchmark_output/`.
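
Before stopping a job, it can help to confirm that the PID recorded under `_logs/training/pids/` still refers to a live process. The sketch below shows one way to do that on POSIX systems; the helper names are hypothetical, and it assumes each PID file holds a single integer.

```python
import os
from pathlib import Path

def read_pid(pid_file):
    """Return the integer PID stored in pid_file, or None if missing or invalid."""
    try:
        return int(Path(pid_file).read_text().strip())
    except (FileNotFoundError, ValueError):
        return None

def is_running(pid):
    """Check whether a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs error checking; nothing is sent
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True
```

For example, `is_running(read_pid("_logs/training/pids/train.pid"))` distinguishes a stale PID file from a job that is alive but hung. Guard against `read_pid` returning `None` before passing the value on.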

### 4. Hyper-Parameter Exploration
1. Ensure the `mlflow` backend is running if required (`make mlflow-up`).
2. `make hpo-run HPO_TRIALS=24 HPO_STUDY=atft_prod_lr_sched` — uses Optuna integration.
3. `make hpo-status` — track trial completions.
4. Promote winning config → `configs/training/atft_prod.yaml` and document in `EXPERIMENT_STATUS.md`.

## Monitoring & Telemetry
- Training logs: `_logs/training/*.log` (includes gradient norms, learning rate schedule, GPU temp).
- Metrics JSONL: `runs/<timestamp>/metrics.jsonl`.
- Checkpoint artifacts: `models/checkpoints/<timestamp>/epoch_###.pt`.
- GPU telemetry: `watch -n 30 nvidia-smi` or `python tools/gpu_monitor.py --pid $(cat _logs/training/pids/train.pid)`.
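
As an illustration of working with the metrics file, the sketch below scans a JSONL file for records whose loss is missing or non-finite. The schema of `metrics.jsonl` is not documented here, so the `loss` key is an assumption, as are the helper names.

```python
import json
import math

def load_metrics(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

def find_bad_steps(records, key="loss"):
    """Indices of records whose `key` value is missing, NaN, or infinite."""
    bad = []
    for i, rec in enumerate(records):
        value = rec.get(key)  # `key` is a hypothetical field name
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            bad.append(i)
    return bad
```

Python's `json` module accepts the non-standard `NaN` and `Infinity` literals by default, so a diverged run that wrote them to the file still loads and is flagged rather than raising a parse error.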

## Failure Handling
- **NaN loss** → run `make train-safe` with `FP32=1`, inspect `runs/<ts>/nan_batches.json`.
- **Slow dataloading** → regenerate dataset with `make dataset-gpu GRAPH_WINDOW=90` or enable PyTorch compile caching.
- **OOM** → set `GRADIENT_ACCUMULATION_STEPS=2` or reduce `BATCH_SIZE`; check for memory fragmentation with `python tools/gpu_memory_report.py`.
- **Divergent metrics** → verify `configs/training/schedule.yaml`; run `pytest tests/unit/test_loss_functions.py`.
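
The OOM remedy relies on gradient accumulation: splitting a batch into smaller micro-batches and averaging their gradients gives the same update as the full batch while holding less in memory at once. The toy example below demonstrates that equivalence for a one-parameter linear model with mean squared error. It is a conceptual sketch in plain Python, not the training code, and it assumes the batch divides evenly into micro-batches. How the project's Makefile applies `GRADIENT_ACCUMULATION_STEPS` is not shown here.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def accumulated_grad(w, xs, ys, steps):
    """Average gradients over `steps` equally sized micro-batches."""
    size = len(xs) // steps  # assumes len(xs) is divisible by steps
    total = 0.0
    for s in range(steps):
        lo, hi = s * size, (s + 1) * size
        # Dividing by `steps` mirrors scaling each micro-batch loss
        # before the backward pass.
        total += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps
    return total
```

With two accumulation steps, each micro-batch is half the size of the original batch, so peak activation memory drops while the effective batch size stays the same.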

## Codex Collaboration
- Invoke `./tools/codex.sh --max "Design a new learning rate policy for ATFT-GAT-FAN"` when novel optimizer or architecture strategy is required.
- Use `codex exec --model gpt-5-codex "Analyze runs/<timestamp>/metrics.jsonl and suggest fixes"` for automated postmortems.
- Share Codex-discovered tuning insights in `results/training_runs/` and update config files/documents accordingly.

## Post-Training Handoff
- Persist summary in `results/training_runs/<timestamp>.md` noting dataset hash and commit SHA.
- Push model weights to `models/artifacts/` with naming `gatfan_<date>_Sharpe<score>.pt`.
- Notify research team via `docs/research/changelog.md`.
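
The handoff metadata can be gathered with a few standard-library calls. The following is a sketch under stated assumptions: the helper names are hypothetical, the dataset hash is taken to be a SHA-256 of the file, and the `<date>` and `<score>` placeholders are rendered as `YYYYMMDD` and a two-decimal number, none of which the skill specifies.

```python
import hashlib
import subprocess
from datetime import date

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def current_commit():
    """Return the SHA of HEAD for the repository in the working directory."""
    result = subprocess.run(["git", "rev-parse", "HEAD"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def artifact_name(run_date, sharpe):
    """Build a weights filename following gatfan_<date>_Sharpe<score>.pt."""
    # Assumed formatting: YYYYMMDD date and a two-decimal Sharpe score.
    return f"gatfan_{run_date:%Y%m%d}_Sharpe{sharpe:.2f}.pt"
```

Recording `file_sha256("output/ml_dataset_latest_full.parquet")` and `current_commit()` in the run summary makes it possible to tie a checkpoint back to the exact dataset and code that produced it.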
