atft-pipeline
Manage J-Quants ingestion, feature graph generation, and cache hygiene for the ATFT-GAT-FAN dataset pipeline.
Best use case
atft-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using atft-pipeline should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/atft-pipeline/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How atft-pipeline Compares
| Feature / Agent | atft-pipeline | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Manage J-Quants ingestion, feature graph generation, and cache hygiene for the ATFT-GAT-FAN dataset pipeline.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# ATFT Pipeline Skill
## Mission
- Provision fresh or historical parquet datasets for ATFT-GAT-FAN with GPU-accelerated ETL.
- Maintain deterministic feature graphs (approx. 395 engineered factors, 307 active).
- Guard J-Quants API quota, credential sanity, and cache health to prevent training stalls.
## When To Engage
- Any request mentioning dataset builds, ETL, J-Quants, cache, RAPIDS/cuDF, or feature graph refresh.
- Pre-training sanity checks (“ensure latest dataset”, “verify cache integrity”).
- Recovery tasks (“resume interrupted dataset job”, “clean corrupted cache shards”).
## Preflight Checklist
- Confirm `nvidia-smi` reports at least one free A100 80GB GPU; fall back to CPU only if no GPU is available.
- Validate credentials: `.env` contains `JQUANTS_AUTH_EMAIL/PASSWORD` and `JQUANTS_PLAN_TIER`.
- Ensure `python -m pip install -e .` has already been run (dependencies + entry points).
- Check latest health snapshot: `tools/project-health-check.sh --section dataset`.
- Inspect existing dataset for reuse: `ls -lh output/ml_dataset_latest_full.parquet`.
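The credential item in the checklist above can be checked mechanically before any job starts. The following is a minimal sketch, not part of the repository: `missing_env_keys` is a hypothetical helper, and the key name `JQUANTS_AUTH_PASSWORD` is an assumption obtained by expanding the shorthand `JQUANTS_AUTH_EMAIL/PASSWORD`.

```python
# Hypothetical helper: report required J-Quants keys missing from .env text.
# Assumption: the shorthand JQUANTS_AUTH_EMAIL/PASSWORD expands to the two
# names below; adjust if the project uses different variable names.
REQUIRED_KEYS = (
    "JQUANTS_AUTH_EMAIL",
    "JQUANTS_AUTH_PASSWORD",
    "JQUANTS_PLAN_TIER",
)


def missing_env_keys(env_text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in .env-style text."""
    found = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        found[key.strip()] = value.strip().strip("'\"")
    return [key for key in required if not found.get(key)]
```

Calling `missing_env_keys(open(".env").read())` returns an empty list when all three keys are set. It only checks presence; it does not verify that the credentials are accepted by the API.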
## Core Playbooks
### 1. Background Five-Year Refresh (default)
1. `make dataset-check-strict` — GPU + secrets verification.
2. `make dataset-bg START=<optional> END=<optional>` — SSH-safe background run with logging in `_logs/dataset`.
3. `tail -f _logs/dataset/*.log` — monitor progress (auto prints PID + PGID).
4. `make cache-stats` — ensure cache hit-rate & size in expected bounds (<2.5 TB).
5. `python scripts/pipelines/run_full_dataset.py --dry-run` — confirm metadata integrity without rebuild.
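The size bound in step 4 can also be verified independently of `make cache-stats`. The sketch below is an illustration under assumptions: it treats 2.5 TB as decimal terabytes, and the helper names are hypothetical. It does not reproduce whatever the Makefile target actually computes.

```python
import os

# Assumption: "2.5 TB" means decimal terabytes (10**12 bytes each).
CACHE_LIMIT_BYTES = int(2.5 * 10**12)


def dir_size_bytes(root):
    """Sum the sizes of all regular files below root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def cache_within_limit(root, limit=CACHE_LIMIT_BYTES):
    """True when the directory tree is strictly smaller than the limit."""
    return dir_size_bytes(root) < limit
```

For example, `cache_within_limit("cache")` would check a `cache/` directory in the project root, assuming that is where the shards live.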
### 2. Hotfix / Forced Refresh
1. `make dataset-gpu-refresh START=YYYY-MM-DD END=YYYY-MM-DD` — bypasses cached parquet + API throttle aware.
2. `make datasets-prune` — keep latest dataset generation only.
3. `make cache-prune CACHE_TTL_DAYS=90` — evict stale graph shards to recover disk.
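Before evicting anything in step 3, it can help to preview which files a 90-day TTL would affect. This is a dry-run sketch only: `stale_files` is a hypothetical helper that uses file modification time as the age signal, which may differ from the criterion `make cache-prune` applies.

```python
import os
import time


def stale_files(root, ttl_days, now=None):
    """List files under root whose mtime is older than ttl_days (no deletion)."""
    reference = time.time() if now is None else now
    cutoff = reference - ttl_days * 86400  # seconds per day
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                stale.append(path)
    return sorted(stale)
```

`stale_files("cache", 90)` returns the candidate paths without removing them, so the list can be reviewed before running the real prune.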
### 3. Resource-Constrained Fallback
1. `make dataset-check` — relaxed diagnostics (CPU acceptable).
2. `make dataset-cpu START=YYYY-MM-DD END=YYYY-MM-DD` — chunked Pandas path.
3. `make dataset-safe-resume` — resume from last safe checkpoint if memory pressure triggered fallback.
### 4. Graph Feature Investigation
1. `python scripts/pipelines/run_full_dataset.py --inspect-graph --start YYYY-MM-DD --end YYYY-MM-DD`.
2. `python -c "import polars as pl; df = pl.read_parquet('output/ml_dataset_latest_full.parquet'); print(df.select(pl.all().is_null().sum()))"` — null audit.
3. `make cache-monitor` — per-window edge density + overlap stats.
## Observability Hooks
- `_logs/dataset/` for job logs, `cache/*.json` metadata for cache.
- `ml_dataset_latest_full_metadata.json` for column coverage & horizon alignment.
- `benchmark_output/dataset_timestamps.json` to confirm pipeline duration vs baseline (target: <42m GPU path).
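The duration target above (under 42 minutes on the GPU path) can be checked from a pair of timestamps. The schema of `dataset_timestamps.json` is not documented here, so this sketch takes two ISO 8601 strings directly; how they are read from the file is left as an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Target from the runbook: GPU path should finish in under 42 minutes.
TARGET = timedelta(minutes=42)


def run_duration(start_iso, end_iso):
    """Elapsed time between two ISO 8601 timestamps."""
    return datetime.fromisoformat(end_iso) - datetime.fromisoformat(start_iso)


def within_target(start_iso, end_iso, target=TARGET):
    """True when the run finished strictly faster than the target."""
    return run_duration(start_iso, end_iso) < target
```

Both timestamps should either carry a timezone offset or both omit it; mixing the two raises a `TypeError` when subtracting.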
## Failure Triage
- **Credential errors** → run `python scripts/pipelines/run_full_dataset.py --auth-test`.
- **CUDA OOM** → rerun with `make dataset-safe` (40GB RMM pool pre-configured).
- **API rate limits** → throttle via `make dataset-gpu REFRESH_THROTTLE=1`.
- **Corrupted parquet** → `make dataset-rebuild` then `python tools/parquet_validator.py output/ml_dataset_latest_full.parquet`.
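The triage list above is effectively a lookup from failure category to recovery commands, which an agent or wrapper script could encode as a table. A minimal sketch follows; the category keys (`credential`, `cuda_oom`, and so on) are illustrative labels chosen here, while the commands are copied from the list.

```python
# Category keys are hypothetical labels; commands come from the triage list.
TRIAGE = {
    "credential": [
        "python scripts/pipelines/run_full_dataset.py --auth-test",
    ],
    "cuda_oom": ["make dataset-safe"],
    "rate_limit": ["make dataset-gpu REFRESH_THROTTLE=1"],
    "corrupted_parquet": [
        "make dataset-rebuild",
        "python tools/parquet_validator.py output/ml_dataset_latest_full.parquet",
    ],
}


def triage_commands(category):
    """Return the recovery commands for a failure category, in run order."""
    if category not in TRIAGE:
        known = ", ".join(sorted(TRIAGE))
        raise ValueError(f"unknown category {category!r}; expected one of: {known}")
    return list(TRIAGE[category])
```

The function only returns command strings; deciding which category a given log line belongs to, and actually running the commands, is left to the caller.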
## Codex Collaboration
- Escalate complex ETL debugging or architectural refactors via `./tools/codex.sh "Diagnose dataset pipeline bottleneck"` (leverages OpenAI Codex deep reasoning).
- For long-running autonomous maintenance, schedule `./tools/codex.sh --max --exec "Perform full dataset pipeline audit"` off-hours (uses `.mcp.json` from the Codex repo for filesystem/git context).
- When Codex proposes changes, sync learnings back here and refresh dataset runbooks if any commands or defaults shift.
## Handoff Notes
- Always update `dataset_features_detail.json` if schema changes.
- Announce new dataset snapshot in `EXPERIMENT_STATUS.md` with generation timestamp and settings.
- Surface anomalies (missing tickers, new features) via `docs/data_quality/` reports.

Related Skills
etl-pipeline
Build automated ETL (Extract-Transform-Load) pipelines for construction data. Process PDFs, Excel, BIM exports. Generate reports, dashboards, and integrate with other systems. Orchestrate with Airflow or n8n.
data-pipeline
Data pipeline and ETL automation - extract, transform, load workflows for data integration and analytics
data-pipeline-manager
Design and troubleshoot robust data pipelines with comprehensive quality validation, error handling, and monitoring capabilities for bioinformatics and data processing workflows
data-engineering-data-pipeline
You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing.
book-sft-pipeline
This skill should be used when the user asks to "fine-tune on books", "create SFT dataset", "train style model", "extract ePub text", or mentions style transfer, LoRA training, book segmentation, or author voice replication.
architecture-paradigm-pipeline
Consult this skill when designing data pipelines or transformation workflows. Use when data flows through fixed sequence of transformations, stages can be independently developed and tested, parallel processing of stages is beneficial. Do not use when selecting from multiple paradigms - use architecture-paradigms first. DO NOT use when: data flow is not sequential or predictable. DO NOT use when: complex branching/merging logic dominates.
ai-content-pipeline
Build multi-step AI content creation pipelines combining image, video, audio, and text. Workflow examples: generate image -> animate -> add voiceover -> merge with music. Tools: FLUX, Veo, Kokoro TTS, OmniHuman, media merger, upscaling. Use for: YouTube videos, social media content, marketing materials, automated content. Triggers: content pipeline, ai workflow, content creation, multi-step ai, content automation, ai video workflow, generate and edit, ai content factory, automated content creation, ai production pipeline, media pipeline, content at scale
ticket-pipeline
Autonomous per-ticket pipeline that chains ticket-work, local-review, PR creation, CI watching, PR review loop, and auto-merge into a single unattended workflow with Slack notifications and policy guardrails
ml-pipeline-automation
Automate ML workflows with Airflow, Kubeflow, MLflow. Use for reproducible pipelines, retraining schedules, MLOps, or encountering task failures, dependency errors, experiment tracking issues.
ln-1000-pipeline-orchestrator
Meta-orchestrator (L0): reads kanban board, drives Stories through pipeline 300->310->400->500 in parallel via TeamCreate. Max 3 concurrent Stories. Auto squash-merge to develop on quality gate PASS.
cva-healthcare-pipeline
Complete 5-system healthcare content pipeline for regulated medical content generation. Includes LGPD data extraction (Type B), claims identification (Type A), scientific reference search (Type C), SEO optimization (Type B), and final consolidation (Type D). Validated ROI - 99.4% time reduction, 92.4% cost reduction. Use when implementing healthcare content automation, building regulated medical systems, or optimizing production pipelines.
bio-workflows-atacseq-pipeline
End-to-end ATAC-seq workflow from FASTQ files to differential accessibility and TF footprinting. Covers alignment, peak calling with MACS3, QC metrics, and optional TOBIAS footprinting. Use when running end-to-end ATAC-seq analysis from FASTQ to differential accessibility.