atft-pipeline

Manage J-Quants ingestion, feature graph generation, and cache hygiene for the ATFT-GAT-FAN dataset pipeline.

16 stars

by diegosouzapw


Best use case

atft-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using atft-pipeline should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/atft-pipeline/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/atft-pipeline/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/atft-pipeline/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How atft-pipeline Compares

| Feature / Agent | atft-pipeline | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Manage J-Quants ingestion, feature graph generation, and cache hygiene for the ATFT-GAT-FAN dataset pipeline.

Where can I find the source code?

The source is in the diegosouzapw/awesome-omni-skill repository on GitHub, under skills/data-ai/atft-pipeline/SKILL.md.

SKILL.md Source

# ATFT Pipeline Skill

## Mission
- Provision fresh or historical parquet datasets for ATFT-GAT-FAN with GPU-accelerated ETL.
- Maintain deterministic feature graphs (approx. 395 engineered factors, 307 active).
- Guard J-Quants API quota, credential sanity, and cache health to prevent training stalls.

## When To Engage
- Any request mentioning dataset builds, ETL, J-Quants, cache, RAPIDS/cuDF, or feature graph refresh.
- Pre-training sanity checks (“ensure latest dataset”, “verify cache integrity”).
- Recovery tasks (“resume interrupted dataset job”, “clean corrupted cache shards”).

## Preflight Checklist
- Confirm `nvidia-smi` reports at least one free A100 80GB GPU; fall back to CPU only if no GPU is available.
- Validate credentials: `.env` contains `JQUANTS_AUTH_EMAIL`, `JQUANTS_AUTH_PASSWORD`, and `JQUANTS_PLAN_TIER`.
- Ensure `python -m pip install -e .` has already been run (dependencies + entry points).
- Check latest health snapshot: `tools/project-health-check.sh --section dataset`.
- Inspect existing dataset for reuse: `ls -lh output/ml_dataset_latest_full.parquet`.
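
The credential item in the checklist can be automated with a small script. This is a minimal sketch, assuming `.env` holds plain `KEY=value` lines and that `JQUANTS_AUTH_EMAIL/PASSWORD` expands to `JQUANTS_AUTH_EMAIL` and `JQUANTS_AUTH_PASSWORD`. The helper name `missing_env_keys` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Assumed expansion of the keys named in the checklist.
REQUIRED_KEYS = ("JQUANTS_AUTH_EMAIL", "JQUANTS_AUTH_PASSWORD", "JQUANTS_PLAN_TIER")


def missing_env_keys(env_text: str, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or have an empty value."""
    found = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines that are not assignments.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        found[key.strip()] = value.strip().strip("'\"")
    return [key for key in required if not found.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_env_keys(text)
    print("credentials OK" if not missing else "missing: " + ", ".join(missing))
```

The script only checks that the keys are present and non-empty; whether the credentials are accepted by J-Quants still needs the pipeline's own auth test.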

## Core Playbooks

### 1. Background Five-Year Refresh (default)
1. `make dataset-check-strict` — GPU + secrets verification.
2. `make dataset-bg START=<optional> END=<optional>` — SSH-safe background run with logging in `_logs/dataset`.
3. `tail -f _logs/dataset/*.log` — monitor progress (auto prints PID + PGID).
4. `make cache-stats` — ensure cache hit-rate & size in expected bounds (<2.5 TB).
5. `python scripts/pipelines/run_full_dataset.py --dry-run` — confirm metadata integrity without rebuild.
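
The size bound in step 4 can also be checked independently of `make cache-stats`. The following is a rough sketch, not the Makefile's implementation: it assumes the cache lives in a single directory and that "2.5 TB" means decimal terabytes. The function names and the `cache` default path are illustrative.

```python
import os

TB = 1000 ** 4  # assumption: decimal terabytes; the source does not say TB vs TiB
CACHE_LIMIT_BYTES = int(2.5 * TB)


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # File vanished or is unreadable while scanning; ignore it.
                pass
    return total


def cache_within_bounds(root: str, limit_bytes: int = CACHE_LIMIT_BYTES) -> bool:
    """True when the directory is strictly smaller than the limit."""
    return dir_size_bytes(root) < limit_bytes


if __name__ == "__main__":
    cache_root = "cache"  # hypothetical location
    if os.path.isdir(cache_root):
        print(dir_size_bytes(cache_root), cache_within_bounds(cache_root))
```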

### 2. Hotfix / Forced Refresh
1. `make dataset-gpu-refresh START=YYYY-MM-DD END=YYYY-MM-DD` — bypasses cached parquet + API throttle aware.
2. `make datasets-prune` — keep latest dataset generation only.
3. `make cache-prune CACHE_TTL_DAYS=90` — evict stale graph shards to recover disk.
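
Before running a prune, it can help to preview what a 90-day TTL would evict. The sketch below assumes eviction is based on file modification time, which the source does not confirm, and it only lists candidates without deleting anything. `stale_files` and the `cache` path are hypothetical names.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def stale_files(cache_dir, ttl_days=90, now=None):
    """List files whose modification time is older than ttl_days (dry run)."""
    now = time.time() if now is None else now
    cutoff = now - ttl_days * SECONDS_PER_DAY
    return sorted(
        path
        for path in Path(cache_dir).rglob("*")
        if path.is_file() and path.stat().st_mtime < cutoff
    )


if __name__ == "__main__":
    cache_root = Path("cache")  # hypothetical location
    if cache_root.is_dir():
        for path in stale_files(cache_root, ttl_days=90):
            print(path)
```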

### 3. Resource-Constrained Fallback
1. `make dataset-check` — relaxed diagnostics (CPU acceptable).
2. `make dataset-cpu START=YYYY-MM-DD END=YYYY-MM-DD` — chunked Pandas path.
3. `make dataset-safe-resume` — resume from last safe checkpoint if memory pressure triggered fallback.

### 4. Graph Feature Investigation
1. `python scripts/pipelines/run_full_dataset.py --inspect-graph --start YYYY-MM-DD --end YYYY-MM-DD`.
2. `python -c "import polars as pl; df = pl.read_parquet('output/ml_dataset_latest_full.parquet'); print(df.select(pl.all().is_null().sum()))"` — null audit.
3. `make cache-monitor` — per-window edge density + overlap stats.

## Observability Hooks
- `_logs/dataset/` for job logs, `cache/*.json` metadata for cache.
- `ml_dataset_latest_full_metadata.json` for column coverage & horizon alignment.
- `benchmark_output/dataset_timestamps.json` to confirm pipeline duration vs baseline (target: <42m GPU path).
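
The duration check against the 42-minute target can be sketched as follows. The layout of `dataset_timestamps.json` is not documented here, so the `start` and `end` ISO-8601 fields are purely an assumption for illustration; adapt the key names to the real file.

```python
import json
from datetime import datetime

TARGET_MINUTES = 42.0  # GPU-path target quoted in the hooks above


def duration_minutes(record: dict) -> float:
    """Elapsed minutes between the assumed 'start' and 'end' fields."""
    start = datetime.fromisoformat(record["start"])
    end = datetime.fromisoformat(record["end"])
    return (end - start).total_seconds() / 60.0


def meets_target(record: dict, target: float = TARGET_MINUTES) -> bool:
    """True when the run finished strictly under the target."""
    return duration_minutes(record) < target


# Hypothetical record shape, not taken from the repository.
sample = json.loads('{"start": "2024-01-01T00:00:00", "end": "2024-01-01T00:41:30"}')
print(duration_minutes(sample), meets_target(sample))
```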

## Failure Triage
- **Credential errors** → run `python scripts/pipelines/run_full_dataset.py --auth-test`.
- **CUDA OOM** → rerun with `make dataset-safe` (40GB RMM pool pre-configured).
- **API rate limits** → throttle via `make dataset-gpu REFRESH_THROTTLE=1`.
- **Corrupted parquet** → `make dataset-rebuild` then `python tools/parquet_validator.py output/ml_dataset_latest_full.parquet`.
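
The triage table above maps naturally onto a lookup from failure class to recovery commands. In this sketch the commands are copied from the list, while the log keywords used for classification are guesses that should be tuned against real logs. `classify` and `recovery_commands` are hypothetical helpers.

```python
# Recovery commands, taken verbatim from the triage list.
TRIAGE = {
    "credential": ["python scripts/pipelines/run_full_dataset.py --auth-test"],
    "cuda_oom": ["make dataset-safe"],
    "rate_limit": ["make dataset-gpu REFRESH_THROTTLE=1"],
    "corrupted_parquet": [
        "make dataset-rebuild",
        "python tools/parquet_validator.py output/ml_dataset_latest_full.parquet",
    ],
}

# Assumed log keywords (lowercase); these are not taken from the pipeline.
PATTERNS = [
    ("cuda_oom", ("out of memory",)),
    ("rate_limit", ("429", "rate limit", "too many requests")),
    ("credential", ("401", "unauthorized", "invalid credentials")),
    ("corrupted_parquet", ("parquet magic bytes", "corrupt")),
]


def classify(log_text: str):
    """Return the first failure class whose keyword appears, else None."""
    lowered = log_text.lower()
    for failure, keywords in PATTERNS:
        if any(keyword in lowered for keyword in keywords):
            return failure
    return None


def recovery_commands(log_text: str) -> list:
    """Suggested commands for a log excerpt; empty when nothing matches."""
    return TRIAGE.get(classify(log_text), [])
```

For example, `recovery_commands("RuntimeError: CUDA out of memory")` returns `["make dataset-safe"]`. The helper only suggests commands; running them is left to the operator.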

## Codex Collaboration
- Escalate complex ETL debugging or architectural refactors via `./tools/codex.sh "Diagnose dataset pipeline bottleneck"` (leverages OpenAI Codex deep reasoning).
- For long-running autonomous maintenance, schedule `./tools/codex.sh --max --exec "Perform full dataset pipeline audit"` off-hours (uses `.mcp.json` from the Codex repo for filesystem/git context).
- When Codex proposes changes, sync learnings back here and refresh dataset runbooks if any commands or defaults shift.

## Handoff Notes
- Always update `dataset_features_detail.json` if schema changes.
- Announce new dataset snapshot in `EXPERIMENT_STATUS.md` with generation timestamp and settings.
- Surface anomalies (missing tickers, new features) via `docs/data_quality/` reports.
