ML Experiment Tracking
Track machine learning experiments with reproducible parameters and metrics
Best use case
ML Experiment Tracking is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using ML Experiment Tracking can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/ml-experiment-tracking/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How ML Experiment Tracking Compares
| Feature / Agent | ML Experiment Tracking | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Track machine learning experiments with reproducible parameters and metrics
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# ML Experiment Tracking Skill

Track machine learning experiments with reproducible parameters and metrics.

## Trigger Conditions

- Model configuration changes or hyperparameter updates
- New experiment run initiated
- User invokes with "track experiment" or "compare models"

## Input Contract

- **Required:** Experiment parameters (model, hyperparameters, data)
- **Required:** Evaluation metrics
- **Optional:** Baseline comparison, hypothesis

## Output Contract

- Experiment log entry with full reproducibility info
- Comparison table against baseline/prior runs
- Recommendation on whether to promote or iterate

## Tool Permissions

- **Read:** Model configs, training data metadata, metric logs
- **Write:** Experiment logs, comparison reports
- **Execute:** Metric collection commands

## Execution Steps

1. Record experiment hypothesis and parameters
2. Capture environment (dependencies, data version, code commit)
3. Execute or observe training run
4. Collect metrics and artifacts
5. Compare against baseline and prior experiments
6. Recommend: promote, iterate, or abandon

## Success Criteria

- Experiment is fully reproducible from logged parameters
- Metrics compared against baseline
- Clear recommendation with rationale

## Escalation Rules

- Escalate if model performance degrades vs. baseline
- Escalate if data drift detected in training set
- Escalate if experiment requires new infrastructure

## Example Invocations

**Input:** "Compare the BERT-base and DistilBERT models for our classification task"

**Output:** Experiment log: BERT-base (F1: 0.92, latency: 45ms, size: 440MB) vs DistilBERT (F1: 0.89, latency: 12ms, size: 260MB). Recommendation: DistilBERT for production (3% F1 trade-off for 73% latency improvement). Promote to staging for A/B test.
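The execution steps in the SKILL.md source (record parameters, capture the environment, compare against a baseline, recommend) could be sketched roughly as follows. This is a minimal illustration, not the skill's actual implementation: `ExperimentRecord`, `capture_environment`, `log_experiment`, `relative_changes`, `recommend`, the metric key names, and the 5% F1 tolerance are all assumptions made for the example. The metric values are the ones quoted in the example invocation.

```python
import json
import platform
import subprocess
import sys
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ExperimentRecord:
    """One log entry. Field names are illustrative, not defined by the skill."""
    name: str
    hypothesis: str
    params: dict
    metrics: dict
    environment: dict = field(default_factory=dict)
    logged_at: str = ""


def capture_environment() -> dict:
    """Record interpreter, platform, and git commit ('unknown' outside a repo)."""
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "commit": commit,
    }


def log_experiment(record: ExperimentRecord) -> str:
    """Fill in environment and timestamp, then return the entry as a JSON line."""
    record.environment = capture_environment()
    record.logged_at = datetime.now(timezone.utc).isoformat()
    return json.dumps(asdict(record), sort_keys=True)


def relative_changes(candidate: dict, baseline: dict) -> dict:
    """Relative change per shared metric: (candidate - baseline) / baseline."""
    return {
        key: (candidate[key] - baseline[key]) / baseline[key]
        for key in candidate.keys() & baseline.keys()
        if baseline[key] != 0
    }


def recommend(changes: dict, max_f1_drop: float = 0.05) -> str:
    """Hypothetical rule: promote if F1 stays within tolerance and latency does not regress."""
    f1_ok = changes.get("f1", 0.0) >= -max_f1_drop
    latency_ok = changes.get("latency_ms", 0.0) <= 0.0
    return "promote" if f1_ok and latency_ok else "iterate"


# Figures taken from the example invocation in the SKILL.md source.
bert = ExperimentRecord(
    name="bert-base", hypothesis="baseline", params={"model": "BERT-base"},
    metrics={"f1": 0.92, "latency_ms": 45, "size_mb": 440},
)
distil = ExperimentRecord(
    name="distilbert", hypothesis="smaller model, similar F1",
    params={"model": "DistilBERT"},
    metrics={"f1": 0.89, "latency_ms": 12, "size_mb": 260},
)

entry = log_experiment(distil)
changes = relative_changes(distil.metrics, bert.metrics)
decision = recommend(changes)
print(entry)
print({k: round(v, 4) for k, v in sorted(changes.items())}, decision)
```

With these inputs the relative F1 change is about -3.3% and the relative latency change is about -73.3%, which matches the "3% F1 trade-off for 73% latency improvement" in the example output. A real setup would likely also record dependency versions and a data version, which this sketch leaves out.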
Related Skills
asset-tracking
Use when managing asset metadata, dependencies, and delivery workflows across teams.
analytics-tracking
When the user wants to set up, improve, or audit analytics tracking and measurement. Also use when the user mentions "set up tracking," "GA4," "Google Analytics," "conversion tracking," "event tracking," "UTM parameters," "tag manager," "GTM," "analytics implementation," or "tracking plan." For A/B test measurement, see ab-test-setup.
prediction-tracking
Track and evaluate AI predictions over time to assess accuracy. Use when reviewing past predictions to determine if they came true, failed, or remain uncertain.
aiwf:error-tracking
Add Sentry v8 error tracking and performance monitoring to your project services. Use this skill when adding error handling, creating new controllers, instrumenting cron jobs, or tracking database performance. ALL ERRORS MUST BE CAPTURED TO SENTRY - no exceptions.
artifact-tracking
Token-efficient tracking for AI orchestration. CLI-first for status updates (~50 tokens), agent fallback for complex ops (~1KB). Use when: updating task status, querying blockers, creating progress files, validating phases.
agentic-kpi-tracking
Track and measure agentic coding KPIs for ZTE progression. Use when measuring workflow effectiveness, tracking Size/Attempts/Streak/Presence metrics, or assessing readiness for autonomous operation.
bgo
Automates the complete Blender build-go workflow, from building and packaging your extension/add-on to removing old versions, installing, enabling, and launching Blender for quick testing and iteration.
obsidian-daily
Manage Obsidian Daily Notes via obsidian-cli. Create and open daily notes, append entries (journals, logs, tasks, links), read past notes by date, and search vault content. Handles relative dates like "yesterday", "last Friday", "3 days ago".
obsidian-additions
Create supplementary materials attached to existing notes: experiments, meetings, reports, logs, conspectuses, practice sessions, annotations, AI outputs, links collections. Two-step process: (1) create aggregator space, (2) create concrete addition in base/additions/. INVOKE when user wants to attach any supplementary material to an existing note. Triggers: "addition", "create addition", "experiment", "meeting notes", "report", "conspectus", "log", "practice", "annotations", "links", "link collection", "аддишн", "конспект", "встреча", "отчёт", "эксперимент", "практика", "аннотации", "ссылки", "добавь к заметке".
observe
Query and manage Observe using the Observe CLI. Use when the user wants to run OPAL queries, list datasets, manage objects, or interact with their Observe tenant from the command line.
observability-review
AI agent that analyzes operational signals (metrics, logs, traces, alerts, SLO/SLI reports) from observability platforms (Prometheus, Datadog, New Relic, CloudWatch, Grafana, Elastic) and produces practical, risk-aware triage and recommendations. Use when reviewing system health, investigating performance issues, analyzing monitoring data, evaluating service reliability, or providing SRE analysis of operational metrics. Distinguishes between critical issues requiring action, items needing investigation, and informational observations requiring no action.
nvidia-nim
NVIDIA NIM inference microservices for deploying AI models with OpenAI-compatible APIs, self-hosted or cloud