machine-learning-ops-ml-pipeline

Design and implement a complete ML pipeline for: $ARGUMENTS

31,392 stars
Complexity: easy

About this skill

This skill enables an AI agent to design and implement complete machine learning (ML) pipelines tailored to the requirements supplied in $ARGUMENTS. It applies MLOps principles to guide the agent in creating robust, scalable, automated workflows. Part of the 'antigravity-awesome-skills' collection, it covers data processing and ML lifecycle management, giving the agent a structured approach with best practices, checklists, and strategic guidance for multi-agent MLOps orchestration, so ML systems are developed efficiently and reproducibly.

Best use case

Use this skill when an AI agent needs to design or implement an ML pipeline, or to orchestrate MLOps work across multiple agents. It also serves as a reference when the agent needs guidance, best practices, or checklists for setting up or improving MLOps processes.

Design and implement a complete ML pipeline for: $ARGUMENTS

The AI agent will produce a structured design, plan, or detailed guidance for an ML pipeline, incorporating MLOps best practices and potentially outlining a multi-agent orchestration strategy. This output may include workflow diagrams, tool recommendations, task breakdowns, and a list of considerations for robust implementation.

Practical example

Example input

Design and implement a complete ML pipeline for: real-time anomaly detection in network traffic for cybersecurity.

Example output

## ML Pipeline Design for Real-time Anomaly Detection

**1. Goal Clarification:**
   - **Objective:** Detect unusual patterns in network traffic indicative of cyber threats (e.g., DDoS attacks, intrusion attempts).
   - **Constraints:** Real-time processing, low latency inference, high data volume, continuous model retraining, explainability.
   - **Required Inputs:** Network flow data (e.g., NetFlow, IPFIX), packet inspection data, historical labeled anomaly data.

**2. Pipeline Stages:**
   a. **Data Ingestion (Streaming):**
      - **Tools:** Apache Kafka, AWS Kinesis, Google Pub/Sub.
      - **Process:** Ingest raw network logs and metrics from various network devices.
   b. **Data Preprocessing & Feature Engineering:**
      - **Tools:** Apache Flink, Spark Streaming, custom Python/Rust services.
      - **Process:** Parse logs, extract features (e.g., packet size, destination ports, traffic volume over time windows, entropy of source IPs), normalize data. Handle missing values.
   c. **Model Training & Retraining (Batch/Online):**
      - **Tools:** Kubeflow, MLflow, AWS Sagemaker, Azure ML, Google AI Platform.
      - **Process:** Train anomaly detection models (e.g., Isolation Forest, Autoencoders, LSTM for sequential data). Implement MLOps for automated retraining triggers (e.g., data drift, performance degradation).
   d. **Model Deployment & Inference (Real-time):**
      - **Tools:** Kubernetes (with KServe/Seldon Core), AWS Lambda, Azure Functions, GCP Cloud Run.
      - **Process:** Deploy trained models as microservices. Implement low-latency inference endpoints. Integrate with threat intelligence platforms.
   e. **Model Monitoring & Alerting:**
      - **Tools:** Prometheus/Grafana, ELK Stack, custom dashboards.
      - **Process:** Monitor model performance (precision, recall, F1 for anomalies), data drift, concept drift, latency, resource utilization. Set up alerts for critical anomalies and model health issues.
   f. **Feedback Loop & Data Labeling:**
      - **Tools:** Custom dashboards, annotation tools.
      - **Process:** Collect feedback on detected anomalies, allow security analysts to label true/false positives, use this labeled data for future retraining.

**3. MLOps Best Practices:**
   - **Version Control:** Git for code, DVC/MLflow for data/models.
   - **CI/CD:** Jenkins, GitLab CI, GitHub Actions for automated testing, building, and deployment.
   - **Experiment Tracking:** MLflow, Weights & Biases for tracking experiments, metrics, and model artifacts.
   - **Reproducibility:** Docker/Kubernetes for containerization.
   - **Security:** Secure model APIs, data encryption, access control.

**4. Multi-Agent Orchestration (Conceptual):**
   - **Data Ingestion Agent:** Responsible for ensuring continuous, reliable data flow.
   - **Preprocessing Agent:** Transforms raw data into features, handles data quality.
   - **Training Agent:** Manages model training, hyperparameter tuning, and versioning.
   - **Deployment Agent:** Handles model deployment to production, ensuring scalability and uptime.
   - **Monitoring Agent:** Observes model performance, data integrity, and triggers alerts/retraining.
   - **Feedback Agent:** Collects user feedback and integrates it into the data pipeline for model improvement.
   Each agent could communicate via message queues (e.g., Kafka) and orchestrate tasks using a central workflow engine (e.g., Apache Airflow, Kubeflow Pipelines).
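For illustration, here is a minimal sketch of the training-and-scoring core of stage 2(c), using scikit-learn's IsolationForest; the feature layout, synthetic data, and flagging logic are hypothetical stand-ins for real engineered network-flow features:

```python
# Minimal sketch: train an Isolation Forest on historical flow features
# and score new traffic windows. Feature layout is illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Stand-in features: [bytes/s, packets/s, unique dst ports, src-IP entropy]
historical = rng.normal(loc=[5e5, 400, 12, 3.1],
                        scale=[1e5, 80, 4, 0.4], size=(10_000, 4))

model = IsolationForest(n_estimators=200, contamination=0.01, random_state=42)
model.fit(historical)

# A few live windows that look like a port scan (many dst ports, low entropy).
new_windows = rng.normal(loc=[5e6, 4_000, 900, 0.2],
                         scale=[1e5, 80, 40, 0.05], size=(5, 4))
scores = model.decision_function(new_windows)  # lower = more anomalous
flags = model.predict(new_windows)             # -1 = anomaly, 1 = normal
for s, f in zip(scores, flags):
    print(f"score={s:+.3f} anomaly={f == -1}")
```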

When to use this skill

  • Use this skill when you need the AI agent to:
      - Strategize, plan, or outline a complete MLOps workflow.
      - Generate best practices, guidance, or checklists for ML pipeline development.
      - Design a system for orchestrating ML tasks using a multi-agent approach.
      - Tackle complex challenges across the ML lifecycle, from data ingestion to deployment and monitoring.

When not to use this skill

  • Do not use this skill when:
      - The task is entirely unrelated to machine learning pipelines or MLOps orchestration.
      - You need a domain-specific tool or capability outside the scope of ML pipeline design and implementation (e.g., direct data analysis, or API calls to external services that are not part of the MLOps workflow).

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/machine-learning-ops-ml-pipeline/SKILL.md --create-dirs "https://raw.githubusercontent.com/sickn33/antigravity-awesome-skills/main/plugins/antigravity-awesome-skills-claude/skills/machine-learning-ops-ml-pipeline/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/machine-learning-ops-ml-pipeline/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How machine-learning-ops-ml-pipeline Compares

| Feature / Agent | machine-learning-ops-ml-pipeline | Standard Approach |
| --- | --- | --- |
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |

Frequently Asked Questions

What does this skill do?

It directs the agent to design and implement a complete ML pipeline for the requirements supplied in $ARGUMENTS, covering data ingestion, feature engineering, training, deployment, and monitoring.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

The source code is in the sickn33/antigravity-awesome-skills repository on GitHub (see the installation URL above).

SKILL.md Source

# Machine Learning Pipeline - Multi-Agent MLOps Orchestration

Design and implement a complete ML pipeline for: $ARGUMENTS

## Use this skill when

- Working on ML pipeline design or multi-agent MLOps orchestration tasks and workflows
- Needing guidance, best practices, or checklists for ML pipelines and multi-agent MLOps orchestration

## Do not use this skill when

- The task is unrelated to ML pipelines or multi-agent MLOps orchestration
- You need a different domain or tool outside this scope

## Instructions

- Clarify goals, constraints, and required inputs.
- Apply relevant best practices and validate outcomes.
- Provide actionable steps and verification.
- If detailed examples are required, open `resources/implementation-playbook.md`.

## Thinking

This workflow orchestrates multiple specialized agents to build a production-ready ML pipeline following modern MLOps best practices. The approach emphasizes:

- **Phase-based coordination**: Each phase builds upon previous outputs, with clear handoffs between agents
- **Modern tooling integration**: MLflow/W&B for experiments, Feast/Tecton for features, KServe/Seldon for serving
- **Production-first mindset**: Every component designed for scale, monitoring, and reliability
- **Reproducibility**: Version control for data, models, and infrastructure
- **Continuous improvement**: Automated retraining, A/B testing, and drift detection

The multi-agent approach ensures each aspect is handled by domain experts:
- Data engineers handle ingestion and quality
- Data scientists design features and experiments
- ML engineers implement training pipelines
- MLOps engineers handle production deployment
- Observability engineers ensure monitoring

## Phase 1: Data & Requirements Analysis

<Task>
subagent_type: data-engineer
prompt: |
  Analyze and design data pipeline for ML system with requirements: $ARGUMENTS

  Deliverables:
  1. Data source audit and ingestion strategy:
     - Source systems and connection patterns
     - Schema validation using Pydantic/Great Expectations
     - Data versioning with DVC or lakeFS
     - Incremental loading and CDC strategies

  2. Data quality framework:
     - Profiling and statistics generation
     - Anomaly detection rules
     - Data lineage tracking
     - Quality gates and SLAs

  3. Storage architecture:
     - Raw/processed/feature layers
     - Partitioning strategy
     - Retention policies
     - Cost optimization

  Provide implementation code for critical components and integration patterns.
</Task>
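As a concrete reference for the schema-validation deliverable above, here is a minimal sketch using Pydantic; the record fields and constraints are illustrative, not a fixed contract:

```python
# Minimal sketch: schema validation at ingestion with Pydantic.
# Field names and constraints are illustrative.
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError

class FlowRecord(BaseModel):
    timestamp: datetime
    src_ip: str
    dst_port: int = Field(ge=0, le=65535)
    bytes_sent: int = Field(ge=0)

raw = {"timestamp": "2024-01-01T00:00:00Z", "src_ip": "10.0.0.1",
       "dst_port": 443, "bytes_sent": 5120}
try:
    record = FlowRecord(**raw)  # valid rows pass through to the pipeline
except ValidationError as err:
    print(err)                  # invalid rows go to a quarantine/dead-letter path
```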

<Task>
subagent_type: data-scientist
prompt: |
  Design feature engineering and model requirements for: $ARGUMENTS
  Using data architecture from: {phase1.data-engineer.output}

  Deliverables:
  1. Feature engineering pipeline:
     - Transformation specifications
     - Feature store schema (Feast/Tecton)
     - Statistical validation rules
     - Handling strategies for missing data/outliers

  2. Model requirements:
     - Algorithm selection rationale
     - Performance metrics and baselines
     - Training data requirements
     - Evaluation criteria and thresholds

  3. Experiment design:
     - Hypothesis and success metrics
     - A/B testing methodology
     - Sample size calculations
     - Bias detection approach

  Include feature transformation code and statistical validation logic.
</Task>
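To make the feature-transformation deliverable concrete, a minimal sketch of windowed feature engineering with a simple statistical validation gate in pandas; column names and bounds are illustrative:

```python
# Minimal sketch: rolling-window features plus a basic validation gate.
# Column names and the null-fraction bound are illustrative.
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.date_range("2024-01-01", periods=6, freq="min"),
    "bytes": [500, 520, 480, 90_000, 510, 495],
}).set_index("timestamp")

features = pd.DataFrame({
    "bytes_mean_5m": df["bytes"].rolling("5min").mean(),
    "bytes_zscore": (df["bytes"] - df["bytes"].mean()) / df["bytes"].std(),
})

# Quality gate: fail the batch if too many windows are incomplete.
assert features["bytes_mean_5m"].isna().mean() < 0.5, "too many incomplete windows"
print(features.round(2))
```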

## Phase 2: Model Development & Training

<Task>
subagent_type: ml-engineer
prompt: |
  Implement training pipeline based on requirements: {phase1.data-scientist.output}
  Using data pipeline: {phase1.data-engineer.output}

  Build comprehensive training system:
  1. Training pipeline implementation:
     - Modular training code with clear interfaces
     - Hyperparameter optimization (Optuna/Ray Tune)
     - Distributed training support (Horovod/PyTorch DDP)
     - Cross-validation and ensemble strategies

  2. Experiment tracking setup:
     - MLflow/Weights & Biases integration
     - Metric logging and visualization
     - Artifact management (models, plots, data samples)
     - Experiment comparison and analysis tools

  3. Model registry integration:
     - Version control and tagging strategy
     - Model metadata and lineage
     - Promotion workflows (dev -> staging -> prod)
     - Rollback procedures

  Provide complete training code with configuration management.
</Task>
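A minimal sketch of the experiment-tracking setup described above, using MLflow's Python API; the experiment name, parameters, and metric are placeholders for the real training run:

```python
# Minimal sketch: experiment tracking with MLflow around a training run.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

mlflow.set_experiment("anomaly-detection")  # illustrative experiment name
with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 8}
    model = RandomForestClassifier(**params, random_state=0).fit(X_tr, y_tr)
    mlflow.log_params(params)
    mlflow.log_metric("f1", f1_score(y_te, model.predict(X_te)))
    mlflow.sklearn.log_model(model, "model")  # artifact for the registry
```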

<Task>
subagent_type: python-pro
prompt: |
  Optimize and productionize ML code from: {phase2.ml-engineer.output}

  Focus areas:
  1. Code quality and structure:
     - Refactor for production standards
     - Add comprehensive error handling
     - Implement proper logging with structured formats
     - Create reusable components and utilities

  2. Performance optimization:
     - Profile and optimize bottlenecks
     - Implement caching strategies
     - Optimize data loading and preprocessing
     - Memory management for large-scale training

  3. Testing framework:
     - Unit tests for data transformations
     - Integration tests for pipeline components
     - Model quality tests (invariance, directional)
     - Performance regression tests

  Deliver production-ready, maintainable code with full test coverage.
</Task>
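One of the model-quality tests called for above (an invariance test) might look like this sketch in pytest style; the scoring function is a stand-in for the real model:

```python
# Minimal sketch: a model-quality invariance test in pytest style.
import numpy as np

def predict(features: np.ndarray) -> np.ndarray:
    # Stand-in scoring function; replace with the trained model's predict().
    return (features.sum(axis=1) > 10).astype(int)

def test_prediction_invariant_to_tiny_input_noise():
    x = np.array([[4.0, 7.0], [1.0, 2.0]])
    noisy = x + np.random.default_rng(0).normal(scale=1e-9, size=x.shape)
    # Invariance property: negligible input noise must not flip predictions.
    assert (predict(x) == predict(noisy)).all()
```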

## Phase 3: Production Deployment & Serving

<Task>
subagent_type: mlops-engineer
prompt: |
  Design production deployment for models from: {phase2.ml-engineer.output}
  With optimized code from: {phase2.python-pro.output}

  Implementation requirements:
  1. Model serving infrastructure:
     - REST/gRPC APIs with FastAPI/TorchServe
     - Batch prediction pipelines (Airflow/Kubeflow)
     - Stream processing (Kafka/Kinesis integration)
     - Model serving platforms (KServe/Seldon Core)

  2. Deployment strategies:
     - Blue-green deployments for zero downtime
     - Canary releases with traffic splitting
     - Shadow deployments for validation
     - A/B testing infrastructure

  3. CI/CD pipeline:
     - GitHub Actions/GitLab CI workflows
     - Automated testing gates
     - Model validation before deployment
     - ArgoCD for GitOps deployment

  4. Infrastructure as Code:
     - Terraform modules for cloud resources
     - Helm charts for Kubernetes deployments
     - Docker multi-stage builds for optimization
     - Secret management with Vault/Secrets Manager

  Provide complete deployment configuration and automation scripts.
</Task>
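As a reference point for the serving requirement above, a minimal sketch of a REST scoring endpoint with FastAPI; the route, payload fields, and scoring rule are illustrative:

```python
# Minimal sketch: a low-latency REST scoring endpoint with FastAPI.
# Endpoint path, payload fields, and the scoring stub are illustrative.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    bytes_per_s: float
    packets_per_s: float
    unique_dst_ports: int

@app.post("/v1/score")
def score(features: Features) -> dict:
    # Replace with real model inference (e.g., a warm in-process model).
    return {"anomaly": features.unique_dst_ports > 500}

# Run locally: uvicorn service:app --host 0.0.0.0 --port 8080
```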

<Task>
subagent_type: kubernetes-architect
prompt: |
  Design Kubernetes infrastructure for ML workloads from: {phase3.mlops-engineer.output}

  Kubernetes-specific requirements:
  1. Workload orchestration:
     - Training job scheduling with Kubeflow
     - GPU resource allocation and sharing
     - Spot/preemptible instance integration
     - Priority classes and resource quotas

  2. Serving infrastructure:
     - HPA/VPA for autoscaling
     - KEDA for event-driven scaling
     - Istio service mesh for traffic management
     - Model caching and warm-up strategies

  3. Storage and data access:
     - PVC strategies for training data
     - Model artifact storage with CSI drivers
     - Distributed storage for feature stores
     - Cache layers for inference optimization

  Provide Kubernetes manifests and Helm charts for entire ML platform.
</Task>
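For orientation, a minimal sketch of one serving-infrastructure piece (an HPA for a model-serving Deployment) created with the official Kubernetes Python client; the names, namespace, and scaling targets are assumptions, and in practice this would normally live in Helm charts or manifests as the task specifies:

```python
# Minimal sketch: create an HPA for a model-serving Deployment via the
# Kubernetes Python client. Names/namespace/targets are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

hpa = client.V2HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="model-server-hpa"),
    spec=client.V2HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V2CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="model-server"),
        min_replicas=2,
        max_replicas=20,
        metrics=[client.V2MetricSpec(
            type="Resource",
            resource=client.V2ResourceMetricSource(
                name="cpu",
                target=client.V2MetricTarget(type="Utilization",
                                             average_utilization=70)))],
    ),
)
client.AutoscalingV2Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="ml-serving", body=hpa)
```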

## Phase 4: Monitoring & Continuous Improvement

<Task>
subagent_type: observability-engineer
prompt: |
  Implement comprehensive monitoring for ML system deployed in: {phase3.mlops-engineer.output}
  Using Kubernetes infrastructure: {phase3.kubernetes-architect.output}

  Monitoring framework:
  1. Model performance monitoring:
     - Prediction accuracy tracking
     - Latency and throughput metrics
     - Feature importance shifts
     - Business KPI correlation

  2. Data and model drift detection:
     - Statistical drift detection (KS test, PSI)
     - Concept drift monitoring
     - Feature distribution tracking
     - Automated drift alerts and reports

  3. System observability:
     - Prometheus metrics for all components
     - Grafana dashboards for visualization
     - Distributed tracing with Jaeger/Zipkin
     - Log aggregation with ELK/Loki

  4. Alerting and automation:
     - PagerDuty/Opsgenie integration
     - Automated retraining triggers
     - Performance degradation workflows
     - Incident response runbooks

  5. Cost tracking:
     - Resource utilization metrics
     - Cost allocation by model/experiment
     - Optimization recommendations
     - Budget alerts and controls

  Deliver monitoring configuration, dashboards, and alert rules.
</Task>
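A minimal sketch of the KS-test drift check named above, using scipy; the window sizes, synthetic distributions, and alert threshold are illustrative:

```python
# Minimal sketch: statistical drift detection with the two-sample KS test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)  # training-time feature dist
live = rng.normal(loc=0.4, scale=1.1, size=5_000)       # recent production window

stat, p_value = ks_2samp(reference, live)
if p_value < 0.01:  # illustrative alert threshold
    print(f"drift detected (KS={stat:.3f}, p={p_value:.2e}) -> alert/retrain")
```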

## Configuration Options

- **experiment_tracking**: mlflow | wandb | neptune | clearml
- **feature_store**: feast | tecton | databricks | custom
- **serving_platform**: kserve | seldon | torchserve | triton
- **orchestration**: kubeflow | airflow | prefect | dagster
- **cloud_provider**: aws | azure | gcp | multi-cloud
- **deployment_mode**: realtime | batch | streaming | hybrid
- **monitoring_stack**: prometheus | datadog | newrelic | custom
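One way to capture these options in code is a small typed configuration object; the class below is a hypothetical convenience for this sketch, not an API defined by the skill:

```python
# Minimal sketch: the configuration options above as a typed config object.
# The class name and defaults are hypothetical.
from dataclasses import dataclass
from typing import Literal

@dataclass
class PipelineConfig:
    experiment_tracking: Literal["mlflow", "wandb", "neptune", "clearml"] = "mlflow"
    feature_store: Literal["feast", "tecton", "databricks", "custom"] = "feast"
    serving_platform: Literal["kserve", "seldon", "torchserve", "triton"] = "kserve"
    orchestration: Literal["kubeflow", "airflow", "prefect", "dagster"] = "kubeflow"
    cloud_provider: Literal["aws", "azure", "gcp", "multi-cloud"] = "aws"
    deployment_mode: Literal["realtime", "batch", "streaming", "hybrid"] = "realtime"
    monitoring_stack: Literal["prometheus", "datadog", "newrelic", "custom"] = "prometheus"

config = PipelineConfig(deployment_mode="streaming")
print(config)
```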

## Success Criteria

1. **Data Pipeline Success**:
   - < 0.1% data quality issues in production
   - Automated data validation passing 99.9% of time
   - Complete data lineage tracking
   - Sub-second feature serving latency

2. **Model Performance**:
   - Meeting or exceeding baseline metrics
   - < 5% performance degradation before retraining
   - Successful A/B tests with statistical significance
   - No undetected model drift > 24 hours

3. **Operational Excellence**:
   - 99.9% uptime for model serving
   - < 200ms p99 inference latency
   - Automated rollback within 5 minutes
   - Complete observability with < 1 minute alert time

4. **Development Velocity**:
   - < 1 hour from commit to production
   - Parallel experiment execution
   - Reproducible training runs
   - Self-service model deployment

5. **Cost Efficiency**:
   - < 20% infrastructure waste
   - Optimized resource allocation
   - Automatic scaling based on load
   - Spot instance utilization > 60%

## Final Deliverables

Upon completion, the orchestrated pipeline will provide:
- End-to-end ML pipeline with full automation
- Comprehensive documentation and runbooks
- Production-ready infrastructure as code
- Complete monitoring and alerting system
- CI/CD pipelines for continuous improvement
- Cost optimization and scaling strategies
- Disaster recovery and rollback procedures

Related Skills

All from sickn33/antigravity-awesome-skills:

ml-pipeline-workflow
Complete end-to-end MLOps pipeline orchestration from data preparation through model deployment. (Machine Learning Operations (MLOps), Claude)

mlops-engineer
Build comprehensive ML pipelines, experiment tracking, and model registries with MLflow, Kubeflow, and modern MLOps tools. (Machine Learning Operations (MLOps), Claude)

deployment-pipeline-design
Architecture patterns for multi-stage CI/CD pipelines with approval gates and deployment strategies. (DevOps & Infrastructure, Claude)

data-engineering-data-pipeline
You are a data pipeline architecture expert specializing in scalable, reliable, and cost-effective data pipelines for batch and streaming data processing. (Text Analysis, Claude)
