ci-fix-pipeline

Enable self-healing mode: retry loop with strategy rotation and inbox-wait (default: false)

16 stars

bydiegosouzapw

View on GitHub Installation ↓

Best use case

ci-fix-pipeline is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Enable self-healing mode: retry loop with strategy rotation and inbox-wait (default: false)

Teams using ci-fix-pipeline should expect a more consistent output, faster repeated execution, less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$curl -o ~/.claude/skills/ci-fix-pipeline/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/devops/ci-fix-pipeline/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/ci-fix-pipeline/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How ci-fix-pipeline Compares

Feature / Agent	ci-fix-pipeline	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Enable self-healing mode: retry loop with strategy rotation and inbox-wait (default: false)

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# CI Fix Pipeline

## Overview

Autonomous pipeline that fetches GitHub Actions CI failures and fixes them -- ALL failures by
default. No selective mode. Failures beyond `max_fix_files` trigger sub-ticket creation and
continue with the remaining fixable failures.

**v2.0 Self-Healing Mode (OMN-2829)**: When `--self-heal` is enabled, the pipeline enters a
multi-attempt repair loop with strategy rotation. Each attempt uses a different fix strategy
(targeted -> broad lint -> regenerate). Between attempts, the pipeline uses inbox-wait (not
polling) to detect CI re-run results. The `node_ci_repair_effect` ONEX node orchestrates this
loop.

**Workflow (standard)**: Fetch CI failures -> Slack start -> Classify + Sub-ticket large-scope -> Fix ALL fixable -> Commit -> Slack complete -> ModelSkillResult

**Workflow (self-healing)**: `_bin/ci-status.sh` -> detect failures -> attempt 1 (targeted fix) -> push -> inbox-wait -> if still failing -> attempt 2 (broad lint fix) -> push -> inbox-wait -> if still failing -> attempt 3 (regenerate) -> push -> inbox-wait -> inbox notification on success/exhaustion

**Announce at start:** "I'm using the ci-fix-pipeline skill to fix CI failures."

## Policy Defaults

```yaml
policy:
  fix_all: true                     # always fix all -- no selective mode
  max_fix_files: 10                 # files in scope trigger sub-ticket (not skip)
  fix_preexisting_in_touched: true  # fix pre-existing issues in touched files
  slack_on_start: true              # notify Slack before fixing
  slack_on_complete: true           # notify Slack with fix summary
  max_attempts: 3                   # self-heal: retry budget (1-3)
  self_heal: false                  # self-heal: enable multi-attempt loop
```

## Quick Start

```
/ci-fix-pipeline                              # Fix all CI failures on current branch
/ci-fix-pipeline --pr 42                      # Fix failures for PR #42
/ci-fix-pipeline --ticket-id OMN-1234         # Include ticket context in Slack messages
/ci-fix-pipeline --no-slack                   # Suppress Slack notifications
/ci-fix-pipeline --skip-patterns "test_*"     # Skip jobs/steps matching pattern
/ci-fix-pipeline --max-fix-files 20           # Raise the sub-ticket threshold
/ci-fix-pipeline --self-heal --pr 42          # Self-healing mode with retry loop
/ci-fix-pipeline --self-heal --max-attempts 2 # Limit to 2 attempts
```

## Arguments

| Argument | Default | Description |
|----------|---------|-------------|
| `--pr <number>` | none | PR number for CI failure fetch |
| `--branch <ref>` | current | Branch name for CI failure fetch |
| `--skip-patterns <patterns>` | none | Comma-separated job/step name patterns to skip |
| `--max-fix-files <n>` | 10 | Files-in-scope threshold; above this -> sub-ticket |
| `--no-slack` | false | Disable Slack notifications |
| `--ticket-id <id>` | none | Context ticket ID for Slack messages |
| `--self-heal` | false | Enable self-healing retry loop with strategy rotation |
| `--max-attempts <n>` | 3 | Maximum repair attempts (only with --self-heal) |

## Execution Phases

### Phase 1: Fetch CI Failures

Dispatch to polymorphic agent:

```
Task(
  subagent_type="onex:polymorphic-agent",
  description="Fetch CI failures for ci-fix-pipeline",
  prompt="Fetch CI failures using the ci-failures skill.

    Run: ${CLAUDE_PLUGIN_ROOT}/skills/ci-failures/ci-quick-review {N | branch_name}

    Return the raw JSON from ci-quick-review (pass through unchanged).
    The response has structure:
    {\"repository\": str, \"pr_number\": int,
     \"summary\": {\"total\": N, \"critical\": N, \"major\": N, \"minor\": N},
     \"failures\": [{\"workflow\": str, \"job\": str, \"job_id\": str, \"step\": str,
                    \"severity\": str, \"workflow_id\": str, \"job_url\": str}],
     \"fetched_at\": str}"
)
```

**Branch resolution**: After Phase 1, the orchestrator must have a branch name for use in later
phases. Resolve as follows:
- If `--branch` was provided: use that value directly.
- If `--pr` was provided (no `--branch`): run `gh pr view {N} --json headRefName --jq '.headRefName'`
  to get the branch name.
- If neither was provided: use the current branch (already known from `ci-quick-review` invocation).

### Phase 2: Slack Start Notification

If `slack_on_start: true` and Slack is available, notify:

```
ci-fix-pipeline starting
  PR/Branch: {context}
  Failures found: {N} ({critical} critical, {major} major, {minor} minor)
  Ticket: {ticket_id if provided}
```

Skip silently if Slack unavailable (non-blocking).

### Phase 3: Classify and Route Failures

For each failure:

1. **Skip check**: Does the failure `job` or `step` name match any `--skip-patterns` pattern?
   - Yes → mark as `skipped`, record reason
   - No → continue

2. **Scope check**: Does the failure `job` touch more than `max_fix_files` files?
   Determine scope by inspecting the job logs (via `gh api repos/{repo}/actions/jobs/{job_id}/logs`)
   to count affected files. If log inspection is unavailable, treat scope as within threshold.
   - Scope > max_fix_files → mark as `capped`, create Linear sub-ticket inline (see Sub-Ticket Creation below), continue to next failure
   - Scope ≤ max_fix_files → add to fix queue

Result: failures split into `skipped`, `capped`, and `to_fix` buckets.

### Phase 4: Fix Failures

Dispatch one polymorphic agent per severity group (critical first, then major, then minor) for all `to_fix` failures:

```
Task(
  subagent_type="onex:polymorphic-agent",
  description="Fix {severity} CI failures",
  prompt="**AGENT REQUIREMENT**: You MUST be a polymorphic-agent.

    Fix the following {severity} CI failures:

    {failures_list}

    Instructions:
    1. Read each affected file
    2. Apply the fix
    3. If fix_preexisting_in_touched is true: also fix any pre-existing lint/mypy
       issues in those files (only files already in scope — not a full repo scan)
    4. Do NOT commit

    Return classification for each failure:
    {\"fixed\": [failure_ids], \"architectural\": [failure_ids], \"unfixable\": [failure_ids]}"
)
```

**Post-fix architectural check**: For each failure returned as `architectural` by the fix agent:
- Send a Slack message (via `HandlerSlackWebhook` in omnibase_infra) describing the architectural
  change and asking for human approval. Include the failure description, the proposed fix, and
  the files affected. Wait for a reply (poll or webhook callback).
- Approved (human replies "approve" or "yes") → apply fix; Declined (any other reply or timeout after 10 min) → mark `escalated`

### Phase 5: Commit Fixes

**Skip this phase if `fixed` count is 0** (no code was changed — all failures were skipped, capped, or unfixable). Proceed directly to Phase 6 with `commit: null` in ModelSkillResult.

Otherwise, orchestrator stages and commits inline (no dispatch needed):

```bash
git add <changed_files>
git commit -m "fix(ci): resolve {N} {severity} failures [{ticket_id}]"
```

Commit message format: `fix(ci): resolve {N} {severity} failures [{ticket_id}]`
where `{severity}` is the highest severity fixed (e.g., `critical`, `major`, `minor`) and
`{ticket_id}` is the value from `--ticket-id` (or omitted if not provided).

### Sub-Ticket Creation

For each `capped` failure (scope > max_fix_files), created inline during Phase 3:

```python
# current_team: resolved from --ticket-id parent team (via mcp__linear-server__get_issue),
# or from the first team returned by mcp__linear-server__list_teams if no ticket is provided.
mcp__linear-server__create_issue(
    title=f"CI: {failure.job} — {failure.step} (large scope)",
    team=current_team,
    description=f"""
## CI Failure Requiring Human Review

**Job**: {failure.job}
**Step**: {failure.step}
**Severity**: {failure.severity}
**Scope**: Exceeds max_fix_files={max_fix_files} threshold
**Triggered by**: ci-fix-pipeline run for {ticket_id or branch}
**Job URL**: {failure.job_url}

## Definition of Done

- [ ] All affected files reviewed and fixed
- [ ] CI passing on {branch}
    """,
    parentId=ticket_id if ticket_id else None,
    labels=["ci-failure", "needs-human"]
)
```

### Phase 6: Slack Complete Notification

If `slack_on_complete: true`, notify with diff summary:

```
ci-fix-pipeline complete
  Fixed: {N} failures
  Skipped: {M} failures (patterns: {patterns})
  Sub-tickets created: {K} (large-scope failures)
  Escalated: {L} (architectural — declined or timed out)
  Ticket: {ticket_id if provided}
  Branch: {branch}
```

## ModelSkillResult Output

Emits to `~/.claude/skill-results/{context_id}/ci-fix-pipeline.json`
where `{context_id}` is the Claude session ID (from `$CLAUDE_SESSION_ID` env var) or `default`
if the session ID is unavailable:

```json
{
  "status": "completed|capped|escalated|failed",
  "fixed_count": 5,
  "skipped_count": 1,
  "capped_count": 2,
  "escalated_count": 0,
  "unfixable_count": 0,
  "sub_tickets": ["OMN-XXXX", "OMN-XYYY"],
  "commit": "abc1234",
  "branch": "feature/my-branch",
  "ticket_id": "OMN-1234"
}
```

**Status values**:
- `completed` — All fixable failures resolved
- `capped` — Some failures deferred to sub-tickets; fixed what was in scope
- `escalated` — One or more architectural failures declined or timed out; human review required
- `failed` — CI fetch failed or commit failed; pipeline halted

## Failure Handling

| Error | Behavior |
|-------|----------|
| CI fetch fails | Hard exit with `status: failed`, reason in output |
| Fix agent fails | Log failure, mark as `unfixable`, continue with others |
| Sub-ticket creation fails | Log warning, continue (non-blocking) |
| Slack unavailable | Skip notification, continue (non-blocking) |
| Commit fails | Exit with `status: failed`, leave changes staged |

## Sub-Ticket Threshold Policy

The `max_fix_files` threshold is a **routing decision**, not a skip:

- Failures within threshold: fixed autonomously
- Failures above threshold: sub-ticket created, pipeline continues with remaining

This ensures large-scope failures are never silently dropped — they are tracked in Linear.

## Self-Healing Mode (OMN-2829)

When `--self-heal` is enabled, the pipeline wraps the standard fix flow in a multi-attempt
retry loop orchestrated by `node_ci_repair_effect`.

### Architecture

```
CI fails
  -> _bin/ci-status.sh --pr N --repo ORG/REPO
  -> parse failure JSON
  -> node_ci_repair_effect.execute_effect(event)
  -> for attempt in 1..max_attempts:
       strategy = RepairStrategy.for_attempt(attempt)
       -> dispatch fix agent with strategy-specific prompt
       -> git add + commit + push
       -> inbox-wait for CI re-run result (not polling)
       -> if CI passing: inbox notification "repaired" -> exit
       -> if CI failing: rotate strategy, continue loop
  -> if all attempts exhausted: inbox notification "exhausted" -> exit
```

### Strategy Rotation

Each attempt uses a progressively broader fix strategy:

| Attempt | Strategy | Description |
|---------|----------|-------------|
| 1 | `targeted_fix` | Parse error logs, fix only the specific failing lines/files |
| 2 | `broad_lint_fix` | Run ruff/mypy auto-fix on all files touched by the PR |
| 3 | `regenerate_and_fix` | Rewrite failing sections, apply all auto-fixers |

### CI Status Extraction

Self-healing mode uses `_bin/ci-status.sh` instead of the heavier `ci-failures/ci-quick-review`
for fast, structured CI status checks between attempts:

```bash
# Fetch CI status as structured JSON
${CLAUDE_PLUGIN_ROOT}/_bin/ci-status.sh --pr 42 --repo OmniNode-ai/omniclaude
```

Output:
```json
{
  "status": "failing",
  "pr_number": 42,
  "repo": "OmniNode-ai/omniclaude",
  "branch": "jonah/omn-2829-self-healing-ci",
  "run_id": "12345678",
  "failed_jobs": [
    {
      "job_id": "56174634733",
      "job_name": "lint / ruff",
      "step": "Run ruff check",
      "conclusion": "failure",
      "log_excerpt": "..."
    }
  ],
  "failure_summary": "1 job(s) failed: lint / ruff",
  "fetched_at": "2026-02-26T13:00:00Z"
}
```

### Inbox-Wait Pattern

After each fix attempt pushes code, the pipeline waits for CI results using inbox-wait
rather than fixed-interval polling:

1. Push fix commit to branch
2. Call `node_ci_repair_effect.wait_for_ci_rerun()` which:
   - Polls `_bin/ci-status.sh` at 30s intervals
   - Detects when a new `run_id` appears (different from pre-push run)
   - Waits for terminal state (`passing` or `failing`)
   - Times out after 5 minutes (configurable)
3. If `passing`: record success, send inbox notification, exit
4. If `failing`: rotate strategy, begin next attempt

### Self-Healing Fix Dispatch

```
Task(
  subagent_type="onex:polymorphic-agent",
  description="ci-fix-pipeline: self-heal attempt {N}/{max} for PR #{pr_number}",
  prompt="**SELF-HEALING CI REPAIR -- Attempt {N}/{max}**

    Strategy: {strategy_name}
    {strategy_prompt}

    CI Failure Details:
    {failure_json}

    Branch: {branch}
    Repo: {repo}

    Instructions:
    1. Read the failure logs carefully
    2. Apply the fix strategy described above
    3. Stage and commit with message:
       fix(ci): self-heal attempt {N} -- {strategy_name} [{ticket_id}]
    4. Push to the branch

    Report: files changed, fix description, confidence level."
)
```

### ONEX Node: node_ci_repair_effect

**Tier**: EVENT_BUS+
**Type**: Effect node (external I/O)
**Location**: `plugins/onex/hooks/lib/node_ci_repair_effect.py`

Provides:
- `execute_effect(event)` -- Initialize repair run state
- `record_attempt_result(run_state, attempt, ...)` -- Record attempt outcome
- `finalize_with_error(run_state, error)` -- Record error and notify
- `fetch_ci_status(pr, repo, branch)` -- Wrapper around `_bin/ci-status.sh`
- `wait_for_ci_rerun(pr, repo, branch, prev_run_id)` -- Inbox-wait for new CI run
- `send_inbox_notification(run_state, message)` -- Write to `~/.claude/inbox/`
- `save_repair_state(run_state)` / `load_repair_state(run_id)` -- State persistence

### Self-Healing ModelSkillResult

When `--self-heal` is active, the ModelSkillResult includes additional fields:

```json
{
  "status": "repaired|exhausted|failed",
  "repair_run_id": "ci-repair-42-1740000000",
  "attempts_used": 2,
  "max_attempts": 3,
  "strategy_used": "broad_lint_fix",
  "fixed_count": 5,
  "skipped_count": 0,
  "commit": "abc1234",
  "branch": "jonah/omn-2829-self-healing-ci",
  "ticket_id": "OMN-2829",
  "inbox_notification_sent": true
}
```

**Status values (self-healing)**:
- `repaired` -- CI fixed within the retry budget
- `exhausted` -- All attempts used, CI still failing
- `failed` -- Error during repair (fetch failed, commit failed, etc.)

### Verification

To verify self-healing works end-to-end:

1. Push a deliberately failing commit (e.g., syntax error, failing test)
2. Run: `/ci-fix-pipeline --self-heal --pr <N> --ticket-id OMN-2829`
3. Confirm:
   - Attempt 1 uses `targeted_fix` strategy
   - If still failing, attempt 2 uses `broad_lint_fix`
   - If still failing, attempt 3 uses `regenerate_and_fix`
   - Inbox notification sent on success or exhaustion
   - State persisted to `~/.claude/state/ci-repair/`

## See Also

- `ci-failures` skill -- fetch and analyze CI failures (read-only)
- `ci-watch` skill -- poll CI status and auto-fix (OMN-2523)
- `local-review` skill -- review and fix local code changes
- `ticket-pipeline` skill -- end-to-end ticket pipeline (Phase 4 invokes ci-watch)
- `_bin/ci-status.sh` -- lightweight CI status extraction script
- `node_ci_repair_effect` -- ONEX effect node for self-healing orchestration
- `HandlerSlackWebhook` in omnibase_infra -- Slack delivery infrastructure

Related Skills

cicd-pipeline

from diegosouzapw/awesome-omni-skill

Use when setting up GitHub Actions, automated testing, build checks, or deployment workflows. Triggers on "CI/CD", "pipeline", "GitHub Actions", "deploy", "automated testing", "build check".

cicd-pipeline-builder

from diegosouzapw/awesome-omni-skill

Generate CI/CD pipelines for GitHub Actions, GitLab CI, Jenkins with best practices

ci-cd-pipelines

from diegosouzapw/awesome-omni-skill

Auto-activates when user mentions CI/CD, GitHub Actions, pipeline, continuous integration, deployment automation, or workflow files. Creates automated testing and deployment pipelines.

bio-liquid-biopsy-pipeline

from diegosouzapw/awesome-omni-skill

Cell-free DNA analysis pipeline from plasma sequencing to tumor monitoring. Preprocesses cfDNA reads, analyzes fragment patterns, estimates tumor fraction from sWGS, and optionally detects mutations from targeted panels. Use when analyzing liquid biopsy samples for cancer detection or monitoring.

azure-pipelines

from diegosouzapw/awesome-omni-skill

Use when validating Azure DevOps pipeline changes for the VS Code build. Covers queueing builds, checking build status, viewing logs, and iterating on pipeline YAML changes without waiting for full CI runs.

azure-pipelines-validator

from diegosouzapw/awesome-omni-skill

Comprehensive toolkit for validating, linting, and securing Azure DevOps Pipeline configurations.

azure-pipelines-generator

from diegosouzapw/awesome-omni-skill

Comprehensive toolkit for generating best practice Azure DevOps Pipelines following current standards and conventions. Use this skill when creating new Azure Pipelines, implementing CI/CD workflows, or building deployment pipelines.

android-playstore-pipeline

from diegosouzapw/awesome-omni-skill

Complete end-to-end Android Play Store deployment pipeline setup in one command

agent-deployment-pipeline

from diegosouzapw/awesome-omni-skill

Implement CI/CD pipelines for AI agent deployment with evaluation gates. Use for GitHub Actions workflows, GitOps with ArgoCD, container image building, and automated testing. Triggers on "CI/CD", "pipeline", "GitHub Actions", "GitOps", "ArgoCD", "deployment automation", "continuous deployment", or when implementing safe agent release workflows.

ado-pipeline-best-practices

from diegosouzapw/awesome-omni-skill

Azure DevOps pipeline best practices, patterns, and industry standards

image-to-3d-pipeline

from diegosouzapw/awesome-omni-skill

Transformez une image 2D en modèle 3D animé prêt pour le web ou le jeu en moins de 30 minutes, en utilisant le workflow Dilum Sanjaya (Hunyuan3D + Mixamo). Use when: **Créer un personnage 3D pour un site web** - Mascotte, avatar, illustration interactive; **Prototyper un asset de jeu** - Character design, props, environnements; **Produire du contenu marketing 3D** - Produits rotatifs, personnages animés; **Convertir des illustrations existantes** - Logo, mascotte, character design → 3D; **Tes...

skill-pipeline

from diegosouzapw/awesome-omni-skill

リサーチから Skill/Subagent 作成までを1コマンドで実行するパイプライン。トピックを指定すると、Webリサーチ → ベストプラクティス抽出 → Skill/Subagent生成 → バリデーションまで自動実行。