writing-agent-relay-workflows
Use when building multi-agent workflows with the relay broker-sdk - covers the WorkflowBuilder API, DAG step dependencies, agent definitions, step output chaining via {{steps.X.output}}, verification gates, evidence-based completion, owner decisions, dedicated channels, dynamic channel management (subscribe/unsubscribe/mute/unmute), swarm patterns, error handling, event listeners, step sizing rules, authoring best practices, and the lead+workers team pattern for complex steps
Best use case
writing-agent-relay-workflows is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using writing-agent-relay-workflows should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/writing-agent-relay-workflows/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
### Overview
The relay broker-sdk workflow system orchestrates multiple AI agents (Claude, Codex, Gemini, Aider, Goose) through typed DAG-based workflows. Workflows can be written in **TypeScript** (preferred), **Python**, or **YAML**.
**Language preference:** TypeScript > Python > YAML. Use TypeScript unless the project is Python-only or a simple config-driven workflow suits YAML.
**Pattern selection:** Do not default to `dag` blindly. If the job needs a different swarm/workflow type, consult the `choosing-swarm-patterns` skill when available and select the pattern that best matches the coordination problem.
### When to Use
- Building multi-agent workflows with step dependencies
- Orchestrating different AI CLIs (claude, codex, gemini, aider, goose)
- Creating DAG, pipeline, fan-out, or other swarm patterns
- Needing verification gates, retries, or step output chaining
- Dynamic channel management: agents joining/leaving/muting channels mid-workflow
### Quick Reference
```typescript
import { workflow } from '@agent-relay/sdk/workflows';

const result = await workflow('my-workflow')
  .description('What this workflow does')
  .pattern('dag') // or 'pipeline', 'fan-out', etc.
  .channel('wf-my-workflow') // dedicated channel (auto-generated if omitted)
  .maxConcurrency(3)
  .timeout(3_600_000) // global timeout (ms)
  .agent('lead', { cli: 'claude', role: 'Architect', retries: 2 })
  .agent('worker', { cli: 'codex', role: 'Implementer', retries: 2 })
  .step('plan', {
    agent: 'lead',
    task: `Analyze the codebase and produce a plan.`,
    retries: 2,
    verification: { type: 'output_contains', value: 'PLAN_COMPLETE' },
  })
  .step('implement', {
    agent: 'worker',
    task: `Implement based on this plan:\n{{steps.plan.output}}`,
    dependsOn: ['plan'],
    verification: { type: 'exit_code' },
  })
  .onError('retry', { maxRetries: 2, retryDelayMs: 10_000 })
  .run({ cwd: process.cwd() });

console.log('Result:', result.status);
```
### ⚡ Parallelism — Design for Speed
#### Cross-Workflow Parallelism: Wave Planning
```bash
# BAD — sequential (14 hours for 27 workflows at ~30 min each)
agent-relay run workflows/34-sst-wiring.ts
agent-relay run workflows/35-env-config.ts
agent-relay run workflows/36-loading-states.ts
# ... one at a time
# GOOD — parallel waves (3-4 hours for 27 workflows)
# Wave 1: independent infra (parallel)
agent-relay run workflows/34-sst-wiring.ts &
agent-relay run workflows/35-env-config.ts &
agent-relay run workflows/36-loading-states.ts &
agent-relay run workflows/37-responsive.ts &
wait
git add -A && git commit -m "Wave 1"
# Wave 2: testing (parallel — independent test suites)
agent-relay run workflows/40-unit-tests.ts &
agent-relay run workflows/41-integration-tests.ts &
agent-relay run workflows/42-e2e-tests.ts &
wait
git add -A && git commit -m "Wave 2"
```
#### Declare File Scope for Planning
```typescript
workflow('48-comparison-mode')
  .packages(['web', 'core']) // monorepo packages touched
  .isolatedFrom(['49-feedback-system']) // explicitly safe to parallelize
  .requiresBefore(['46-admin-dashboard']); // explicit ordering constraint
```
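As a rough illustration of how declared scope could feed wave planning, the sketch below decides whether two workflows may share a wave. It does not call the SDK: `WorkflowScope` and `canShareWave` are hypothetical names, and it assumes `requiresBefore` means "these workflows must finish first" and that disjoint package lists imply independence.

```typescript
// Hypothetical planning metadata mirroring .packages(), .isolatedFrom(), .requiresBefore().
interface WorkflowScope {
  name: string;
  packages: string[]; // monorepo packages touched
  isolatedFrom: string[]; // workflows declared safe to run alongside
  requiresBefore: string[]; // workflows that must finish first (assumed meaning)
}

// Two workflows may share a wave when neither must precede the other and they
// are either explicitly isolated or touch disjoint packages (an assumed rule).
function canShareWave(a: WorkflowScope, b: WorkflowScope): boolean {
  const ordered =
    a.requiresBefore.indexOf(b.name) !== -1 || b.requiresBefore.indexOf(a.name) !== -1;
  if (ordered) return false; // one must finish before the other starts
  const isolated =
    a.isolatedFrom.indexOf(b.name) !== -1 || b.isolatedFrom.indexOf(a.name) !== -1;
  if (isolated) return true; // the author declared the pair safe
  return a.packages.every((pkg) => b.packages.indexOf(pkg) === -1);
}

const comparisonMode: WorkflowScope = {
  name: '48-comparison-mode',
  packages: ['web', 'core'],
  isolatedFrom: ['49-feedback-system'],
  requiresBefore: ['46-admin-dashboard'],
};
const feedbackSystem: WorkflowScope = {
  name: '49-feedback-system',
  packages: ['web'],
  isolatedFrom: [],
  requiresBefore: [],
};
const adminDashboard: WorkflowScope = {
  name: '46-admin-dashboard',
  packages: ['admin'],
  isolatedFrom: [],
  requiresBefore: [],
};

console.log(canShareWave(comparisonMode, feedbackSystem)); // true
console.log(canShareWave(comparisonMode, adminDashboard)); // false
```

The explicit `isolatedFrom` declaration wins over the package overlap on `web`, which is the reason to declare it at all.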
#### Within-Workflow Parallelism
```typescript
// BAD — unnecessary sequential chain
.step('fix-component-a', { agent: 'worker', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'worker', dependsOn: ['fix-component-a'] }) // why wait?
// GOOD — parallel fan-out, merge at the end
.step('fix-component-a', { agent: 'impl-1', dependsOn: ['review'] })
.step('fix-component-b', { agent: 'impl-2', dependsOn: ['review'] }) // same dep = parallel
.step('verify-all', { agent: 'reviewer', dependsOn: ['fix-component-a', 'fix-component-b'] })
```
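To see why a shared dependency yields parallel execution, the following sketch groups steps into levels where every step's dependencies sit in earlier levels. It is a plain topological layering, not the runner's scheduler; `StepDeps` and `parallelLevels` are hypothetical names.

```typescript
interface StepDeps {
  name: string;
  dependsOn?: string[];
}

// Groups steps into levels: each step's dependencies all appear in earlier levels,
// so steps within one level could start at the same time.
function parallelLevels(steps: StepDeps[]): string[][] {
  const done: string[] = [];
  const levels: string[][] = [];
  let remaining = steps.slice();
  while (remaining.length > 0) {
    const ready = remaining.filter((s) =>
      (s.dependsOn || []).every((dep) => done.indexOf(dep) !== -1),
    );
    if (ready.length === 0) {
      const stuck = remaining.map((s) => s.name).join(', ');
      throw new Error('Cycle or unknown dependency among: ' + stuck);
    }
    const names = ready.map((s) => s.name);
    levels.push(names);
    names.forEach((n) => done.push(n));
    remaining = remaining.filter((s) => names.indexOf(s.name) === -1);
  }
  return levels;
}

const chained = parallelLevels([
  { name: 'review' },
  { name: 'fix-component-a', dependsOn: ['review'] },
  { name: 'fix-component-b', dependsOn: ['fix-component-a'] },
]);
const fannedOut = parallelLevels([
  { name: 'review' },
  { name: 'fix-component-a', dependsOn: ['review'] },
  { name: 'fix-component-b', dependsOn: ['review'] },
  { name: 'verify-all', dependsOn: ['fix-component-a', 'fix-component-b'] },
]);

console.log(chained.length); // 3 levels, one step each
console.log(fannedOut[1]); // [ 'fix-component-a', 'fix-component-b' ]
```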
### Failure Prevention
#### 1. Do not use raw top-level `await`
```ts
async function runWorkflow() {
  const result = await workflow('my-workflow')
    // ...
    .run({ cwd: process.cwd() });
  console.log('Workflow status:', result.status);
}

runWorkflow().catch((error) => {
  console.error(error);
  process.exit(1);
});
```
#### 3. Keep final verification boring and deterministic
```bash
grep -Eq "foo|bar|baz" file.ts
```
#### 6. Be explicit about shell requirements
```bash
/opt/homebrew/bin/bash workflows/your-workflow/execute.sh --wave 2
```
### End-to-End Bug Fix Workflows
- **Capture the original failure**
  - Reproduce the bug first in a deterministic or evidence-capturing step
  - Save exact commands, logs, status codes, or screenshots/artifacts
- **State the acceptance contract**
  - Define the exact end-to-end success criteria before implementation
  - Include the real entrypoint a user would run
- **Implement the fix**
- **Rebuild / reinstall from scratch**
  - Do not trust dirty local state
  - Prefer a clean environment when install/bootstrap behavior is involved
- **Run targeted regression checks**
  - Unit/integration tests are helpful but not sufficient by themselves
- **Run a full end-to-end validation**
  - Use the real CLI / API / install path
  - Prefer a clean environment (Docker, sandbox, cloud workspace, Daytona, etc.) for install/runtime issues
- **Compare before vs after evidence**
  - Show that the original failure no longer occurs
- **Record residual risks**
  - Call out what was not covered
- disposable sandbox / cloud workspace
- Docker / containerized environment
- fresh local shell with isolated paths
- compares candidate validation environments
- defines the acceptance contract
- chooses the best swarm pattern
- then authors the final fix/validation workflow
### Key Concepts
#### Verification Gates
```typescript
verification: { type: 'exit_code' } // preferred for code-editing steps
verification: { type: 'output_contains', value: 'DONE' } // optional accelerator
verification: { type: 'file_exists', value: 'src/out.ts' } // deterministic file check
```
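The three gate types can be modeled as a small discriminated union. The sketch below is an illustration only, not the runner's implementation: `StepResult` and `passes` are hypothetical, it assumes `exit_code` passes on exit status 0, and it replaces the real filesystem check with a list of paths.

```typescript
type VerificationCheck =
  | { type: 'exit_code' }
  | { type: 'output_contains'; value: string }
  | { type: 'file_exists'; value: string };

interface StepResult {
  exitCode: number;
  output: string;
  files: string[]; // paths present after the step; stands in for a real filesystem check
}

// Returns true when the step result satisfies the gate.
function passes(check: VerificationCheck, result: StepResult): boolean {
  if (check.type === 'exit_code') return result.exitCode === 0; // assumed: 0 means success
  if (check.type === 'output_contains') return result.output.indexOf(check.value) !== -1;
  return result.files.indexOf(check.value) !== -1; // file_exists
}

// An agent that exits cleanly without writing anything.
const silentExit: StepResult = { exitCode: 0, output: '', files: [] };

console.log(passes({ type: 'exit_code' }, silentExit)); // true
console.log(passes({ type: 'file_exists', value: 'src/out.ts' }, silentExit)); // false
```

The second result is why file-writing steps benefit from a `file_exists` gate: a clean exit alone says nothing about what landed on disk.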
#### DAG Dependencies
```typescript
.step('fix-types', { agent: 'worker', dependsOn: ['review'], ... })
.step('fix-tests', { agent: 'worker', dependsOn: ['review'], ... })
.step('final', { agent: 'lead', dependsOn: ['fix-types', 'fix-tests'], ... })
```
#### SDK API
```typescript
// Subscribe an agent to additional channels post-spawn
relay.subscribe({ agent: 'security-auditor', channels: ['review-pr-456'] });
// Unsubscribe — agent leaves the channel entirely
relay.unsubscribe({ agent: 'security-auditor', channels: ['general'] });
// Mute — agent stays subscribed (history access) but messages are NOT injected into PTY
relay.mute({ agent: 'security-auditor', channel: 'review-pr-123' });
// Unmute — resume PTY injection
relay.unmute({ agent: 'security-auditor', channel: 'review-pr-123' });
```
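The difference between unsubscribing and muting is easy to blur, so here is a toy in-memory model of the semantics described in the comments above. `ChannelModel` is hypothetical and does not talk to the broker; it only tracks who is a member of a channel and who would receive PTY injection.

```typescript
// Toy model of channel membership and muting; not the broker itself.
class ChannelModel {
  private subscribed: Record<string, string[]> = {}; // channel -> member agents
  private muted: Record<string, string[]> = {}; // channel -> muted agents

  subscribe(agent: string, channels: string[]): void {
    channels.forEach((ch) => {
      const list = this.subscribed[ch] || (this.subscribed[ch] = []);
      if (list.indexOf(agent) === -1) list.push(agent);
    });
  }

  unsubscribe(agent: string, channels: string[]): void {
    channels.forEach((ch) => {
      this.subscribed[ch] = (this.subscribed[ch] || []).filter((a) => a !== agent);
      this.muted[ch] = (this.muted[ch] || []).filter((a) => a !== agent);
    });
  }

  mute(agent: string, channel: string): void {
    const list = this.muted[channel] || (this.muted[channel] = []);
    if (list.indexOf(agent) === -1) list.push(agent);
  }

  unmute(agent: string, channel: string): void {
    this.muted[channel] = (this.muted[channel] || []).filter((a) => a !== agent);
  }

  // Agents that remain on the channel (muted agents included, so history stays readable).
  members(channel: string): string[] {
    return (this.subscribed[channel] || []).slice();
  }

  // Agents whose PTY would receive a message posted to the channel.
  injectionTargets(channel: string): string[] {
    const muted = this.muted[channel] || [];
    return this.members(channel).filter((a) => muted.indexOf(a) === -1);
  }
}

const model = new ChannelModel();
model.subscribe('lead', ['review-pr-123']);
model.subscribe('security-auditor', ['review-pr-123']);
model.mute('security-auditor', 'review-pr-123');

console.log(model.members('review-pr-123')); // [ 'lead', 'security-auditor' ]
console.log(model.injectionTargets('review-pr-123')); // [ 'lead' ]
```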
#### Events
```typescript
relay.onChannelSubscribed = (agent, channels) => {
  /* ... */
};
relay.onChannelUnsubscribed = (agent, channels) => {
  /* ... */
};
relay.onChannelMuted = (agent, channel) => {
  /* ... */
};
relay.onChannelUnmuted = (agent, channel) => {
  /* ... */
};
```
### Agent Definition
```typescript
.agent('name', {
  cli: 'claude' | 'codex' | 'gemini' | 'aider' | 'goose' | 'opencode' | 'droid',
  role?: string,
  preset?: 'lead' | 'worker' | 'reviewer' | 'analyst',
  retries?: number,
  model?: string,
  interactive?: boolean, // default: true
})
```
#### Model Constants
```typescript
import { ClaudeModels, CodexModels, GeminiModels } from '@agent-relay/config';
.agent('planner', { cli: 'claude', model: ClaudeModels.OPUS }) // not 'opus'
.agent('worker', { cli: 'claude', model: ClaudeModels.SONNET }) // not 'sonnet'
.agent('coder', { cli: 'codex', model: CodexModels.GPT_5_4 }) // not 'gpt-5.4'
```
### Step Definition
#### Agent Steps
```typescript
.step('name', {
  agent: string,
  task: string, // supports {{var}} and {{steps.NAME.output}}
  dependsOn?: string[],
  verification?: VerificationCheck,
  retries?: number,
})
```
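For intuition about the two placeholder forms in `task`, the sketch below shows one way such templates could be expanded. It is an assumption about behavior, not the runner's code: `renderTask` is a hypothetical helper, and leaving unknown `{{var}}` placeholders untouched is a choice made here for illustration.

```typescript
// Hypothetical renderer: expands {{steps.NAME.output}} and {{var}} placeholders.
function renderTask(
  template: string,
  stepOutputs: Record<string, string>,
  vars: Record<string, string>,
): string {
  return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (placeholder: string, key: string) => {
    const stepMatch = /^steps\.([\w-]+)\.output$/.exec(key);
    const stepName = stepMatch ? stepMatch[1] : undefined;
    if (stepName !== undefined) {
      const output = stepOutputs[stepName];
      if (output === undefined) {
        // Output only exists once the referenced step has finished.
        throw new Error('No output for step "' + stepName + '"; is it listed in dependsOn?');
      }
      return output;
    }
    const value = vars[key];
    return value === undefined ? placeholder : value; // unknown vars are left as-is
  });
}

const rendered = renderTask(
  'Implement {{feature}} based on this plan:\n{{steps.plan.output}}',
  { plan: '1. Add types\n2. Add handler' },
  { feature: 'pending status' },
);
console.log(rendered);
```

Throwing on a missing step output mirrors the advice to pair every `{{steps.X.output}}` reference with `dependsOn: ['X']`.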
#### Deterministic Steps (Shell Commands)
```typescript
.step('verify-files', {
  type: 'deterministic',
  command: 'test -f src/auth.ts && echo "FILE_EXISTS"',
  dependsOn: ['implement'],
  captureOutput: true,
  failOnError: true,
})
```
### Common Patterns
#### Interactive Team (lead + workers on shared channel)
```typescript
.agent('lead', {
  cli: 'claude',
  model: ClaudeModels.OPUS,
  role: 'Architect and reviewer: assigns work, reviews, posts feedback',
  retries: 1,
  // No preset: interactive by default
})
.agent('impl-new', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Creates new files. Listens on channel for assignments and feedback.',
  retries: 2,
  // No preset: interactive, receives channel messages
})
.agent('impl-modify', {
  cli: 'codex',
  model: CodexModels.O3,
  role: 'Edits existing files. Listens on channel for assignments and feedback.',
  retries: 2,
})
// All three share the same dependsOn, so they start concurrently (no deadlock)
.step('lead-coordinate', {
  agent: 'lead',
  dependsOn: ['context'],
  task: `You are the lead on #channel. Workers: impl-new, impl-modify.
    Post the plan. Assign files. Review their work. Post feedback if needed.
    Workers iterate based on your feedback. Exit when all files are correct.`,
})
.step('impl-new-work', {
  agent: 'impl-new',
  dependsOn: ['context'], // same dep as lead = parallel start
  task: `You are impl-new on #channel. Wait for the lead's plan.
    Create files as assigned. Report completion. Fix issues from feedback.`,
})
.step('impl-modify-work', {
  agent: 'impl-modify',
  dependsOn: ['context'], // same dep as lead = parallel start
  task: `You are impl-modify on #channel. Wait for the lead's plan.
    Edit files as assigned. Report completion. Fix issues from feedback.`,
})
// Downstream gates on lead (lead exits when satisfied)
.step('verify', { type: 'deterministic', dependsOn: ['lead-coordinate'], ... })
```
#### Pipeline (sequential handoff)
```typescript
.pattern('pipeline')
.step('analyze', { agent: 'analyst', task: '...' })
.step('implement', { agent: 'dev', task: '{{steps.analyze.output}}', dependsOn: ['analyze'] })
.step('test', { agent: 'tester', task: '{{steps.implement.output}}', dependsOn: ['implement'] })
```
#### Error Handling
```typescript
.onError('fail-fast') // stop on first failure (default)
.onError('continue') // skip failed branches, continue others
.onError('retry', { maxRetries: 3, retryDelayMs: 5000 })
```
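The difference between `fail-fast` and `continue` can be seen in a small simulation. This is a sketch under stated assumptions, not the runner's logic: it walks steps sequentially in declaration order (assumed to be dependency-ordered), does not model `retry`, and `simulate` and its types are hypothetical.

```typescript
type ErrorPolicy = 'fail-fast' | 'continue';
type StepStatus = 'completed' | 'failed' | 'skipped';

interface PlannedStep {
  name: string;
  dependsOn?: string[];
}

// Walks steps in declaration order and records what each policy would do.
function simulate(
  steps: PlannedStep[],
  failing: string[],
  policy: ErrorPolicy,
): Record<string, StepStatus> {
  const status: Record<string, StepStatus> = {};
  let halted = false;
  steps.forEach((step) => {
    const blocked = (step.dependsOn || []).some((dep) => status[dep] !== 'completed');
    if (halted || blocked) {
      status[step.name] = 'skipped'; // stopped workflow, or an upstream step did not complete
    } else if (failing.indexOf(step.name) !== -1) {
      status[step.name] = 'failed';
      if (policy === 'fail-fast') halted = true;
    } else {
      status[step.name] = 'completed';
    }
  });
  return status;
}

const plan: PlannedStep[] = [
  { name: 'review' },
  { name: 'fix-types', dependsOn: ['review'] },
  { name: 'fix-tests', dependsOn: ['review'] },
  { name: 'final', dependsOn: ['fix-types', 'fix-tests'] },
];

const failFast = simulate(plan, ['fix-types'], 'fail-fast');
const keepGoing = simulate(plan, ['fix-types'], 'continue');
console.log(failFast['fix-tests']); // skipped
console.log(keepGoing['fix-tests']); // completed
```

In both runs `final` is skipped, because one of its dependencies failed; only the independent `fix-tests` branch behaves differently.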
### Multi-File Edit Pattern
When a workflow needs to modify multiple existing files, **use one agent step per file** with a deterministic verify gate after each. Agents reliably edit 1-2 files per step but fail on 4+.
```yaml
steps:
  - name: read-types
    type: deterministic
    command: cat src/types.ts
    captureOutput: true
  - name: edit-types
    agent: dev
    dependsOn: [read-types]
    task: |
      Edit src/types.ts. Current contents:
      {{steps.read-types.output}}
      Add 'pending' to the Status union type.
      Only edit this one file.
    verification:
      type: exit_code
  - name: verify-types
    type: deterministic
    dependsOn: [edit-types]
    command: 'if git diff --quiet src/types.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    failOnError: true
  - name: read-service
    type: deterministic
    dependsOn: [verify-types]
    command: cat src/service.ts
    captureOutput: true
  - name: edit-service
    agent: dev
    dependsOn: [read-service]
    task: |
      Edit src/service.ts. Current contents:
      {{steps.read-service.output}}
      Add a handlePending() method.
      Only edit this one file.
    verification:
      type: exit_code
  - name: verify-service
    type: deterministic
    dependsOn: [edit-service]
    command: 'if git diff --quiet src/service.ts; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"'
    failOnError: true
  # Deterministic commit: never rely on agents to commit
  # (quoted because a plain YAML scalar cannot contain ": ")
  - name: commit
    type: deterministic
    dependsOn: [verify-service]
    command: 'git add src/types.ts src/service.ts && git commit -m "feat: add pending status"'
    failOnError: true
```
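Because the read, edit, and verify triplet repeats for every file, it can be generated instead of hand-written. The sketch below builds plain step objects that mirror the YAML fields shown above; `perFileSteps`, `PlainStep`, and `FileEdit` are hypothetical helpers, not part of the SDK, and the objects would still need to be passed to `.step()` or serialized to YAML.

```typescript
interface PlainStep {
  name: string;
  type?: 'deterministic';
  agent?: string;
  dependsOn?: string[];
  command?: string;
  task?: string;
  captureOutput?: boolean;
  failOnError?: boolean;
  verification?: { type: 'exit_code' };
}

interface FileEdit {
  id: string; // short label used in step names
  path: string;
  instruction: string;
}

// Emits read -> edit -> verify for each file, chaining each file after the previous gate.
function perFileSteps(edits: FileEdit[], agent: string): PlainStep[] {
  const steps: PlainStep[] = [];
  let previousGate = ''; // name of the last verify step; empty for the first file
  edits.forEach((edit) => {
    const read: PlainStep = {
      name: `read-${edit.id}`,
      type: 'deterministic',
      command: `cat ${edit.path}`,
      captureOutput: true,
    };
    if (previousGate !== '') read.dependsOn = [previousGate];
    const editStep: PlainStep = {
      name: `edit-${edit.id}`,
      agent,
      dependsOn: [read.name],
      task: `Edit ${edit.path}. Current contents:\n{{steps.${read.name}.output}}\n${edit.instruction}\nOnly edit this one file.`,
      verification: { type: 'exit_code' },
    };
    const verify: PlainStep = {
      name: `verify-${edit.id}`,
      type: 'deterministic',
      dependsOn: [editStep.name],
      command: `if git diff --quiet ${edit.path}; then echo "NOT MODIFIED"; exit 1; fi; echo "OK"`,
      failOnError: true,
    };
    steps.push(read, editStep, verify);
    previousGate = verify.name;
  });
  return steps;
}

const generated = perFileSteps(
  [
    { id: 'types', path: 'src/types.ts', instruction: "Add 'pending' to the Status union type." },
    { id: 'service', path: 'src/service.ts', instruction: 'Add a handlePending() method.' },
  ],
  'dev',
);
console.log(generated.map((s) => s.name).join(' -> '));
```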
### File Materialization: Verify Before Proceeding
After any step that creates files, add a deterministic `file_exists` check before proceeding. Non-interactive agents may exit 0 without writing anything (wrong cwd, stdout instead of disk).
```yaml
- name: verify-files
  type: deterministic
  dependsOn: [impl-auth, impl-storage]
  command: |
    missing=0
    for f in src/auth/credentials.ts src/storage/client.ts; do
      if [ ! -f "$f" ]; then echo "MISSING: $f"; missing=$((missing+1)); fi
    done
    if [ $missing -gt 0 ]; then echo "$missing files missing"; exit 1; fi
    echo "All files present"
  failOnError: true
```
### DAG Deadlock Anti-Pattern
```yaml
# WRONG (deadlock): coordinate depends on context, work-a depends on coordinate
steps:
  - name: coordinate
    dependsOn: [context] # lead waits for WORKER_DONE...
  - name: work-a
    dependsOn: [coordinate] # ...but work-a can't start until coordinate finishes
# RIGHT: workers and lead start in parallel
steps:
  - name: context
    type: deterministic
  - name: work-a
    dependsOn: [context] # starts with lead
  - name: coordinate
    dependsOn: [context] # starts with workers
  - name: merge
    dependsOn: [work-a, coordinate]
```
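This deadlock is not a graph cycle, so an ordinary cycle check will not catch it. A rough lint for it, assuming you know which step is the lead's coordination step and which steps belong to workers, is to flag any worker step that waits on the lead step directly or transitively. `workersBlockedOnLead` is a hypothetical helper.

```typescript
interface DagStep {
  name: string;
  dependsOn?: string[];
}

// Returns worker steps that cannot start until the lead step has finished.
function workersBlockedOnLead(
  steps: DagStep[],
  leadStep: string,
  workerSteps: string[],
): string[] {
  const deps: Record<string, string[]> = {};
  steps.forEach((s) => {
    deps[s.name] = s.dependsOn || [];
  });

  // Depth-first walk up the dependency chain, guarding against revisits.
  function reaches(from: string, target: string, seen: string[]): boolean {
    if (seen.indexOf(from) !== -1) return false;
    seen.push(from);
    return (deps[from] || []).some((dep) => dep === target || reaches(dep, target, seen));
  }

  return workerSteps.filter((w) => reaches(w, leadStep, []));
}

const wrongShape: DagStep[] = [
  { name: 'context' },
  { name: 'coordinate', dependsOn: ['context'] },
  { name: 'work-a', dependsOn: ['coordinate'] },
];
const rightShape: DagStep[] = [
  { name: 'context' },
  { name: 'work-a', dependsOn: ['context'] },
  { name: 'coordinate', dependsOn: ['context'] },
  { name: 'merge', dependsOn: ['work-a', 'coordinate'] },
];

console.log(workersBlockedOnLead(wrongShape, 'coordinate', ['work-a'])); // [ 'work-a' ]
console.log(workersBlockedOnLead(rightShape, 'coordinate', ['work-a'])); // []
```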
### Step Sizing
**One agent, one deliverable.** A step's task prompt should be 10-20 lines max.
```yaml
# Team pattern: lead + workers on a shared channel
steps:
  - name: track-lead-coord
    agent: track-lead
    dependsOn: [prior-step]
    task: |
      Lead the track on #my-track. Workers: track-worker-1, track-worker-2.
      Post assignments to the channel. Review worker output.
  - name: track-worker-1-impl
    agent: track-worker-1
    dependsOn: [prior-step] # same dep as lead, so it starts concurrently
    task: |
      Join #my-track. track-lead will post your assignment.
      Implement the file as directed.
    verification:
      type: exit_code
  - name: next-step
    dependsOn: [track-lead-coord] # downstream depends on lead, not workers
```
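The sizing rule is simple enough to check mechanically before a run. The sketch below flags steps whose task prompt exceeds a line budget; `oversizedTasks` is a hypothetical helper, and counting only non-blank lines is an assumption made here.

```typescript
interface SizedStep {
  name: string;
  task?: string;
}

// Returns the names of steps whose task prompt has more non-blank lines than maxLines.
function oversizedTasks(steps: SizedStep[], maxLines: number = 20): string[] {
  return steps
    .filter((s) => {
      const lines = (s.task || '').split('\n').filter((line) => line.trim() !== '');
      return lines.length > maxLines;
    })
    .map((s) => s.name);
}

// Build a 25-line prompt to stand in for an overgrown task.
let longPrompt = '';
for (let i = 1; i <= 25; i++) {
  longPrompt += 'Requirement ' + i + '\n';
}

const flagged = oversizedTasks([
  { name: 'track-worker-1-impl', task: 'Join #my-track.\nImplement the file as directed.' },
  { name: 'do-everything', task: longPrompt },
]);
console.log(flagged); // [ 'do-everything' ]
```

A flagged step is a candidate for the lead + workers split shown above.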
### Supervisor Pattern
When you set `.pattern('supervisor')` (or `hub-spoke`, `fan-out`), the runner auto-assigns a supervisor agent as owner for worker steps. The supervisor monitors progress, nudges idle workers, and issues `OWNER_DECISION`.
**Auto-hardening only activates for hub patterns** — not `pipeline` or `dag`.
| Use case | Pattern | Why |
| ------------------------- | ------------------- | -------------------------------- |
| Sequential, no monitoring | `pipeline` | Simple, no overhead |
| Workers need oversight | `supervisor` | Auto-owner monitors |
| Local/small models | `supervisor` | Supervisor catches stuck workers |
| All non-interactive | `pipeline` or `dag` | No PTY = no supervision needed |
### Concurrency
**Cap `maxConcurrency` at 4-6.** Spawning 10+ agents simultaneously causes broker timeouts.
| Parallel agents | `maxConcurrency` |
| --------------- | ---------------- |
| 2-4 | 4 (default safe) |
| 5-10 | 5 |
| 10+ | 6-8 max |
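
The table can be encoded as a small lookup when workflows are generated programmatically. `suggestMaxConcurrency` is a hypothetical helper; it treats exactly 10 agents as part of the 5-10 row and picks 6, the conservative end of the "6-8 max" range, for anything larger.

```typescript
// Maps the number of parallel agents to a maxConcurrency value per the table above.
function suggestMaxConcurrency(parallelAgents: number): number {
  if (parallelAgents <= 4) return 4; // 2-4 agents: default safe value
  if (parallelAgents <= 10) return 5; // 5-10 agents
  return 6; // 10+ agents: low end of the 6-8 range
}

console.log(suggestMaxConcurrency(3)); // 4
console.log(suggestMaxConcurrency(8)); // 5
console.log(suggestMaxConcurrency(16)); // 6
```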
### Common Mistakes
| Mistake | Fix |
| --------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| All workflows run sequentially | Group independent workflows into parallel waves (4-7x speedup) |
| Every step depends on the previous one | Only add `dependsOn` when there's a real data dependency |
| Self-review step with no timeout | Set `timeout: 300_000` (5 min) — Codex hangs in non-interactive review |
| One giant workflow per feature | Split into smaller workflows that can run in parallel waves |
| Adding exit instructions to tasks | Runner handles self-termination automatically |
| Setting `timeoutMs` on agents/steps | Use global `.timeout()` only |
| Using `general` channel | Set `.channel('wf-name')` for isolation |
| `{{steps.X.output}}` without `dependsOn: ['X']` | Output won't be available yet |
| Requiring exact sentinel as only completion gate | Use `exit_code` or `file_exists` verification |
| Writing 100-line task prompts | Split into lead + workers on a channel |
| `maxConcurrency: 16` with many parallel steps | Cap at 5-6 |
| Non-interactive agent reading large files via tools | Pre-read in deterministic step, inject via `{{steps.X.output}}` |
| Workers depending on lead step (deadlock) | Both depend on shared context step |
| `fan-out`/`hub-spoke` for simple parallel workers | Use `dag` instead |
| `pipeline` but expecting auto-supervisor | Only hub patterns auto-harden. Use `.pattern('supervisor')` |
| Workers without `preset: 'worker'` in one-shot DAG lead+worker flows | Add preset for clean stdout when chaining `{{steps.X.output}}` (not needed for interactive team patterns) |
| Using `_` in YAML numbers (`timeoutMs: 1_200_000`) | YAML doesn't support `_` separators |
| Workflow timeout under 30 min for complex workflows | Use `3600000` (1 hour) as default |
| Using `require()` in ESM projects | Check `package.json` for `"type": "module"` — use `import` if ESM |
| Wrapping in `async function main()` in ESM | ESM supports top-level `await` — no wrapper needed |
| Using `createWorkflowRenderer` | Does not exist. Use `.run({ cwd: process.cwd() })` |
| `export default workflow(...)...build()` | No `.build()`. Chain ends with `.run()` — the file must call `.run()`, not just export config |
| Relative import `'../workflows/builder.js'` | Use `import { workflow } from '@agent-relay/sdk/workflows'` |
| Hardcoded model strings (`model: 'opus'`) | Use constants: `import { ClaudeModels } from '@agent-relay/config'` → `model: ClaudeModels.OPUS` |
| Thinking `agent-relay run` inspects exports | It executes the file as a subprocess. Only `.run()` invocations trigger steps |
| `pattern('single')` on cloud runner | Not supported — use `dag` |
| `pattern('supervisor')` with one agent | Same agent is owner + specialist. Use `dag` |
| Invalid verification type (`type: 'deterministic'`) | Only `exit_code`, `output_contains`, `file_exists`, `custom` are valid |
| Chaining `{{steps.X.output}}` from interactive agents | PTY output is garbled. Use deterministic steps or `preset: 'worker'` |
| Single step editing 4+ files | Agents modify 1-2 then exit. Split to one file per step with verify gates |
| Relying on agents to `git commit` | Agents emit markers without running git. Use deterministic commit step |
| File-writing steps without `file_exists` verification | `exit_code` auto-passes even if no file written |
| Manual peer fanout in `handleChannelMessage()` | Use broker-managed channel subscriptions — broker fans out to all subscribers automatically |
| Client-side `personaNames.has(from)` filtering | Use `relay.subscribe()`/`relay.unsubscribe()` — only subscribed agents receive messages |
| Agents receiving noisy cross-channel messages during focused work | Use `relay.mute({ agent, channel })` to silence non-primary channels without leaving them |
| Hardcoding all channels at spawn time | Use `agent.subscribe()` / `agent.unsubscribe()` for dynamic channel membership post-spawn |
| Using `preset: 'worker'` for Codex in _interactive team_ patterns when coordination is needed | Codex interactive mode works fine with PTY channel injection. Drop the preset for interactive team patterns (keep it for one-shot DAG workers where clean stdout matters) |
| Separate reviewer agent from lead in interactive team | Merge lead + reviewer into one interactive Claude agent — reviews between rounds, fewer agents |
| Not printing PR URL after `gh pr create` | Add a final deterministic step: `echo "PR: $(cat pr-url.txt)"` or capture in the `gh pr create` command |
| Workflow ending without worktree + PR for cross-repo changes | Add `setup-worktree` at start and `push-and-pr` + `cleanup-worktree` at end |
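
One of the mistakes above, referencing `{{steps.X.output}}` without `dependsOn: ['X']`, lends itself to a pre-run lint. The sketch below scans task text for output references and reports any that are not backed by a direct dependency. `missingOutputDeps` is a hypothetical helper, and requiring a direct (not transitive) dependency is an assumption.

```typescript
interface LintStep {
  name: string;
  task?: string;
  dependsOn?: string[];
}

// Reports steps that reference another step's output without depending on it.
function missingOutputDeps(steps: LintStep[]): string[] {
  const problems: string[] = [];
  steps.forEach((step) => {
    const task = step.task || '';
    const deps = step.dependsOn || [];
    const pattern = /\{\{\s*steps\.([\w-]+)\.output\s*\}\}/g;
    let match = pattern.exec(task);
    while (match !== null) {
      const referenced = match[1];
      if (referenced !== undefined && deps.indexOf(referenced) === -1) {
        problems.push(step.name + ': uses output of "' + referenced + '" without dependsOn');
      }
      match = pattern.exec(task); // the global flag advances lastIndex on each call
    }
  });
  return problems;
}

const lintReport = missingOutputDeps([
  { name: 'plan', task: 'Produce a detailed implementation plan.' },
  { name: 'implement', task: 'Implement: {{steps.plan.output}}' },
  { name: 'test', task: 'Test: {{steps.implement.output}}', dependsOn: ['implement'] },
]);
console.log(lintReport); // [ 'implement: uses output of "plan" without dependsOn' ]
```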
### YAML Alternative
```yaml
version: '1.0'
name: my-workflow
swarm:
  pattern: dag
  channel: wf-my-workflow
agents:
  - name: lead
    cli: claude
    role: Architect
  - name: worker
    cli: codex
    role: Implementer
workflows:
  - name: default
    steps:
      - name: plan
        agent: lead
        task: 'Produce a detailed implementation plan.'
      - name: implement
        agent: worker
        task: 'Implement: {{steps.plan.output}}'
        dependsOn: [plan]
        verification:
          type: exit_code
```
### Available Swarm Patterns
`dag` (default), `fan-out`, `pipeline`, `hub-spoke`, `consensus`, `mesh`, `handoff`, `cascade`, `debate`, `hierarchical`, `map-reduce`, `scatter-gather`, `supervisor`, `reflection`, `red-team`, `verifier`, `auction`, `escalation`, `saga`, `circuit-breaker`, `blackboard`, `swarm`
See skill `choosing-swarm-patterns` for pattern selection guidance.