enterprise-agent-ops

Operate long-running agent workloads with observability, security boundaries, and lifecycle management.

144,923 stars
Complexity: easy

About this skill

This skill provides AI agents with a comprehensive framework for operating cloud-hosted or continuously running agent systems that demand robust control beyond simple CLI sessions. It outlines key operational domains including runtime lifecycle management (start, pause, stop, restart), observability (logs, metrics, tracing), security controls (scope, permissions, emergency stops), and change management (releases, rollbacks, auditing). The skill also details essential baseline controls such as immutable deployments, least privilege, environment-level secret injection, hard timeouts, retry budgets, and audit logging for high-risk operations. Furthermore, it guides the agent on crucial metrics to track (success rate, recovery time, cost per task) and provides structured incident handling patterns for managing system failures effectively. This skill allows an AI agent to reason about, plan, and guide the implementation of production-ready operational strategies.

Best use case

1. **Designing Operational Frameworks**: When an AI agent is tasked with planning the operational setup for a new AI application, ensuring robustness, security, and scalability.
2. **Incident Response Guidance**: Assisting human operators or other automated systems during critical incidents by suggesting steps based on established incident handling patterns.
3. **Auditing and Compliance**: Helping an agent identify areas of non-compliance with operational best practices or suggesting improvements to existing setups.
4. **Evaluating Agent Systems**: When an agent needs to assess the operational maturity or readiness of an existing AI system.

The expected result is an AI agent that can articulate, plan, and evaluate enterprise-grade operational strategies for AI agent systems, leading to more resilient, secure, and cost-efficient deployments. The agent demonstrates awareness of best practices for lifecycle, security, observability, and incident management.

Practical example

Example input

I need to ensure our new AI agent for financial transactions is robust and secure. What are the key operational areas and controls I should consider?

Example output

```
To ensure your AI agent for financial transactions is robust and secure, we must focus on enterprise-grade operational practices, guided by the 'enterprise-agent-ops' skill:

**1. Operational Domains:**
*   **Runtime Lifecycle:** Implement robust mechanisms for `start`, `pause`, `stop`, and `restart` with automated recovery strategies.
*   **Observability:** Set up comprehensive `logging` (structured, searchable), `metrics` (success rate, latency, resource usage, financial cost per transaction), and `tracing` for transaction paths.
*   **Security Controls:** Define strict `scope` and `permissions` based on least privilege. Implement an `emergency stop switch` for critical situations.
*   **Change Management:** Establish procedures for `releases` (controlled rollouts), `rollbacks` (fast recovery), and detailed `auditing` of all changes.

**2. Baseline Controls:**
*   **Immutable Deployments:** Deploy agent artifacts that are not modified after deployment.
*   **Minimal Privilege Credentials:** Ensure the agent and its services operate with the absolute minimum necessary permissions.
*   **Environment-level Secret Injection:** Inject sensitive credentials (e.g., API keys for financial services) securely at runtime, not hardcoded.
*   **Hard Timeouts & Retry Budgets:** Define strict limits for task execution and retry attempts to prevent runaway processes or resource exhaustion.
*   **Audit Logs for High-Risk Operations:** Every financial transaction or sensitive operation must be meticulously logged for auditing purposes.

**3. Metrics to Track:**
*   **Success Rate:** Percentage of successful transactions.
*   **Average Retries per Task:** Identify flaky operations.
*   **Recovery Time:** How quickly the agent recovers from failures.
*   **Cost per Successful Task:** Monitor operational efficiency, especially for cloud resources.
*   **Failure Category Distribution:** Understand common failure modes (e.g., network, API, data errors).

**4. Incident Handling Pattern (when failures surge):**
1.  **Freeze New Releases:** Prevent further potential issues.
2.  **Capture Representative Trace Data:** Gather diagnostics.
3.  **Isolate Fault Path:** Pinpoint the source of the problem.
4.  **Apply Minimal Safe Patch:** Deploy targeted fixes, prioritizing security.
5.  **Run Regression Tests + Security Checks:** Verify the fix and prevent regressions.
6.  **Gradually Restore:** Bring services back online carefully.

Consider integrating with tools like container orchestrators for deployment and monitoring, and leverage CI/CD for automated change management.
```
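The "Audit Logs for High-Risk Operations" control in the example output could be sketched roughly as follows. This is a minimal illustration, not part of the skill itself: the `audited` decorator, the `transfer` function, and the in-memory list used as a log sink are all hypothetical names, and a real deployment would write to durable, append-only storage instead.

```python
import functools
import json
import time


def audited(operation, sink):
    """Decorator that appends one JSON audit entry per call to `sink`.

    `sink` is any object with an append() method (hypothetical; a list here).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Assume failure until the call returns normally.
            entry = {"operation": operation, "timestamp": time.time(), "outcome": "error"}
            try:
                result = func(*args, **kwargs)
                entry["outcome"] = "ok"
                return result
            finally:
                # The entry is written whether the call succeeded or raised.
                sink.append(json.dumps(entry))
        return wrapper
    return decorator


audit_log = []


@audited("transfer", audit_log)
def transfer(amount):
    # Hypothetical high-risk operation used only for illustration.
    if amount <= 0:
        raise ValueError("amount must be positive")
    return amount
```

Writing the entry in a `finally` clause means failed calls are recorded as well as successful ones, which is usually what an auditor wants to see.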

When to use this skill

  • Use this skill when an AI agent needs to understand, apply, or reason about enterprise-level operational principles for managing AI agent deployments. This is particularly relevant for long-running, mission-critical, or cloud-hosted agent workloads where reliability, security, and maintainability are paramount. It's ideal for tasks involving strategic planning, system design, or post-deployment operational analysis.

When not to use this skill

  • This skill is not for direct execution of system commands like `start_agent()` or `get_logs()`. Instead, it provides the *intelligence and guidelines* for *how* such commands should be invoked or *what operational considerations* should drive their use. It doesn't replace specific tool-calling skills for interacting with PM2, systemd, or container orchestrators, but rather informs their intelligent application.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/enterprise-agent-ops/SKILL.md --create-dirs "https://raw.githubusercontent.com/affaan-m/everything-claude-code/main/docs/zh-CN/skills/enterprise-agent-ops/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/enterprise-agent-ops/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How enterprise-agent-ops Compares

| Feature / Agent | enterprise-agent-ops | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | easy | N/A |

Frequently Asked Questions

What does this skill do?

It operates long-running agent workloads with observability, security boundaries, and lifecycle management.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

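# Enterprise Agent Ops

Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond a single CLI session.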

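## Operational Domains

1. Runtime lifecycle (start, pause, stop, restart)
2. Observability (logs, metrics, traces)
3. Security controls (scopes, permissions, kill switches)
4. Change management (rollout, rollback, audit)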
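
The runtime lifecycle domain can be pictured as a small state machine. The sketch below is only one possible shape, under stated assumptions: the `State` enum, the `AgentLifecycle` class, and the extra `resume` action are hypothetical and not defined by the skill, and disallowing `restart` from the stopped state is a design choice made for the example.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


# Allowed transitions: action -> {current state: next state}.
# "resume" is an assumption added so a paused agent can continue.
TRANSITIONS = {
    "start": {State.STOPPED: State.RUNNING},
    "pause": {State.RUNNING: State.PAUSED},
    "resume": {State.PAUSED: State.RUNNING},
    "stop": {State.RUNNING: State.STOPPED, State.PAUSED: State.STOPPED},
    "restart": {State.RUNNING: State.RUNNING, State.PAUSED: State.RUNNING},
}


class AgentLifecycle:
    def __init__(self):
        self.state = State.STOPPED
        self.history = []  # (action, from_state, to_state) tuples for auditing

    def apply(self, action):
        """Apply a lifecycle action, rejecting transitions that are not allowed."""
        allowed = TRANSITIONS.get(action, {})
        if self.state not in allowed:
            raise ValueError(f"cannot {action} while {self.state.value}")
        next_state = allowed[self.state]
        self.history.append((action, self.state, next_state))
        self.state = next_state
        return next_state
```

Keeping the transition table as data makes it easy to review which actions are legal in each state, and the `history` list gives change management an audit trail for free.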

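## Baseline Controls

* Immutable deployment artifacts
* Least-privilege credentials
* Environment-level secret injection
* Hard timeouts and retry budgets
* Audit logs for high-risk operations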
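
As an illustration of hard timeouts and retry budgets, the sketch below runs a task as a child process using the standard library's `subprocess.run`, which kills the child when the timeout expires. The function name `run_with_limits` and its parameters are hypothetical, and running each task as a separate process is an assumption made to keep the timeout genuinely hard.

```python
import subprocess
import sys


def run_with_limits(cmd, timeout_seconds, max_attempts):
    """Run `cmd` as a child process with a hard timeout and a bounded retry budget.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child and raises if the timeout is exceeded.
            result = subprocess.run(cmd, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            continue  # a timed-out attempt still counts against the budget
        if result.returncode == 0:
            return True, attempt
    return False, max_attempts


if __name__ == "__main__":
    # Example: a trivial command that exits immediately with status 0.
    print(run_with_limits([sys.executable, "-c", "pass"], timeout_seconds=30, max_attempts=3))
```

Because both limits are fixed up front, the worst-case time spent on one task is bounded by roughly `timeout_seconds * max_attempts`, which helps prevent runaway work.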

## Metrics to Track

* Success rate
* Average retries per task
* Time to recovery
* Cost per successful task
* Failure category distribution
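
The metrics above could be computed from per-task records along these lines. The `TaskRecord` fields and the `summarize` function are hypothetical, and the skill does not define the formulas; in particular, dividing the total cost of all tasks (failed ones included) by the number of successes is just one reasonable reading of "cost per successful task".

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    succeeded: bool
    retries: int
    cost: float
    failure_category: Optional[str] = None    # set only for failed tasks
    recovery_seconds: Optional[float] = None  # set only when a recovery was measured


def summarize(records):
    """Aggregate task records into the five tracked metrics."""
    total = len(records)
    successes = [r for r in records if r.succeeded]
    recoveries = [r.recovery_seconds for r in records if r.recovery_seconds is not None]
    total_cost = sum(r.cost for r in records)
    return {
        "success_rate": len(successes) / total if total else None,
        "avg_retries_per_task": sum(r.retries for r in records) / total if total else None,
        "mean_recovery_seconds": sum(recoveries) / len(recoveries) if recoveries else None,
        "cost_per_successful_task": total_cost / len(successes) if successes else None,
        "failure_categories": dict(
            Counter(r.failure_category for r in records if not r.succeeded)
        ),
    }
```

Metrics with no data return `None` rather than zero, so a dashboard can distinguish "nothing measured yet" from a genuinely zero value.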

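## Incident Handling Pattern

When failures surge:

1. Freeze new releases
2. Capture representative trace data
3. Isolate the fault path
4. Patch with the smallest safe change
5. Run regression tests + security checks
6. Restore gradually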
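
The first step of this pattern, freezing new releases when failures surge, might be automated with a gate like the one sketched here. Everything in it is an assumption for illustration: the `ReleaseGate` class, the sliding-window size, and the failure-rate threshold are hypothetical and would need tuning for a real system.

```python
from collections import deque


class ReleaseGate:
    """Freeze releases when the failure rate over a sliding window crosses a threshold."""

    def __init__(self, window=20, max_failure_rate=0.5):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_rate = max_failure_rate
        self.frozen = False

    def record(self, succeeded):
        """Record one task outcome and freeze releases if failures have surged."""
        self.outcomes.append(succeeded)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        if window_full and failure_rate > self.max_failure_rate:
            self.frozen = True

    def can_release(self):
        return not self.frozen

    def unfreeze(self):
        """Call only after the patch has passed regression and security checks."""
        self.frozen = False
        self.outcomes.clear()
```

The gate never unfreezes on its own: lifting the freeze is left as an explicit call so that it happens after the regression and security checks, in line with the last two steps of the pattern.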

## Deployment Integration

This skill works alongside:

* PM2 workflows
* systemd services
* Container orchestrators
* CI/CD gates

Related Skills

workspace-surface-audit

144923
from affaan-m/everything-claude-code

Audit the active repo, MCP servers, plugins, connectors, env surfaces, and harness setup, then recommend the highest-value ECC-native skills, hooks, agents, and operator workflows. Use when the user wants help setting up Claude Code or understanding what capabilities are actually available in their environment.

Development · Claude

ui-demo

144923
from affaan-m/everything-claude-code

Record polished UI demo videos using Playwright. Use when the user asks to create a demo, walkthrough, screen recording, or tutorial video of a web application. Produces WebM videos with visible cursor, natural pacing, and professional feel.

Developer Tools · Claude

token-budget-advisor

144923
from affaan-m/everything-claude-code

Offers the user an informed choice about how much response depth to consume before answering. Use this skill when the user explicitly wants to control response length, depth, or token budget. TRIGGER when: "token budget", "token count", "token usage", "token limit", "response length", "answer depth", "short version", "brief answer", "detailed answer", "exhaustive answer", "respuesta corta vs larga", "cuántos tokens", "ahorrar tokens", "responde al 50%", "dame la versión corta", "quiero controlar cuánto usas", or clear variants where the user is explicitly asking to control answer size or depth. DO NOT TRIGGER when: user has already specified a level in the current session (maintain it), the request is clearly a one-word answer, or "token" refers to auth/session/payment tokens rather than response size.

Productivity & Content Creation · Claude

skill-comply

144923
from affaan-m/everything-claude-code

Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines

Development · Claude

santa-method

144923
from affaan-m/everything-claude-code

Multi-agent adversarial verification with convergence loop. Two independent review agents must both pass before output ships.

Quality Assurance · Claude

safety-guard

144923
from affaan-m/everything-claude-code

Use this skill to prevent destructive operations when working on production systems or running agents autonomously.

Development · Claude

repo-scan

144923
from affaan-m/everything-claude-code

Cross-stack source code asset audit — classifies every file, detects embedded third-party libraries, and delivers actionable four-level verdicts per module with interactive HTML reports.

Development · Claude

project-flow-ops

144923
from affaan-m/everything-claude-code

Operate execution flow across GitHub and Linear by triaging issues and pull requests, linking active work, and keeping GitHub public-facing while Linear remains the internal execution layer. Use when the user wants backlog control, PR triage, or GitHub-to-Linear coordination.

Development · Claude

product-lens

144923
from affaan-m/everything-claude-code

Use this skill to validate the "why" before building, run product diagnostics, and pressure-test product direction before the request becomes an implementation contract.

Product Management · Claude

openclaw-persona-forge

144923
from affaan-m/everything-claude-code

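Forge a complete lobster-soul design for an OpenClaw AI Agent. Based on user preferences or a random gacha draw, it outputs the identity positioning, a soul description (SOUL.md), in-character bottom-line rules, a name, and an image-generation prompt for the avatar. If the current environment provides a vetted image-generation skill, it can automatically generate avatar images in a unified style. Use when the user needs to create, design, or customize an OpenClaw lobster soul. Not suitable for: fine-tuning an existing SOUL.md, character design for non-OpenClaw platforms, or purely utilitarian agents with no personality. Trigger words: 龙虾灵魂, 虾魂, OpenClaw 灵魂, 养虾灵魂, 龙虾角色, 龙虾定位, 龙虾剧本杀角色, 龙虾游戏角色, 龙虾 NPC, 龙虾性格, 龙虾背景故事, lobster soul, lobster character, 抽卡, 随机龙虾, 龙虾 SOUL, gacha.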

AI Tools & Utilities · Claude

manim-video

144923
from affaan-m/everything-claude-code

Build reusable Manim explainers for technical concepts, graphs, system diagrams, and product walkthroughs, then hand off to the wider ECC video stack if needed. Use when the user wants a clean animated explainer rather than a generic talking-head script.

Development · Claude

laravel-plugin-discovery

144923
from affaan-m/everything-claude-code

Discover and evaluate Laravel packages via LaraPlugins.io MCP. Use when the user wants to find plugins, check package health, or assess Laravel/PHP compatibility.

Development · Claude