enterprise-agent-ops

Operate long-running agent workloads with observability, security boundaries, and lifecycle management.

144,923 stars

Complexity: easy

About this skill

This skill provides AI agents with a comprehensive framework for operating cloud-hosted or continuously running agent systems that demand robust control beyond simple CLI sessions. It outlines key operational domains including runtime lifecycle management (start, pause, stop, restart), observability (logs, metrics, tracing), security controls (scope, permissions, emergency stops), and change management (releases, rollbacks, auditing). The skill also details essential baseline controls such as immutable deployments, least privilege, environment-level secret injection, hard timeouts, retry budgets, and audit logging for high-risk operations. Furthermore, it guides the agent on crucial metrics to track (success rate, recovery time, cost per task) and provides structured incident handling patterns for managing system failures effectively. This skill allows an AI agent to reason about, plan, and guide the implementation of production-ready operational strategies.

Best use case

1. **Designing Operational Frameworks**: When an AI agent is tasked with planning the operational setup for a new AI application, ensuring robustness, security, and scalability.
2. **Incident Response Guidance**: Assisting human operators or other automated systems during critical incidents by suggesting steps based on established incident handling patterns.
3. **Auditing and Compliance**: Helping an agent identify areas of non-compliance with operational best practices or suggesting improvements to existing setups.
4. **Evaluating Agent Systems**: When an agent needs to assess the operational maturity or readiness of an existing AI system.

With this skill, an AI agent can articulate, plan, and evaluate enterprise-grade operational strategies for AI agent systems, leading to more resilient, secure, and cost-efficient deployments. The agent demonstrates awareness of best practices for lifecycle, security, observability, and incident management.

Practical example

Example input

I need to ensure our new AI agent for financial transactions is robust and secure. What are the key operational areas and controls I should consider?

Example output

```
To ensure your AI agent for financial transactions is robust and secure, we must focus on enterprise-grade operational practices, guided by the 'enterprise-agent-ops' skill:

**1. Operational Domains:**
* **Runtime Lifecycle:** Implement robust mechanisms for `start`, `pause`, `stop`, and `restart` with automated recovery strategies.
* **Observability:** Set up comprehensive `logging` (structured, searchable), `metrics` (success rate, latency, resource usage, financial cost per transaction), and `tracing` for transaction paths.
* **Security Controls:** Define strict `scope` and `permissions` based on least privilege. Implement an `emergency stop switch` for critical situations.
* **Change Management:** Establish procedures for `releases` (controlled rollouts), `rollbacks` (fast recovery), and detailed `auditing` of all changes.

**2. Baseline Controls:**
* **Immutable Deployments:** Deploy agent artifacts that are not modified after deployment.
* **Minimal Privilege Credentials:** Ensure the agent and its services operate with the absolute minimum necessary permissions.
* **Environment-level Secret Injection:** Inject sensitive credentials (e.g., API keys for financial services) securely at runtime, not hardcoded.
* **Hard Timeouts & Retry Budgets:** Define strict limits for task execution and retry attempts to prevent runaway processes or resource exhaustion.
* **Audit Logs for High-Risk Operations:** Every financial transaction or sensitive operation must be meticulously logged for auditing purposes.

**3. Metrics to Track:**
* **Success Rate:** Percentage of successful transactions.
* **Average Retries per Task:** Identify flaky operations.
* **Recovery Time:** How quickly the agent recovers from failures.
* **Cost per Successful Task:** Monitor operational efficiency, especially for cloud resources.
* **Failure Category Distribution:** Understand common failure modes (e.g., network, API, data errors).

**4. Incident Handling Pattern (when failures surge):**
1. **Freeze New Releases:** Prevent further potential issues.
2. **Capture Representative Trace Data:** Gather diagnostics.
3. **Isolate Fault Path:** Pinpoint the source of the problem.
4. **Apply Minimal Safe Patch:** Deploy targeted fixes, prioritizing security.
5. **Run Regression Tests + Security Checks:** Verify the fix and prevent regressions.
6. **Gradually Restore:** Bring services back online carefully.

Consider integrating with tools like container orchestrators for deployment and monitoring, and leverage CI/CD for automated change management.
```
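
Two of the baseline controls in the example output, environment-level secret injection and audit logs for high-risk operations, can be sketched in a few lines of Python. This is a minimal illustration, not part of the skill itself: the function names `load_secret` and `audit_record`, the variable name `PAYMENTS_API_KEY`, and the log field names are all hypothetical.

```python
import json
import os
import time


def load_secret(name: str) -> str:
    """Read a secret that the environment injected at runtime (never hardcoded)."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"missing required secret: {name}")
    return value


def audit_record(actor: str, action: str, details: dict) -> str:
    """Build one JSON line describing a high-risk operation (field names are assumptions)."""
    entry = {"ts": time.time(), "actor": actor, "action": action, "details": details}
    return json.dumps(entry, sort_keys=True)


if __name__ == "__main__":
    # Hypothetical usage: the deployment platform is assumed to set this variable.
    os.environ.setdefault("PAYMENTS_API_KEY", "example-only")
    api_key = load_secret("PAYMENTS_API_KEY")
    # Log the operation, but never the secret itself.
    print(audit_record("agent-7", "transfer", {"amount": 100, "currency": "USD"}))
```

Failing fast on a missing secret keeps a misconfigured agent from starting at all, and emitting one JSON object per line keeps the audit trail easy to search.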

When to use this skill

Use this skill when an AI agent needs to understand, apply, or reason about enterprise-level operational principles for managing AI agent deployments. This is particularly relevant for long-running, mission-critical, or cloud-hosted agent workloads where reliability, security, and maintainability are paramount. It's ideal for tasks involving strategic planning, system design, or post-deployment operational analysis.

When not to use this skill

This skill is not for direct execution of system commands like `start_agent()` or `get_logs()`. Instead, it provides the *intelligence and guidelines* for *how* such commands should be invoked or *what operational considerations* should drive their use. It doesn't replace specific tool-calling skills for interacting with PM2, systemd, or container orchestrators, but rather informs their intelligent application.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/enterprise-agent-ops/SKILL.md --create-dirs "https://raw.githubusercontent.com/affaan-m/everything-claude-code/main/docs/zh-CN/skills/enterprise-agent-ops/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/enterprise-agent-ops/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How enterprise-agent-ops Compares

| Feature / Agent | enterprise-agent-ops | Standard Approach |
| --- | --- | --- |
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Easy | N/A |

Frequently Asked Questions

What does this skill do?

It helps an agent operate long-running agent workloads with observability, security boundaries, and lifecycle management.

Which AI agents support this skill?

This skill is designed for Claude.

How difficult is it to install?

The installation complexity is rated as easy. You can find the installation instructions above.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

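# Enterprise Agent Ops

Use this skill for cloud-hosted or continuously running agent systems that need operational controls beyond a single CLI session.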

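## Operational Domains

1. Runtime lifecycle (start, pause, stop, restart)
2. Observability (logs, metrics, traces)
3. Security controls (scopes, permissions, kill switches)
4. Change management (rollouts, rollbacks, audits)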
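
The runtime lifecycle and kill-switch items can be sketched as a small state machine. This is an illustrative sketch under stated assumptions: the `AgentLifecycle` class, the transition table, and the choice to treat `start` from a paused state as a resume are hypothetical and not defined by the skill.

```python
from enum import Enum


class State(Enum):
    STOPPED = "stopped"
    RUNNING = "running"
    PAUSED = "paused"


# Allowed (command, current state) -> next state. Hypothetical policy.
TRANSITIONS = {
    ("start", State.STOPPED): State.RUNNING,
    ("start", State.PAUSED): State.RUNNING,  # treated as "resume"
    ("pause", State.RUNNING): State.PAUSED,
    ("stop", State.RUNNING): State.STOPPED,
    ("stop", State.PAUSED): State.STOPPED,
    ("restart", State.RUNNING): State.RUNNING,
    ("restart", State.PAUSED): State.RUNNING,
}


class AgentLifecycle:
    def __init__(self):
        self.state = State.STOPPED
        self.kill_switch = False
        self.history = []  # (command, old_state, new_state) tuples for auditing

    def apply(self, command: str) -> State:
        """Apply a lifecycle command, rejecting invalid or blocked transitions."""
        if self.kill_switch and command != "stop":
            raise PermissionError("kill switch engaged; only 'stop' is allowed")
        key = (command, self.state)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {command} from {self.state.value}")
        new_state = TRANSITIONS[key]
        self.history.append((command, self.state, new_state))
        self.state = new_state
        return new_state

    def engage_kill_switch(self) -> None:
        """Stop the agent if needed and block everything except 'stop'."""
        self.kill_switch = True
        if self.state is not State.STOPPED:
            self.apply("stop")
```

Keeping the transitions in one table makes the allowed operations easy to review, and the `history` list doubles as a simple change record.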

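## Baseline Controls

* Immutable deployment artifacts
* Least-privilege credentials
* Environment-level secret injection
* Hard timeouts and retry budgets
* Audit logs for high-risk operations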
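
As one possible reading of "hard timeouts and retry budgets", the sketch below runs each task as a child process, lets `subprocess.run` kill it when the per-attempt timeout expires, and gives up after a fixed number of attempts. The names `run_with_limits` and `RetryBudgetExhausted` are hypothetical, and running tasks as subprocesses is an assumption made for the example.

```python
import subprocess
import sys


class RetryBudgetExhausted(RuntimeError):
    """Raised when every allowed attempt has failed or timed out."""


def run_with_limits(cmd, timeout_s, max_attempts):
    """Run cmd with a hard per-attempt timeout and a bounded number of attempts.

    Returns (attempts_used, stdout). Raises RetryBudgetExhausted otherwise.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child process if the timeout expires.
            result = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout_s, check=True
            )
            return attempt, result.stdout
        except (subprocess.TimeoutExpired, subprocess.CalledProcessError) as exc:
            last_error = exc
    raise RetryBudgetExhausted(
        f"gave up after {max_attempts} attempts: {last_error!r}"
    )


if __name__ == "__main__":
    # Hypothetical task: a short Python one-liner standing in for real agent work.
    attempts, output = run_with_limits(
        [sys.executable, "-c", "print('task done')"], timeout_s=5, max_attempts=3
    )
    print(attempts, output.strip())
```

A production setup would likely add backoff between attempts and record each failure; both are omitted here to keep the sketch short.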

## Metrics to Track

* Success rate
* Average retries per task
* Recovery time
* Cost per successful task
* Failure category distribution
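
The five metrics above could be computed from per-task records roughly as follows. The `TaskRecord` fields and the `summarize` function are hypothetical; a real system would pull these values from its own logs or metrics store.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskRecord:
    succeeded: bool
    retries: int
    cost_usd: float
    failure_category: Optional[str] = None    # e.g. "network", "api", "data"
    recovery_seconds: Optional[float] = None  # time taken to recover after a failure


def summarize(records):
    """Aggregate task records into the five tracked metrics."""
    total = len(records)
    if total == 0:
        raise ValueError("no task records to summarize")
    successes = sum(1 for r in records if r.succeeded)
    recoveries = [r.recovery_seconds for r in records if r.recovery_seconds is not None]
    total_cost = sum(r.cost_usd for r in records)
    return {
        "success_rate": successes / total,
        "avg_retries_per_task": sum(r.retries for r in records) / total,
        "mean_recovery_seconds": sum(recoveries) / len(recoveries) if recoveries else None,
        # Failed tasks still cost money, so total cost is divided by successes only.
        "cost_per_successful_task": total_cost / successes if successes else None,
        "failure_categories": dict(
            Counter(r.failure_category for r in records if not r.succeeded)
        ),
    }
```

Dividing total cost by successful tasks only, rather than by all tasks, makes wasted spend on failures visible in the number.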

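## Incident Handling Pattern

When failures surge:

1. Freeze new releases
2. Capture representative trace data
3. Isolate the failing path
4. Patch with the smallest safe change
5. Run regression tests and security checks
6. Restore gradually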
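
Steps 1, 5, and 6 lend themselves to automation. The sketch below freezes releases when the recent failure rate crosses a threshold and restores traffic in stages only after both checks pass. Everything here is an assumption made for illustration: the `ReleaseGate` class, the window size, the threshold, the staged percentages, and the choice to drain traffic to zero on freeze.

```python
from collections import deque


class ReleaseGate:
    """Freeze releases on a failure surge; restore traffic in stages."""

    def __init__(self, window=20, max_failure_rate=0.3, restore_steps=(10, 50, 100)):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.max_failure_rate = max_failure_rate
        self.restore_steps = restore_steps    # traffic percentages for staged restore
        self.frozen = False
        self.traffic_percent = 100

    def record(self, succeeded: bool) -> None:
        """Record one task outcome and freeze if the window shows a surge."""
        self.outcomes.append(succeeded)
        window_full = len(self.outcomes) == self.outcomes.maxlen
        failure_rate = self.outcomes.count(False) / len(self.outcomes)
        if window_full and failure_rate > self.max_failure_rate:
            self.frozen = True        # step 1: freeze new releases
            self.traffic_percent = 0  # assumption: drain traffic while frozen

    def can_release(self) -> bool:
        return not self.frozen

    def restore(self, regression_passed: bool, security_passed: bool) -> int:
        """Advance one restore stage, but only if both checks passed (steps 5 and 6)."""
        if not (regression_passed and security_passed):
            return self.traffic_percent
        for step in self.restore_steps:
            if step > self.traffic_percent:
                self.traffic_percent = step
                break
        if self.traffic_percent == self.restore_steps[-1]:
            self.frozen = False       # fully restored: releases may resume
            self.outcomes.clear()
        return self.traffic_percent
```

Steps 2 through 4 (capturing traces, isolating the failing path, and patching) depend on human or agent judgment and are deliberately left out of the sketch.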

## Deployment Integrations

This skill works with:

* PM2 workflows
* systemd services
* Container orchestrators
* CI/CD gates
