check-ceph-health

Check Ceph storage health on OpenShift OCS/ODF clusters. Use when PVCs are stuck in Pending, storage provisioning fails, Ceph is degraded, OSDs are full, or cluster storage needs diagnosis.

Best use case

check-ceph-health is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using check-ceph-health can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

curl -o ~/.claude/skills/check-ceph-health/SKILL.md --create-dirs "https://raw.githubusercontent.com/diegosouzapw/awesome-omni-skill/main/skills/data-ai/check-ceph-health/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/check-ceph-health/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How check-ceph-health Compares

| Feature / Agent | check-ceph-health | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Check Ceph Health

Use this guide to diagnose and remediate Ceph storage issues on OpenShift clusters running OCS (OpenShift Container Storage) or its successor, ODF (OpenShift Data Foundation).

## 1. Ceph Cluster Health

```bash
# Quick health status
kubectl -n openshift-storage get cephcluster -o jsonpath='{.items[*].status.ceph.health}'

# Detailed health with error messages
kubectl -n openshift-storage get cephcluster -o jsonpath='{.items[*].status.ceph.details}' | python3 -m json.tool

# Capacity overview (bytesAvailable, bytesUsed, bytesTotal)
kubectl -n openshift-storage get cephcluster -o jsonpath='{.items[*].status.ceph.capacity}' | python3 -m json.tool
```

Health states:
- `HEALTH_OK` -- cluster is healthy
- `HEALTH_WARN` -- degraded but functional (backfillfull, nearfull, degraded PGs)
- `HEALTH_ERR` -- critical, writes may be blocked (full OSDs, too few OSDs, down PGs)
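
The three states map onto a simple triage decision. The following is a minimal Python sketch, assuming the health string comes from the `status.ceph.health` query above; the helper name `triage` and the wording of the actions are illustrative and not part of the skill.

```python
# Hypothetical helper: map a Ceph health string to a suggested next step.
ACTIONS = {
    "HEALTH_OK": "no action needed",
    "HEALTH_WARN": "investigate soon: cluster is degraded but functional",
    "HEALTH_ERR": "act now: writes may be blocked",
}


def triage(health: str) -> str:
    """Return a suggested action; unknown or empty values are flagged, not guessed."""
    return ACTIONS.get(health.strip(), f"unrecognized health state: {health!r}")


if __name__ == "__main__":
    for state in ("HEALTH_OK", "HEALTH_WARN", "HEALTH_ERR"):
        print(f"{state}: {triage(state)}")
```

Flagging unrecognized values explicitly helps when the jsonpath query returns an empty string, for example when no CephCluster resource exists in the namespace.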

## 2. Running Ceph Commands

OCS/ODF clusters may not have a rook-ceph-tools pod deployed. Use a mon pod to run ceph commands directly.

```bash
# Find the mon pod and its service address
MON_POD=$(kubectl -n openshift-storage get pods -l app=rook-ceph-mon -o jsonpath='{.items[0].metadata.name}')
MON_ADDR=$(kubectl -n openshift-storage get pod $MON_POD -o jsonpath='{.spec.containers[0].env[?(@.name=="ROOK_CEPH_MON_HOST")].value}' | sed 's/\[//;s/\]//')

# Run any ceph command via the mon pod
kubectl -n openshift-storage exec $MON_POD -c mon -- \
  ceph -m $MON_ADDR --keyring /etc/ceph/keyring-store/keyring status
```

Useful ceph commands to run this way:
- `status` -- overall cluster status
- `osd df` -- per-OSD disk usage
- `osd pool ls detail` -- pool details
- `df` -- pool-level capacity
- `health detail` -- verbose health messages
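
When these subcommands are run repeatedly, it can help to build the full `kubectl exec` invocation in one place. The sketch below is one possible way to do that in Python; the function name `ceph_via_mon` is hypothetical, and the pod name and address in the demo are placeholders for the values of `$MON_POD` and `$MON_ADDR`.

```python
import shlex

KEYRING = "/etc/ceph/keyring-store/keyring"  # keyring path used in the commands above


def ceph_via_mon(mon_pod: str, mon_addr: str, subcommand: str,
                 namespace: str = "openshift-storage") -> list[str]:
    """Build the kubectl argv that runs `ceph <subcommand>` inside the mon pod."""
    return [
        "kubectl", "-n", namespace, "exec", mon_pod, "-c", "mon", "--",
        "ceph", "-m", mon_addr, "--keyring", KEYRING,
        *shlex.split(subcommand),
    ]


if __name__ == "__main__":
    # Placeholder pod name and address, for illustration only.
    for sub in ("status", "osd df", "osd pool ls detail", "df", "health detail"):
        print(shlex.join(ceph_via_mon("rook-ceph-mon-a-example", "10.0.0.1:6789", sub)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with the pod name or address.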

## 3. OSD Status

```bash
# OSD pods
kubectl -n openshift-storage get pods -l app=rook-ceph-osd

# OSD prepare jobs (should be Completed, not stuck)
kubectl -n openshift-storage get pods | grep osd-prepare

# Storage device sets (backing PVCs for OSDs)
kubectl -n openshift-storage get pvc -l app=rook-ceph-osd
```

## 4. CSI Provisioner Pods

PVC provisioning is handled by CSI driver pods. If these are unhealthy, no volumes can be created.

```bash
# RBD CSI controller (provisions rbd volumes)
kubectl -n openshift-storage get pods | grep 'rbd.*ctrlplugin'

# CephFS CSI controller (provisions cephfs volumes)
kubectl -n openshift-storage get pods | grep 'cephfs.*ctrlplugin'

# RBD node plugins (mount volumes on nodes)
kubectl -n openshift-storage get pods | grep 'rbd.*nodeplugin'

# Check for CSI provisioner errors in logs
kubectl -n openshift-storage logs <rbd-ctrlplugin-pod> -c csi-rbdplugin --tail=50
```
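
To see at a glance whether any CSI role has no pod at all, the pod names can be grouped by role. This is a rough Python sketch that reuses the same patterns as the grep commands above; the role labels and the sample pod names are made up for illustration.

```python
import re

# Patterns mirror the grep expressions above; the role labels are illustrative.
ROLE_PATTERNS = {
    "rbd-controller": re.compile(r"rbd.*ctrlplugin"),
    "cephfs-controller": re.compile(r"cephfs.*ctrlplugin"),
    "rbd-node": re.compile(r"rbd.*nodeplugin"),
}


def group_csi_pods(pod_names):
    """Group pod names by CSI role; a role with no matching pod maps to an empty list."""
    groups = {role: [] for role in ROLE_PATTERNS}
    for name in pod_names:
        for role, pattern in ROLE_PATTERNS.items():
            if pattern.search(name):
                groups[role].append(name)
    return groups


if __name__ == "__main__":
    # Made-up pod names, for illustration only.
    sample = ["example.rbd.csi-ctrlplugin-abc", "example.rbd.csi-nodeplugin-xyz", "rook-ceph-mon-a"]
    for role, pods in group_csi_pods(sample).items():
        print(role, pods if pods else "MISSING")
```

An empty list for a controller role would suggest that provisioning for that volume type cannot proceed.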

## 5. PVC and PV Diagnosis

```bash
# Find stuck PVCs (filtered client-side; PVCs do not support a status.phase field selector)
kubectl get pvc --all-namespaces --no-headers | awk '$3 == "Pending"'

# Describe a pending PVC to see provisioning errors
kubectl describe pvc <pvc-name> -n <namespace>

# Find Released PVs (consume space but no longer bound to a PVC)
# Filtered client-side; PVs do not support a status.phase field selector
kubectl get pv --no-headers | awk '$5 == "Released"'

# Check StorageClasses
kubectl get storageclass
```
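
Filtering by phase can also be done on the JSON output, which avoids depending on column positions. The following is a minimal Python sketch, assuming input shaped like `kubectl get pvc --all-namespaces -o json`; the function name `pvcs_in_phase` and the sample data are hypothetical.

```python
import json


def pvcs_in_phase(pvc_list: dict, phase: str = "Pending") -> list[str]:
    """Return "namespace/name" for every PVC whose status.phase equals `phase`."""
    return [
        f'{item["metadata"]["namespace"]}/{item["metadata"]["name"]}'
        for item in pvc_list.get("items", [])
        if item.get("status", {}).get("phase") == phase
    ]


if __name__ == "__main__":
    # Made-up sample shaped like `kubectl get pvc --all-namespaces -o json` output.
    sample = json.loads("""
    {"items": [
      {"metadata": {"namespace": "app", "name": "data-0"}, "status": {"phase": "Bound"}},
      {"metadata": {"namespace": "app", "name": "data-1"}, "status": {"phase": "Pending"}}
    ]}
    """)
    print(pvcs_in_phase(sample))
```

In practice the dictionary could be read with `json.load(sys.stdin)` from a piped `kubectl` command.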

## 6. Common Problems and Remediation

### OSDs Full (HEALTH_ERR: full osd(s))

**Symptoms**: PVCs stuck in Pending, provisioning errors with `DeadlineExceeded` or `operation already exists`.

**Diagnosis**:
```bash
kubectl -n openshift-storage get cephcluster -o jsonpath='{.items[*].status.ceph.details}' | python3 -m json.tool
```
Look for `OSD_FULL` and `POOL_FULL` messages.
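
Picking these two codes out of a long details blob can be automated. The Python sketch below assumes that `status.ceph.details` is a mapping from check code to an object with `severity` and `message` fields; that shape is an assumption about the CephCluster status and should be verified against your cluster's actual output. The function name and sample data are hypothetical.

```python
FULL_CODES = ("OSD_FULL", "POOL_FULL")


def full_conditions(details: dict) -> dict:
    """Return the message for each full-related check code present in `details`."""
    found = {}
    for code in FULL_CODES:
        entry = details.get(code)
        if entry is not None:
            # Assumed shape: {"severity": "...", "message": "..."}; otherwise fall back to str().
            found[code] = entry.get("message", "") if isinstance(entry, dict) else str(entry)
    return found


if __name__ == "__main__":
    # Made-up sample; real messages will differ.
    sample = {
        "OSD_FULL": {"severity": "HEALTH_ERR", "message": "1 full osd(s)"},
        "PG_DEGRADED": {"severity": "HEALTH_WARN", "message": "example"},
    }
    print(full_conditions(sample))
```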

**Remediation**:

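1. **Delete Released PVs** to reclaim space from orphaned volumes. If a PV's reclaim policy is `Retain`, deleting the PV object does not remove the backing Ceph image, so that image must be cleaned up separately:
   ```bash
   # List Released PVs (filtered client-side; PVs do not support a status.phase field selector)
   kubectl get pv --no-headers | awk '$5 == "Released"'
   kubectl delete pv <released-pv-names>
   ```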

2. **Temporarily raise the full ratio** if Ceph is blocking all writes (including deletes):
   ```bash
   # Raise to 0.92 to unblock writes temporarily
   kubectl -n openshift-storage exec $MON_POD -c mon -- \
     ceph -m $MON_ADDR --keyring /etc/ceph/keyring-store/keyring \
     osd set-full-ratio 0.92
   ```
   Once space is freed and health improves, **reset to default**:
   ```bash
   kubectl -n openshift-storage exec $MON_POD -c mon -- \
     ceph -m $MON_ADDR --keyring /etc/ceph/keyring-store/keyring \
     osd set-full-ratio 0.85
   ```

3. **Add more storage** by expanding OSD count or disk size if cleanup is insufficient.

### OSDs Nearfull / Backfillfull (HEALTH_WARN)

**Symptoms**: Cluster functional but approaching full. Warnings about `nearfull` or `backfillfull` OSDs.

**Remediation**:
- Clean up unused PVCs and Released PVs
- Delete completed migration data no longer needed
- Plan capacity expansion before reaching full threshold (85%)

### Degraded PGs

**Symptoms**: `HEALTH_WARN` with messages about degraded or undersized placement groups.

**Diagnosis**:
```bash
# Via mon pod:
ceph health detail
ceph pg stat
```

**Remediation**:
- If an OSD is down, check the OSD pod and its node
- If a node is down, Ceph will self-heal once the node returns
- If an OSD is permanently lost, Ceph will rebalance automatically (may take time)

### CSI Provisioner Not Responding

**Symptoms**: PVC events show "waiting for external provisioner" with no accompanying `ProvisioningFailed` errors.

**Diagnosis**:
```bash
kubectl -n openshift-storage get pods | grep ctrlplugin
kubectl -n openshift-storage logs <rbd-ctrlplugin-pod> -c csi-rbdplugin --tail=100
```

**Remediation**:
- Restart the CSI controller pod if it's stuck
- Check if the Ceph cluster is reachable from the CSI pod
- Verify the StorageClass references a valid pool and secret

### Pools Full but OSDs Not Full

**Symptoms**: `POOL_FULL` warning but individual OSDs have space.

**Diagnosis**:
```bash
# Via mon pod:
ceph osd pool ls detail
ceph df detail
```

**Remediation**:
- A pool may have a quota set -- check and raise it
- Rebalance may be needed if data is unevenly distributed

## 7. Operator Health

```bash
# OCS/ODF operator pods
kubectl -n openshift-storage get pods | grep -E 'ocs-operator|odf-operator|rook-ceph-operator'

# Rook operator logs (manages Ceph cluster lifecycle)
kubectl -n openshift-storage logs deployment/rook-ceph-operator --tail=50

# Check for CrashLoopBackOff or restarts (top 10 pods by restart count of the first container)
kubectl -n openshift-storage get pods --no-headers -o custom-columns=\
'NAME:.metadata.name,STATUS:.status.phase,RESTARTS:.status.containerStatuses[0].restartCount' \
  | sort -k3,3nr | head -10
```
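
The restart check above reads only the first container of each pod. A possible alternative, sketched here in Python under the assumption that the input is shaped like `kubectl -n openshift-storage get pods -o json`, sums restarts across all containers; the function name `top_restarts` and the sample data are hypothetical.

```python
def top_restarts(pod_list: dict, limit: int = 10) -> list[tuple[str, int]]:
    """Return up to `limit` (pod name, total restarts) pairs, highest count first."""
    totals = []
    for pod in pod_list.get("items", []):
        # containerStatuses is absent for pods that have not started any container yet.
        statuses = pod.get("status", {}).get("containerStatuses") or []
        total = sum(s.get("restartCount", 0) for s in statuses)
        totals.append((pod["metadata"]["name"], total))
    return sorted(totals, key=lambda pair: pair[1], reverse=True)[:limit]


if __name__ == "__main__":
    # Made-up sample shaped like `kubectl get pods -o json` output.
    sample = {"items": [
        {"metadata": {"name": "pod-a"}, "status": {"containerStatuses": [{"restartCount": 1}]}},
        {"metadata": {"name": "pod-b"},
         "status": {"containerStatuses": [{"restartCount": 0}, {"restartCount": 7}]}},
        {"metadata": {"name": "pod-c"}, "status": {}},
    ]}
    for name, restarts in top_restarts(sample):
        print(f"{name}\t{restarts}")
```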

## 8. Preventive Checks

Run these periodically to avoid surprise outages:

```bash
# Capacity usage percentage
kubectl -n openshift-storage get cephcluster -o jsonpath='{.items[*].status.ceph.capacity}' | \
  python3 -c "import json,sys; d=json.load(sys.stdin); pct=d['bytesUsed']/d['bytesTotal']*100; print(f'Used: {pct:.1f}%  ({d[\"bytesUsed\"]//2**30} GiB / {d[\"bytesTotal\"]//2**30} GiB)')"

# Released PVs consuming space (filtered client-side; status.phase is not a supported field selector for PVs)
kubectl get pv --no-headers | awk '$5 == "Released"' | wc -l

# PVCs stuck in Pending (same client-side filtering for PVCs)
kubectl get pvc --all-namespaces --no-headers | awk '$3 == "Pending"' | wc -l
```

Act when usage exceeds 70% -- start cleaning up or expanding capacity before hitting the 85% full threshold.
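
The two thresholds can be turned into a simple classification. This is a minimal Python sketch using the `bytesUsed` and `bytesTotal` values from the capacity query; the function name `capacity_status` and the level labels are illustrative.

```python
def capacity_status(bytes_used: int, bytes_total: int,
                    act_pct: float = 70.0, full_pct: float = 85.0) -> tuple[float, str]:
    """Return (percent used, level), where level is "ok", "act", or "full"."""
    if bytes_total <= 0:
        raise ValueError("bytes_total must be positive")
    pct = bytes_used * 100 / bytes_total
    if pct >= full_pct:
        level = "full"  # at or beyond the 85% full threshold
    elif pct > act_pct:
        level = "act"   # above 70%: clean up or expand capacity
    else:
        level = "ok"
    return pct, level


if __name__ == "__main__":
    # Made-up capacity figures: 750 GiB used of 1000 GiB.
    pct, level = capacity_status(750 * 2**30, 1000 * 2**30)
    print(f"Used: {pct:.1f}% -> {level}")
```

The threshold defaults follow the percentages stated above; pass different values if your cluster's ratios have been changed.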
