ln-811-performance-profiler
Profiles runtime performance with CPU, memory, and I/O metrics. Use when measuring bottlenecks before optimization.
Best use case
ln-811-performance-profiler is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using ln-811-performance-profiler should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/ln-811-performance-profiler/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How ln-811-performance-profiler Compares
| Feature / Agent | ln-811-performance-profiler | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Profiles runtime performance with CPU, memory, and I/O metrics. Use when measuring bottlenecks before optimization.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
> **Paths:** File paths (`shared/`, `references/`, `../ln-*`) are relative to skills repo root. If not found at CWD, locate this SKILL.md directory and go up one level for repo root. If `shared/` is missing, fetch files via WebFetch from `https://raw.githubusercontent.com/levnikolaevich/claude-code-skills/master/skills/{path}`.
# ln-811-performance-profiler
**Type:** L3 Worker
**Category:** 8XX Optimization
Runtime profiler that executes the optimization target, measures multiple metrics (CPU, memory, I/O, time), instruments code for per-function breakdown, and produces a standardized performance map from real data.
---
## Overview
| Aspect | Details |
|--------|---------|
| **Input** | Problem statement: target (file/endpoint/pipeline) + observed metric |
| **Output** | Performance map (multi-metric, per-function), suspicion stack, bottleneck classification |
| **Pattern** | Discover test → Baseline run → Static analysis → Deep profile → Performance map → Report |
---
## Workflow
**Phases:** Test Discovery → Baseline Run → Static Analysis → Deep Profile → Performance Map → Report
---
## Phase 0: Test Discovery/Creation
**MANDATORY READ:** Load `shared/references/ci_tool_detection.md` for test framework detection.
**MANDATORY READ:** Load `shared/references/benchmark_generation.md` for auto-generating benchmarks when none exist.
Find or create commands that exercise the optimization target. Two outputs: `test_command` (profiling/measurement) and `e2e_test_command` (functional safety gate).
### Step 1: Discover test_command
| Priority | Method | Action |
|----------|--------|--------|
| 1 | User-provided | User specifies test command or API endpoint |
| 2 | Discover existing E2E test | Grep test files for target entry point (stop at first match) |
| 3 | Create test script | Generate per `shared/references/benchmark_generation.md` to `.hex-skills/optimization/{slug}/profile_test.sh` |
**E2E discovery protocol** (stop at first match):
| Priority | Method | How |
|----------|--------|-----|
| 1 | Route-based search | Grep e2e/integration test files for entry point route |
| 2 | Function-based search | Grep for entry point function name |
| 3 | Module-based search | Grep for import of entry point module |
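
The discovery protocol above can be sketched as a small search routine. This is a minimal illustration, not the skill's actual implementation: the function name `discover_e2e_test`, the `test_*.py` file pattern, and the regexes are assumptions chosen for a Python project.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def discover_e2e_test(test_root: str, route: Optional[str],
                      function: str, module: str) -> Tuple[Optional[str], str]:
    """Return (test_file, source) for the first match, in table priority order."""
    # Priority order mirrors the table: route, then function name, then module import.
    patterns = []
    if route:
        patterns.append(("route", re.escape(route)))
    patterns.append(("function", r"\b" + re.escape(function) + r"\b"))
    patterns.append(("module", r"^\s*(?:from|import)\s+" + re.escape(module) + r"\b"))

    # Assumption: test files follow the pytest naming convention test_*.py.
    files = sorted(Path(test_root).rglob("test_*.py"))
    for source, pattern in patterns:
        regex = re.compile(pattern, re.MULTILINE)
        for path in files:
            if regex.search(path.read_text(encoding="utf-8", errors="ignore")):
                return str(path), source  # stop at first match
    return None, "none"
```

The second element of the returned tuple corresponds to `e2e_test_source` (`route`, `function`, `module`, or `none`).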
**Test creation** (if no existing test found):
| Target Type | Generated Script |
|-------------|-----------------|
| API endpoint | `curl -w "%{time_total}" -o /dev/null -s {endpoint}` |
| Function | Stack-specific benchmark per `shared/references/benchmark_generation.md` |
| Pipeline | Full pipeline invocation with test input |
### Step 2: Discover e2e_test_command
If `test_command` came from E2E discovery (Step 1 priority 2): `e2e_test_command = test_command`.
Otherwise, run E2E discovery protocol again (same 3-priority table) to find a separate functional safety test.
If not found: `e2e_test_command = null`, log: `WARNING: No e2e test covers {entry_point}. Full test suite serves as functional gate.`
### Output
| Field | Description |
|-------|-------------|
| `test_command` | Command for profiling/measurement |
| `e2e_test_command` | Command for functional safety gate (may equal test_command, or null) |
| `e2e_test_source` | Discovery method: user / route / function / module / none |
---
## Phase 1: Baseline Run (Multi-Metric)
Run `test_command` with system-level profiling. Capture simultaneously:
| Metric | How to Capture | When |
|--------|---------------|------|
| Wall time | `time` wrapper or test harness | Always |
| CPU time (user+sys) | `/usr/bin/time -v` or language profiler | Always |
| Memory peak (RSS) | `/usr/bin/time -v` (Max RSS) or `tracemalloc` / `process.memoryUsage()` | Always |
| I/O bytes | `/usr/bin/time -v` (reports file system inputs/outputs as block counts, not bytes) or structured logs | If I/O suspected |
| HTTP round-trips | Count from structured logs or application metrics | If network I/O in call graph |
| GPU utilization | `nvidia-smi --query-gpu` | Only if CUDA/GPU detected in stack |
### Baseline Protocol
| Parameter | Value |
|-----------|-------|
| Runs | 3 |
| Metric | Median |
| Warm-up | 1 discarded run |
| Output | `baseline` — multi-metric snapshot |
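
The baseline protocol (1 discarded warm-up, 3 measured runs, median) could look roughly like this in Python. It is a sketch under assumptions: `run_once` and `measure_baseline` are hypothetical names, the command is passed as an argument list rather than a shell string, and only wall time and child CPU time are captured (memory and I/O would still come from `/usr/bin/time -v` or a profiler).

```python
import os
import statistics
import subprocess
import sys
import time


def run_once(command):
    """Run the command once; return wall time and child CPU time in milliseconds."""
    cpu_before = os.times()
    start = time.perf_counter()
    subprocess.run(command, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    wall_ms = (time.perf_counter() - start) * 1000
    cpu_after = os.times()
    # user + sys CPU time consumed by waited-for child processes
    cpu_ms = ((cpu_after.children_user - cpu_before.children_user)
              + (cpu_after.children_system - cpu_before.children_system)) * 1000
    return {"wall_time_ms": wall_ms, "cpu_time_ms": cpu_ms}


def measure_baseline(command, runs=3, warmup=1):
    """Discard warm-up runs, then report the per-metric median of the measured runs."""
    for _ in range(warmup):
        run_once(command)  # warm-up result is discarded
    samples = [run_once(command) for _ in range(runs)]
    return {key: statistics.median(s[key] for s in samples) for key in samples[0]}


# Example: a trivial stand-in for test_command.
baseline = measure_baseline([sys.executable, "-c", "sum(range(10**5))"])
print(baseline)
```

A real `test_command` given as a string can be split with `shlex.split` before being passed in.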
---
## Phase 2: Static Analysis → Instrumentation Points
**MANDATORY READ:** Load [bottleneck_classification.md](references/bottleneck_classification.md)
Trace call chain from code + build suspicion stack. **Purpose:** guide WHERE to instrument in Phase 3.
### Step 1: Trace Call Chain
Starting from entry point, trace depth-first (max depth 5). At each step, READ the full function body.
**Cross-service tracing:** If `service_topology` is available from coordinator and a step makes an HTTP/gRPC call to another service whose code is accessible:
| Situation | Action |
|-----------|--------|
| HTTP call to service with code in submodule/monorepo | Follow into that service's handler: resolve route → trace handler code (depth resets to 0 for the new service) |
| HTTP call to service without accessible code | Classify as External, record latency estimate |
| gRPC/message queue to known service | Same as HTTP — follow into handler if code accessible |
Record `service: "{service_name}"` on each step to track which service owns it. The performance_map `steps` tree can span multiple services.
**Depth-First Rule:** If code of the called service is accessible — ALWAYS profile INSIDE. NEVER classify an accessible service as "External/slow" without profiling its internals. "Slow" is a symptom, not a diagnosis.
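The tracing rules above (depth-first, max depth 5, depth reset on a service boundary, `service` recorded per step) can be illustrated with a toy call graph. This is a sketch only: the `graph` structure, the `truncated` flag, and the function name are hypothetical, and a real trace reads function bodies rather than a prebuilt dictionary. It also assumes the graph has no cycles.

```python
MAX_DEPTH = 5


def trace_call_chain(graph, entry, service, depth=0, step_id="1"):
    """Depth-first trace.

    `graph` maps a function name to a list of (callee, callee_service) pairs.
    """
    step = {"id": step_id, "function": entry, "service": service, "children": []}
    callees = graph.get(entry, [])
    if depth >= MAX_DEPTH and callees:
        step["truncated"] = True  # note truncation instead of going deeper
        return step
    for index, (callee, callee_service) in enumerate(callees, start=1):
        # Depth resets to 0 when the call crosses into another service.
        next_depth = 0 if callee_service != service else depth + 1
        step["children"].append(
            trace_call_chain(graph, callee, callee_service,
                             next_depth, f"{step_id}.{index}")
        )
    return step


graph = {
    "process_job": [("translate_binary", "api")],
    "translate_binary": [("tikal_extract", "tikal"), ("mt_translate", "mt-engine")],
}
tree = trace_call_chain(graph, "process_job", "api")
```

The step ids produced here (`1`, `1.1`, `1.1.1`, `1.1.2`) follow the numbering used in the performance map.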
**5 Whys for each bottleneck:** Before reporting a bottleneck, chain "why?" until you reach config/architecture level:
1. "What is slow?" → alignment service (5.9s)
2. "Why?" → 6 pairs × ~1s each
3. "Why ~1s per pair?" → O(n²) mwmf computation
4. "Why O(n²)?" → library default, not production config
5. "Why default?" → `matching_methods` not configured → **root cause = config**
### Step 2: Classify & Suspicion Scan
For each step, classify by type (CPU, I/O-DB, I/O-Network, I/O-File, Architecture, External, Cache) and scan for performance concerns.
Suspicion checklist (**minimum, not limitation**):
| Category | What to Look For |
|----------|-----------------|
| Connection management | Client created per-request? Missing pooling? Missing reuse? |
| Data flow | Data read multiple times? Over-fetching? Unnecessary transforms? |
| Async patterns | Sync I/O in async context? Sequential awaits without data dependency? |
| Resource lifecycle | Unclosed connections? Temp files? Memory accumulation in loop? |
| Configuration | Hardcoded timeouts? Default pool sizes? Missing batch size config? |
| Redundant work | Same validation at multiple layers? Same data loaded twice? |
| Architecture | N+1 in loop? Batch API unused? Cache infra unused? Sequential-when-parallel? |
| *(open)* | Anything else spotted — checklist does not limit findings |
### Step 2b: Suspicion Deduplication
**MANDATORY READ:** Load `shared/references/output_normalization.md`
After generating suspicions across all call chain steps, normalize and deduplicate per §1-§2:
- Normalize suspicion descriptions (replace specific values with placeholders)
- Group identical suspicions across different steps → merge into single entry with `affected_steps: [list]`
- Example: "Missing connection pooling" found in steps 1.1, 1.2, 1.3 → one suspicion with `affected_steps: ["1.1", "1.2", "1.3"]`
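A minimal sketch of the normalize-and-merge step follows. The actual normalization rules live in `shared/references/output_normalization.md`, which is not reproduced here, so the placeholder scheme (`{n}` for numbers, `{name}` for quoted values) and the input field names are assumptions.

```python
import re


def normalize(description):
    """Replace specific values (quoted names, numbers) with placeholders."""
    text = re.sub(r"`[^`]*`|'[^']*'|\"[^\"]*\"", "{name}", description)
    text = re.sub(r"\d+(?:\.\d+)?", "{n}", text)
    return " ".join(text.lower().split())


def deduplicate(suspicions):
    """Merge suspicions that normalize to the same text into one entry."""
    merged = {}
    for s in suspicions:
        key = (s["category"], normalize(s["description"]))
        entry = merged.setdefault(key, {
            "category": s["category"],
            "description": s["description"],
            "affected_steps": [],
        })
        if s["step"] not in entry["affected_steps"]:
            entry["affected_steps"].append(s["step"])
    return list(merged.values())


found = [
    {"step": "1.1", "category": "Connection management", "description": "Missing connection pooling"},
    {"step": "1.2", "category": "Connection management", "description": "Missing connection pooling"},
    {"step": "1.3", "category": "Connection management", "description": "Missing connection pooling"},
    {"step": "1.2", "category": "Configuration", "description": "Hardcoded timeout of 30s"},
    {"step": "1.4", "category": "Configuration", "description": "Hardcoded timeout of 5s"},
]
result = deduplicate(found)
```

Here the three pooling findings collapse into one entry, and the two timeout findings merge because their descriptions differ only in the numeric value.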
### Step 3: Verify & Map to Instrumentation Points
```
FOR each suspicion:
  1. VERIFY: follow code to confirm or dismiss
  2. VERDICT: CONFIRMED → map to instrumentation point | DISMISSED → log reason
  3. For each CONFIRMED suspicion, identify:
     - function to wrap with timing
     - I/O call to count
     - memory allocation to track
```
### Profiler Selection (per stack)
| Stack | Non-invasive profiler | Invasive (if non-invasive insufficient) |
|-------|----------------------|----------------------------------------|
| Python | `py-spy`, `cProfile` | `time.perf_counter()` decorators |
| Node.js | `clinic`, `--prof` | `console.time()` wrappers |
| Go | `pprof` (built-in) | Usually not needed |
| .NET | `dotnet-trace` | `Stopwatch` wrappers |
| Rust | `cargo flamegraph` | `std::time::Instant` |
**Stack detection:** per `shared/references/ci_tool_detection.md`.
---
## Phase 3: Deep Profile
### Profiler Hierarchy (escalate as needed)
| Level | Tool Examples | What It Shows | When to Use |
|-------|--------------|---------------|-------------|
| 1 | `py-spy`, `cProfile`, `pprof`, `dotnet-trace` | Function-level hotspots | Always — first pass |
| 2 | `line_profiler`, per-line timing | Line-level timing in hotspot function | Hotspot function found but cause unclear |
| 3 | `tracemalloc`, `memory_profiler` | Per-line memory allocation | Memory metrics abnormal in baseline |
### Step 1: Non-Invasive Profiling (preferred)
Run `test_command` with Level 1 profiler to get per-function breakdown without code changes.
### Step 2: Escalation Decision
After Level 1 profiler run, evaluate result against suspicion stack from Phase 2:
| Profiler Result | Action |
|-----------------|--------|
| Hotspot function identified, time breakdown confirms suspicions | DONE — proceed to Phase 4 |
| Hotspot identified but internal cause unclear (CPU vs I/O inside one function) | Escalate to Level 2 (line-level timing) |
| Memory baseline abnormal (peak or delta) | Escalate to Level 3 (memory profiler) |
| Multiple suspicions unresolved — profiler granularity insufficient | Go to Step 3 (targeted instrumentation) |
| Profiler unavailable or overhead > 20% of wall time | Go to Step 3 (targeted instrumentation) |
### Stop Conditions (Profiler Escalation)
| Condition | Action |
|-----------|--------|
| Hotspot identified with clear cause | STOP — proceed to Performance Map |
| All 3 profiler levels exhausted | STOP — build map from best available data |
| Instrumentation breaks tests | STOP — revert instrumentation, use non-invasive data only |
| Profiler overhead > 20% of wall time | STOP — skip to targeted instrumentation |
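
The escalation and stop tables can be read as one decision function. The sketch below is only one possible encoding: the tables do not state a precedence between rows, so the check order, the `Level1Result` fields, the returned action labels, and the final fallback are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Level1Result:
    hotspot_found: bool
    cause_clear: bool
    memory_abnormal: bool
    unresolved_suspicions: int
    profiler_available: bool = True
    overhead_pct: float = 0.0  # profiler overhead as a share of wall time


def next_action(r: Level1Result) -> str:
    """Map a Level 1 profiler outcome to the next step (assumed precedence)."""
    if not r.profiler_available or r.overhead_pct > 20:
        return "targeted_instrumentation"
    if r.memory_abnormal:
        return "escalate_level_3"
    if r.hotspot_found and not r.cause_clear:
        return "escalate_level_2"
    if r.unresolved_suspicions > 1:
        return "targeted_instrumentation"
    if r.hotspot_found and r.cause_clear:
        return "proceed_to_phase_4"
    # Assumed fallback: no hotspot surfaced, so measure the suspected points directly.
    return "targeted_instrumentation"
```

For example, `next_action(Level1Result(True, True, False, 0))` yields `"proceed_to_phase_4"`.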
### Step 3: Targeted Instrumentation (proactive)
Add timing/logging along the call stack at instrumentation points identified in Phase 2 Step 3:
```
1. FOR each CONFIRMED suspicion without measured data:
     Add timing wrapper around target function/I/O call
     Add counter for I/O round-trips if network/DB suspected
     (cross-service: instrument in the correct service's codebase)
2. Re-run test_command (3 runs, median)
3. Collect per-function measurements from logs
4. Record list of instrumented files (may span multiple services)
```
| Instrumentation Type | When | Example |
|---------------------|------|---------|
| Timing wrapper | Always for unresolved suspicions | `time.perf_counter()` around function call |
| I/O call counter | Network or DB bottleneck suspected | Count HTTP requests, DB queries in loop |
| Memory snapshot | Memory accumulation suspected | `tracemalloc.get_traced_memory()` before/after |
**KEEP instrumentation in place.** The executor reuses it for post-optimization per-function comparison, then cleans up after strike. Report `instrumented_files` in output.
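For a Python target, the three instrumentation types in the table might be sketched as follows. The names `timed`, `MEASUREMENTS`, and `memory_delta_bytes` are hypothetical helpers, not part of the skill; only `time.perf_counter()` and `tracemalloc.get_traced_memory()` come from the table above.

```python
import functools
import time
import tracemalloc
from collections import defaultdict

# Per-function call counts and accumulated wall time, keyed by function name.
MEASUREMENTS = defaultdict(lambda: {"calls": 0, "wall_time_ms": 0.0})


def timed(func):
    """Timing wrapper that also counts calls (useful as an I/O call counter)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            record = MEASUREMENTS[func.__name__]
            record["calls"] += 1
            record["wall_time_ms"] += (time.perf_counter() - start) * 1000
    return wrapper


def memory_delta_bytes(func, *args, **kwargs):
    """Run func under tracemalloc; return (result, growth of traced memory in bytes)."""
    tracemalloc.start()
    before, _peak = tracemalloc.get_traced_memory()
    result = func(*args, **kwargs)
    after, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, after - before


@timed
def fetch(n):
    # Stand-in for an HTTP request or DB query inside a loop.
    return list(range(n))


for _ in range(3):
    fetch(10)
```

After the loop, `MEASUREMENTS["fetch"]["calls"]` holds the number of round-trips, which is the figure that would go into `http_round_trips` for a real network call.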
---
## Phase 4: Build Performance Map
Standardized format — feeds into `.hex-skills/optimization/{slug}/context.md` for downstream consumption.
```yaml
performance_map:
  test_command: "uv run pytest tests/automated/e2e/test_example.py -s"
  baseline:
    wall_time_ms: 7280
    cpu_time_ms: 850
    memory_peak_mb: 256
    memory_delta_mb: 45
    io_read_bytes: 1200000
    io_write_bytes: 500000
    http_round_trips: 13
  steps:  # service field present only in multi-service topology
    - id: "1"
      function: "process_job"
      location: "app/services/job_processor.py:45"
      service: "api"  # optional: which service owns this step
      wall_time_ms: 7200
      time_share_pct: 99
      type: "function_call"
      children:
        - id: "1.1"
          function: "translate_binary"
          wall_time_ms: 7100
          type: "function_call"
          children:
            - id: "1.1.1"
              function: "tikal_extract"
              service: "tikal"  # cross-service: code traced into submodule
              wall_time_ms: 2800
              type: "http_call"
              http_round_trips: 1
            - id: "1.1.2"
              function: "mt_translate"
              service: "mt-engine"
              wall_time_ms: 3500
              type: "http_call"
              http_round_trips: 13
  bottleneck_classification: "I/O-Network"
  bottleneck_detail: "13 sequential HTTP calls to MT service (3500ms)"
  top_bottlenecks:
    - {step: "1.1.2", type: "I/O-Network", share: "48%"}
    - {step: "1.1.1", type: "I/O-Network", share: "38%"}
```
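The `top_bottlenecks` shares in the example are consistent with dividing each leaf step's wall time by the baseline wall time (3500 / 7280 and 2800 / 7280). The sketch below reproduces that arithmetic; ranking leaf steps only, and the function names, are assumptions inferred from the example rather than rules stated by the skill.

```python
def leaf_steps(steps):
    """Yield steps that have no children (where the time is actually spent)."""
    for step in steps:
        children = step.get("children", [])
        if children:
            yield from leaf_steps(children)
        else:
            yield step


def top_bottlenecks(performance_map, limit=3):
    """Rank leaf steps by wall time and express each as a share of the baseline."""
    total = performance_map["baseline"]["wall_time_ms"]
    ranked = sorted(leaf_steps(performance_map["steps"]),
                    key=lambda s: s["wall_time_ms"], reverse=True)
    return [
        {"step": s["id"], "share_pct": round(100 * s["wall_time_ms"] / total)}
        for s in ranked[:limit]
    ]


example_map = {
    "baseline": {"wall_time_ms": 7280},
    "steps": [{
        "id": "1", "wall_time_ms": 7200, "children": [{
            "id": "1.1", "wall_time_ms": 7100, "children": [
                {"id": "1.1.1", "wall_time_ms": 2800},
                {"id": "1.1.2", "wall_time_ms": 3500},
            ],
        }],
    }],
}
ranking = top_bottlenecks(example_map)
```

With the figures from the map above, `ranking` lists step `1.1.2` at 48 percent followed by step `1.1.1` at 38 percent.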
---
## Phase 5: Report
### Report Structure
```
profile_result:
  entry_point_info:
    type: <string>                     # "api_endpoint" | "function" | "pipeline"
    location: <string>                 # file:line
    route: <string|null>               # API route (if endpoint)
    function: <string>                 # Entry point function name
  performance_map: <object>            # Full map from Phase 4
  bottleneck_classification: <string>  # Primary bottleneck type
  bottleneck_detail: <string>          # Human-readable description
  top_bottlenecks:
    - step, type, share, description
  optimization_hints:                  # CONFIRMED suspicions only (Phase 2)
    - hint with evidence
  suspicion_stack:                     # Full audit trail (confirmed + dismissed)
    - category: <string>
      location: <string>
      description: <string>
      verdict: <string>                # "confirmed" | "dismissed"
      evidence: <string>
      verification_note: <string>
  e2e_test:
    command: <string|null>             # E2E safety test command (from Phase 0)
    source: <string>                   # user / route / function / module / none
  instrumented_files: [<string>]       # Files with active instrumentation (empty if non-invasive only)
  wrong_tool_indicators: []            # Empty = proceed, non-empty = exit
```
### Wrong Tool Indicators
| Indicator | Condition |
|-----------|-----------|
| `external_service_no_alternative` | 90%+ measured time in external service, no batch/cache/parallel path |
| `within_industry_norm` | Measured time within expected range for operation type |
| `infrastructure_bound` | Bottleneck is hardware (measured via system metrics) |
| `already_optimized` | Code already uses best patterns (confirmed by suspicion scan) |
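
Of the four indicators, only `external_service_no_alternative` has a numeric threshold in the table, so it is the one that can be sketched mechanically. The `classification` field on each step and the `has_alternative_path` flag are hypothetical inputs; the other three indicators depend on judgment and reference data not shown here.

```python
def evaluate_wrong_tool_indicators(steps, total_ms, has_alternative_path):
    """Return triggered indicators; an empty list means 'proceed'."""
    indicators = []
    external_ms = sum(s["wall_time_ms"] for s in steps
                      if s.get("classification") == "External")
    # 90%+ of measured time in an external service, with no batch/cache/parallel path.
    if total_ms > 0 and external_ms / total_ms >= 0.90 and not has_alternative_path:
        indicators.append("external_service_no_alternative")
    return indicators


measured = [
    {"id": "1.1", "wall_time_ms": 9500, "classification": "External"},
    {"id": "1.2", "wall_time_ms": 500, "classification": "CPU"},
]
indicators = evaluate_wrong_tool_indicators(measured, 10000, has_alternative_path=False)
```

A non-empty result would populate `wrong_tool_indicators` in the report and signal the caller to exit rather than optimize.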
---
## Error Handling
| Error | Recovery |
|-------|----------|
| Cannot resolve entry point | Block: "file/function not found at {path}" |
| Test command fails on unmodified code | Block: "test fails before profiling — fix test first" |
| Profiler not available for stack | Fall back to invasive instrumentation (Phase 3 Step 3) |
| Instrumentation breaks tests | Revert immediately: `git checkout -- .` |
| Call chain too deep (> 5 levels) | Stop at depth 5, note truncation |
| Cannot classify step type | Default to "Unknown", use measured time |
| No I/O detected (pure CPU) | Classify as CPU, focus on algorithm profiling |
---
## References
- [bottleneck_classification.md](references/bottleneck_classification.md) — classification taxonomy
- [latency_estimation.md](references/latency_estimation.md) — latency heuristics (fallback for static-only mode)
- `shared/references/ci_tool_detection.md` — stack/tool detection
- `shared/references/benchmark_generation.md` — benchmark templates per stack
---
## Runtime Summary Artifact
**MANDATORY READ:** Load `shared/references/coordinator_summary_contract.md`
Write `.hex-skills/runtime-artifacts/runs/{run_id}/optimization-profile/{slug}.json` before finishing.
## Definition of Done
- [ ] Test command discovered or created for optimization target
- [ ] E2E safety test discovered (or documented as unavailable)
- [ ] Baseline measured: wall time, CPU, memory (3 runs, median)
- [ ] Call graph traced and function bodies read
- [ ] Suspicion stack built: each suspicion verified and mapped to instrumentation point
- [ ] Deep profile completed (non-invasive preferred, invasive if needed)
- [ ] Instrumented files reported (cleanup deferred to executor)
- [ ] Performance map built in standardized format (real measurements)
- [ ] Top 3 bottlenecks identified from measured data
- [ ] Wrong tool indicators evaluated from real metrics
- [ ] optimization_hints contain only CONFIRMED suspicions with measurement evidence
- [ ] Report prepared with measured findings
- [ ] Optimization profile artifact written to the shared location
---
**Version:** 3.0.0
**Last Updated:** 2026-03-15
Related Skills
ln-810-performance-optimizer
Multi-cycle performance optimization with profiling and bottleneck analysis. Use when optimizing application performance.
ln-653-runtime-performance-auditor
Checks blocking IO in async, unnecessary allocations, sync sleep, string concat in loops, redundant copies. Use when auditing runtime performance.
ln-650-persistence-performance-auditor
Coordinates persistence and performance audit across queries, transactions, runtime, and resource lifecycle. Use when auditing data layer performance.
ln-914-community-responder
Responds to unanswered GitHub discussions and issues with codebase-informed replies. Use when clearing community question backlog.
ln-913-community-debater
Launches RFC and debate discussions on GitHub. Use when proposing changes that need community input or voting.
ln-912-community-announcer
Composes and publishes announcements to GitHub Discussions. Use when sharing releases, updates, or news with the community.
ln-911-github-triager
Produces prioritized triage report from open GitHub issues, PRs, and discussions. Use when reviewing community backlog.
ln-910-community-engagement
Analyzes community health and delegates engagement tasks. Use when managing GitHub issues, discussions, and announcements.
ln-840-benchmark-compare
Runs built-in vs hex-line benchmark with scenario manifests, activation checks, and diff-based correctness. Use when measuring hex-line MCP performance against built-in tools.
ln-832-bundle-optimizer
Reduces JS/TS bundle size via tree-shaking, code splitting, and unused dependency removal. Use when optimizing frontend bundle size.
ln-831-oss-replacer
Replaces custom modules with OSS packages using atomic keep/discard testing. Use when migrating custom code to established libraries.
ln-830-code-modernization-coordinator
Modernizes codebase via OSS replacement and bundle optimization. Use when acting on audit findings to reduce custom code.