performance-profiling

Performance profiling workflows: CPU profiling (pprof, py-spy, async-profiler, 0x), memory profiling (heap analysis, leak detection), flamegraph interpretation, latency analysis (P50/P99/P99.9), and profiling anti-patterns per language (Go, Python, JVM, Node.js).

by marvinrichter

Best use case

performance-profiling is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using performance-profiling should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/performance-profiling/SKILL.md --create-dirs "https://raw.githubusercontent.com/marvinrichter/clarc/main/skills/performance-profiling/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/performance-profiling/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How performance-profiling Compares

Feature / Agent	performance-profiling	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It covers profiling workflows for CPU, memory, and latency across Go, Python, JVM, and Node.js, from running the profiler to reading the flamegraph.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Performance Profiling Skill

Load testing says "it's slow." Profiling tells you why. This skill covers profiling workflows for CPU, memory, and latency across languages — from running the profiler to reading the flamegraph.

## When to Activate

- A service is slower than its SLO and the cause is unknown
- Post-load-test: numbers are bad, now find the bottleneck
- Reviewing a pull request that touches hot code paths
- Investigating memory growth over time (leak detection)
- Optimizing before a traffic spike

---

## CPU Profiling

### Go — pprof (built-in, zero setup)

```go
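// Add to your HTTP server
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // Registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
    go func() {
        // Bind to localhost only: the pprof endpoints expose process internals
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... rest of your server
}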
```

```bash
# Profile for 30 seconds (service must be running)
# Quote the URL so the shell does not try to glob the "?"
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Inside pprof interactive mode:
(pprof) top10          # Top 10 functions by CPU time
(pprof) top10 -cum     # Top 10 by cumulative (includes callees)
(pprof) web            # Open call graph (SVG) in browser (requires graphviz)
(pprof) list MyFunc    # Annotated source for a specific function

# Web UI with a flamegraph view (no interactive prompt)
go tool pprof -http=:8090 "http://localhost:6060/debug/pprof/profile?seconds=30"
```

```bash
# goroutine stack dump (for concurrency issues)
curl "http://localhost:6060/debug/pprof/goroutine?debug=2"

# Block profile (channel/lock contention)
# Empty unless the program calls runtime.SetBlockProfileRate(n) with n > 0
curl "http://localhost:6060/debug/pprof/block?seconds=10" > block.pprof
go tool pprof block.pprof

# Mutex profile
# Empty unless the program calls runtime.SetMutexProfileFraction(n) with n > 0
curl "http://localhost:6060/debug/pprof/mutex?seconds=10" > mutex.pprof
go tool pprof mutex.pprof
```

### Python — py-spy (sampling, low overhead, no code changes)

```bash
# Install
pip install py-spy

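# Attach to running process (get PID with: ps aux | grep python)
# Attaching may require sudo, depending on the OS and ptrace settings
py-spy top --pid 12345          # Live top-like view
py-spy record --pid 12345 -o profile.svg --duration 30

# Profile from start (py-spy options go before the "--" separator)
py-spy record -o profile.svg -- python myapp.py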

# Open profile.svg in browser — it's an interactive SVG flamegraph
```

```python
# cProfile (built-in, deterministic — higher overhead)
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... code to profile ...
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

# snakeviz — browser-based visualization
# pip install snakeviz
# python -m cProfile -o output.prof myapp.py
# snakeviz output.prof
```
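
The cProfile snippet above leaves the profiled code as a placeholder. The following is a minimal, self-contained sketch of the same workflow; `slow_sum` and `profile_call` are hypothetical names introduced only for illustration, and the report is captured as a string so it can be logged or asserted on.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Hypothetical hot function: sums squares in a Python-level loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args) -> str:
    """Run func under cProfile and return the top entries as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()  # always stop profiling, even if func raises
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
    return buffer.getvalue()


report = profile_call(slow_sum, 200_000)
print(report)
```

Because cProfile is deterministic, the timings include tracing overhead; treat them as relative rankings rather than absolute costs.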

### JVM (Java/Kotlin/Scala) — async-profiler

```bash
# Download async-profiler: pick the archive for your platform from
# https://github.com/async-profiler/async-profiler/releases
# (release file names include the version number), then unpack it
tar xzf async-profiler-*-linux-x64.tar.gz

# Attach to running JVM (get PID with: jps)
# profiler.sh is the 2.x launcher; in 3.x and later use bin/asprof with the same options
./profiler.sh -d 30 -f profile.html 12345
# Open profile.html in browser: interactive flamegraph

# Allocation profiling (memory allocation hot spots)
./profiler.sh -e alloc -d 30 -f alloc.html 12345

# Lock profiling
./profiler.sh -e lock -d 30 -f lock.html 12345
```

```bash
# Java Flight Recorder (JDK built-in, low overhead)
# Start recording
jcmd 12345 JFR.start duration=60s filename=recording.jfr settings=profile

# Open in JDK Mission Control (jmc)
jmc recording.jfr
```

### Node.js — 0x (flamegraph generation)

```bash
# Install
npm install -g 0x

# Profile your app
0x -- node app.js

# Or with the built-in V8 profiler: start node with --prof
node --prof app.js
# ... run load test ...
node --prof-process isolate-*.log > processed.txt

# Better: 0x for flamegraphs (-D sets the output directory, -o opens the result)
0x -o -D profile/ -- node -r ts-node/register src/server.ts
# Opens flamegraph in browser automatically
```

### Linux — perf (system-wide, language-agnostic)

```bash
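# Record CPU events for 30 seconds
# pgrep -d, joins multiple matching PIDs with commas, which perf -p accepts
sudo perf record -g -p "$(pgrep -d, myapp)" -- sleep 30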

# Generate report
sudo perf report

# Flamegraph (requires Brendan Gregg's scripts)
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > perf.svg
```

---

## Memory Profiling

### Go — heap profile

```bash
# Heap profile
curl "http://localhost:6060/debug/pprof/heap" > heap.pprof
go tool pprof heap.pprof

# Inside pprof:
(pprof) top20          # Top allocators (default sample: inuse_space)
(pprof) inuse_objects  # Switch to live object counts (not yet GC'd), then re-run top
(pprof) alloc_objects  # Switch to total allocation counts since start
(pprof) web            # Call graph in browser (requires graphviz)

# Memory escape analysis (build-time)
go build -gcflags="-m=2" ./... 2>&1 | grep "escapes to heap"
# Objects that escape to heap increase GC pressure
```

### Python — tracemalloc + memory-profiler

```python
# tracemalloc (built-in, no overhead until enabled)
import tracemalloc

tracemalloc.start()
# ... code ...
snapshot1 = tracemalloc.take_snapshot()
top_stats = snapshot1.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)

# Compare two snapshots to find leaks
# ... more code (e.g. serve requests for a while) ...
snapshot2 = tracemalloc.take_snapshot()
top_stats = snapshot2.compare_to(snapshot1, 'lineno')
for stat in top_stats[:10]:
    print(stat)
```

```bash
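# memory-profiler: line-by-line memory usage
pip install memory-profiler

# Add the @profile decorator to the function you want to profile, then:
python -m memory_profiler myapp.py   # Line-by-line report for decorated functions

# Memory over time for the whole process:
mprof run python myapp.py
mprof plot  # Shows memory over time graph (requires matplotlib)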
```

```python
# objgraph — find reference cycles and leaks
import objgraph
objgraph.show_most_common_types(20)
objgraph.show_growth()   # Objects that grew since last call
# objgraph.show_refs(obj, max_depth=3)  # Reference graph for an object
```

### Node.js — MemLab / Chrome Memory Snapshot

```bash
# MemLab (Meta — detects memory leaks in browser/Node.js)
npm install -g memlab
memlab run --scenario scenario.js
```

```javascript
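// Heap statistics from V8
const v8 = require('v8');
console.log(v8.getHeapStatistics());

// Take a heap snapshot via the inspector protocol.
// The snapshot arrives in chunks, so write them to a file as they come in.
const fs = require('fs');
const inspector = require('inspector');
const session = new inspector.Session();
const fd = fs.openSync('profile.heapsnapshot', 'w');
session.connect();
session.on('HeapProfiler.addHeapSnapshotChunk', (m) => {
    fs.writeSync(fd, m.params.chunk);
});
session.post('HeapProfiler.takeHeapSnapshot', null, (err, r) => {
    console.log('Snapshot taken', err || '');
    session.disconnect();
    fs.closeSync(fd);
});
// Load profile.heapsnapshot in the Chrome DevTools Memory tab.
// Simpler alternative: v8.writeHeapSnapshot()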
```

---

## Flamegraph Interpretation

A flamegraph visualizes where CPU time is spent:

```
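Bar width → share of samples in which the function was on the stack
            (its own time plus its callees); width says nothing about call count
X-axis → not a timeline; sibling frames are typically sorted alphabetically and merged
Tall stack → deep call chain (depth alone is not a cost)
Flat top (plateau) → frame that was running on-CPU when sampled (self time)
Wide base → root frames; always wide because every sample passes through them
(pprof's web UI draws the graph inverted, with the root at the top)

Reading a flamegraph:
1. Find the widest bars: most samples include them
2. Look at the TOP of wide towers; that is where time is actually consumed
3. Wide, flat tops = CPU-bound work (the code itself is slow)
4. Waiting (I/O, locks, sleep) does not appear in an on-CPU profile at all;
   use a wall-clock or off-CPU profile (e.g. py-spy --idle, async-profiler -e wall)

CPU-bound vs I/O-bound:
- CPU: top of flamegraph is full of compute (string ops, parsing, crypto)
- I/O: in a wall-clock profile, top has many sys_read/epoll_wait/futex frames (waiting, not computing)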
```
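
The difference between bar width (total time) and the top of a tower (self time) can be illustrated with a short sketch. It assumes the folded-stack text format that `stackcollapse-perf.pl` emits (`root;child;leaf count`); the sample data and the `self_and_total` helper are hypothetical.

```python
from collections import Counter

# Hypothetical folded stack samples in "root;child;leaf count" form
FOLDED = """\
main;handle;parse_json 60
main;handle;render 25
main;handle 5
main;gc 10
"""


def self_and_total(folded: str):
    """Return (self, total) sample counts per function name."""
    self_counts, total_counts = Counter(), Counter()
    for line in folded.splitlines():
        stack, count = line.rsplit(" ", 1)
        frames = stack.split(";")
        n = int(count)
        self_counts[frames[-1]] += n   # leaf frame: what was running when sampled
        for name in set(frames):       # set() avoids double-counting recursion
            total_counts[name] += n    # corresponds to bar width in the flamegraph
    return self_counts, total_counts


self_counts, total_counts = self_and_total(FOLDED)
for name, total in total_counts.most_common():
    print(f"{name:12s} total={total:3d} self={self_counts[name]:3d}")
```

In this made-up data `main` and `handle` are the widest bars, yet almost all of the self time sits in `parse_json`, which is the frame worth optimizing.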

---

## Latency Analysis

### Percentiles Matter

```
P50 (median): 50% of requests are faster than this
P95:          95% of requests are faster than this
P99:          99% of requests are faster than this
P99.9:        999 of 1000 requests are faster than this

Why tail latency matters:
- A page load that issues 100 requests has a 1 - 0.99^100 ≈ 63% chance of hitting at least one request slower than P99
- P50 being great doesn't help users who hit P99.9
- SLOs should be on P99, not P50
```
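
The percentile definitions and the fan-out effect can be checked with a few lines of Python. This is a sketch on synthetic data: the latency distribution (2% slow outliers) is an assumption chosen for illustration, `percentile` uses the nearest-rank method, and `prob_hit_tail` assumes requests are independent.

```python
import math
import random


def percentile(sorted_values, p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def prob_hit_tail(p: float, requests: int) -> float:
    """Chance that at least one of `requests` independent calls is slower than percentile p."""
    return 1 - (p / 100) ** requests


random.seed(42)  # fixed seed so the synthetic data is reproducible
# Synthetic latencies in ms: mostly fast, with about 2% slow outliers
latencies = sorted(
    random.gauss(20, 3) if random.random() > 0.02 else random.uniform(200, 500)
    for _ in range(10_000)
)

for p in (50, 95, 99, 99.9):
    print(f"P{p}: {percentile(latencies, p):.1f} ms")

print(f"P(100-request page sees a request above P99) = {prob_hit_tail(99, 100):.1%}")
```

With 100 independent requests the chance of touching the P99 tail is about 63%, which is why a healthy median says little about what users experience on fan-out pages.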

### Common Tail Latency Causes

| Cause | Symptom | Diagnosis |
|-------|---------|-----------|
| GC pauses | Intermittent spikes at P99 | `gc` in flamegraph, GC logs |
| Lock contention | Flat peaks at mutex/futex | mutex profile in pprof, lock profile in async-profiler |
| Thread pool saturation | Queuing delay | Thread pool metrics, queue depth |
| Network jitter | Variation independent of code | Separate network latency measurement |
| Database slow query | Matches DB query timing | DB slow query log |
| Connection pool exhaustion | Spikes correlate with high load | Pool wait time metrics |

---

## Profiling Anti-Patterns

| Anti-Pattern | Problem | Correct Approach |
|--------------|---------|-----------------|
| Profile in development only | Dev load != prod load | Profile under realistic traffic in staging |
| Profile cold start | JIT/warm-up biases results | Run load for 60s before profiling |
| Profile for 1 second | Too short, sampling noise | Profile for 30-60 seconds minimum |
| Profile without load | Idle profiling shows nothing | Run load test simultaneously |
| Only look at P50 | Misses tail latency | Track P99/P99.9 and profile the slow requests |
| Optimize the first hotspot | May not be the bottleneck | Fix biggest, re-profile to confirm |
| Micro-benchmark without warmup | JIT not kicked in | Add warmup iterations before measuring |
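
The warm-up advice in the table can be sketched as a small timing harness. The `measure` helper and the `work` function are hypothetical; the warm-up and run counts are arbitrary defaults, not recommendations from the skill.

```python
import statistics
import time


def measure(func, warmup: int = 100, runs: int = 1000):
    """Time `func` after discarding `warmup` iterations; return per-call seconds."""
    for _ in range(warmup):
        func()  # discarded: lets caches, lazy imports, and (on JIT runtimes) compiled code settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return samples


def work():
    """Hypothetical workload standing in for the code under test."""
    return sorted(range(1000, 0, -1))


samples = measure(work)
q = statistics.quantiles(samples, n=100)  # 99 cut points: q[49] ~ P50, q[98] ~ P99
print(f"P50={q[49] * 1e6:.1f} us  P99={q[98] * 1e6:.1f} us  max={max(samples) * 1e6:.1f} us")
```

Reporting P50 and P99 rather than a single mean keeps the benchmark consistent with the tail-latency focus of the rest of this skill.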

---

## Quick Reference — Tool Selection

| Language | CPU Profiling | Memory Profiling | Recommended |
|----------|---------------|-----------------|-------------|
| Go | pprof (built-in) | pprof heap | pprof — zero setup |
| Python | py-spy | tracemalloc | py-spy — no code changes |
| Java | async-profiler | async-profiler alloc | async-profiler |
| Node.js | 0x / --prof | MemLab, heapdump | 0x for flamegraphs |
| Rust | cargo flamegraph | valgrind/heaptrack | cargo flamegraph |
| C/C++ | perf | valgrind massif | perf + valgrind |
| Linux (any) | perf | valgrind | perf for system-wide |

---

## Reference Commands

- `/profile` — guided profiling workflow for any language
- `load-testing` skill — generate realistic load during profiling session
- `observability` skill — set up continuous performance metrics
- `web-performance` skill — browser/frontend performance profiling
