system-profile

Profile a target (script, process, GPU, memory, interconnect) using external tools and code instrumentation. Produces structured performance reports with actionable recommendations. Use when the user says "profile", "benchmark", or "bottleneck", or wants performance analysis.

5,407 stars

by wanshuiyin

Best use case

system-profile is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using system-profile should expect more consistent output, faster repeated runs, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/system-profile/SKILL.md --create-dirs "https://raw.githubusercontent.com/wanshuiyin/Auto-claude-code-research-in-sleep/main/skills/system-profile/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/system-profile/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How system-profile Compares

Feature / Agent	system-profile	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

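What does this skill do?

It profiles a target (script, process, GPU, memory, or interconnect) using external tools and code instrumentation, then produces a structured performance report with ranked recommendations and a changelog of any instrumentation it added.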

Where can I find the source code?

The source is the SKILL.md file under skills/system-profile/ in the wanshuiyin/Auto-claude-code-research-in-sleep repository on GitHub.

SKILL.md Source

# System Profile

Profile the specified target and summarize the results. Target: $ARGUMENTS

## Instructions

You are a profiling assistant. Based on the user's target, choose appropriate profiling strategies, **including writing instrumentation code when needed**, then run profiling, analyze results, and produce a summary.

### Step 1: Determine the profiling target

Parse `$ARGUMENTS` to understand what to profile. Examples:
- A Python script or module
- A running process (PID or service name)
- A specific function or code block
- An entire framework or system (e.g., "autogen", "vllm serving") — profile its end-to-end execution, identify bottlenecks across components
- "gpu" / "interconnect" / "memory" for focused profiling

If `$ARGUMENTS` is empty or unclear, ask the user.
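
One possible way to sketch this triage is shown below. The function name `classify_target` and the category labels are hypothetical, chosen only for illustration; the skill itself does not prescribe a parser, and real targets usually need a closer look than these string checks.

```python
# Focus keywords taken from the examples above.
FOCUS_KEYWORDS = {"gpu", "interconnect", "memory"}


def classify_target(arguments: str) -> str:
    """Return a coarse category for the target; labels are illustrative only."""
    text = arguments.strip()
    if not text:
        return "unclear"                      # empty input: ask the user
    if text.lower() in FOCUS_KEYWORDS:
        return f"focused:{text.lower()}"      # "gpu" / "interconnect" / "memory"
    if text.isdigit():
        return "process"                      # looks like a PID
    if text.split()[0].endswith(".py"):
        return "python_script"                # script, possibly with arguments
    return "named_target"                     # service, framework, or function name


print(classify_target("train.py --epochs 1"))
```

Anything that lands in `named_target` still needs inspection (is it a service name, a framework, a function?) before a profiling method is chosen.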

### Step 2: Choose profiling methods

Select from external tools and/or code instrumentation as appropriate. Don't limit yourself to the examples below — use whatever makes sense for the target.

**External tools** (check availability first):
- CPU: `cProfile`, `py-spy`, `line_profiler`, `perf stat`, `/usr/bin/time -v`
- Memory: `tracemalloc`, `memory_profiler`, `memray`
- GPU: `nvidia-smi`, `nvidia-smi dmon`, `nvitop`, `torch.profiler`, `nsys`
- Interconnect: `nvidia-smi topo -m`, `nvidia-smi nvlink`, `NCCL_DEBUG=INFO`
- System: `strace -c`, `iostat`, `vmstat`
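
The "check availability first" step could look roughly like the following sketch. It only tests whether an executable is on `PATH` or a Python module is importable; it does not verify versions, permissions, or that a GPU driver is actually working. The helper name `available_tools` is an assumption for illustration.

```python
import importlib.util
import shutil

# Executables and modules drawn from the list above.
CLI_TOOLS = ["py-spy", "perf", "nvidia-smi", "nsys", "strace", "iostat", "vmstat"]
PY_MODULES = ["cProfile", "tracemalloc", "line_profiler", "memory_profiler",
              "memray", "torch"]


def available_tools() -> dict:
    """Map each tool name to True/False depending on whether it can be found."""
    cli = {name: shutil.which(name) is not None for name in CLI_TOOLS}
    mods = {name: importlib.util.find_spec(name) is not None for name in PY_MODULES}
    return {"cli": cli, "python": mods}


report = available_tools()
for group, tools in report.items():
    for name, found in tools.items():
        print(f"{group:6s} {name:16s} {'yes' if found else 'no'}")
```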

**Code instrumentation** — when external tools are insufficient, write and insert profiling code into the target. Typical scenarios:
- Timing specific code blocks (wall time vs CPU time)
- Measuring CPU-GPU or GPU-GPU transfer size, frequency, and bandwidth
- Tracking memory allocation across CPU and GPU to detect redundancy
- Wrapping NCCL collectives to measure latency and throughput
- Adding CUDA event timing around kernels

Design the instrumentation based on what you observe in the code — don't use a fixed template.

### Step 3: Key dimensions to investigate

Depending on the target, focus on some or all of these:

**CPU overhead**
- Context switching (voluntary / involuntary)
- CPU utilization: ratio of CPU time to wall time
- Per-function execution time hotspots
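
As a minimal sketch of the first two items, the standard-library `resource` module (Unix only) exposes context-switch counters and CPU time for the current process. The helper names `snapshot` and `cpu_overhead` are hypothetical.

```python
import resource  # Unix-only standard-library module
import time


def snapshot() -> dict:
    """Capture wall time, CPU time, and context-switch counters for this process."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "wall_s": time.perf_counter(),
        "cpu_s": ru.ru_utime + ru.ru_stime,   # user + system CPU time
        "voluntary_cs": ru.ru_nvcsw,          # process yielded (e.g. blocking I/O)
        "involuntary_cs": ru.ru_nivcsw,       # process was preempted
    }


def cpu_overhead(before: dict, after: dict) -> dict:
    """Difference of two snapshots plus the CPU-time-to-wall-time ratio."""
    delta = {key: after[key] - before[key] for key in before}
    wall = delta["wall_s"]
    delta["cpu_to_wall"] = delta["cpu_s"] / wall if wall > 0 else 0.0
    return delta


before = snapshot()
time.sleep(0.05)                        # waiting: wall time grows, CPU time barely moves
sum(i * i for i in range(200_000))      # computing: CPU time grows with wall time
report = cpu_overhead(before, snapshot())
print(report)
```

A ratio well below 1 suggests the process spends most of its time waiting rather than computing. Per-function hotspots are a separate measurement, for which `cProfile` or `py-spy` from the tool list are the more natural fit.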

**Memory overhead**
- CPU and GPU memory usage (allocated vs reserved vs peak)
- Redundant replication: same data living on both CPU and GPU
- Per-device allocation balance in multi-GPU setups
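
For the CPU side, a rough sketch of "current vs peak" using `tracemalloc` is shown below. It sees only allocations made through Python's allocator, not GPU memory; on a PyTorch target the analogous numbers would come from `torch.cuda.memory_allocated()`, `torch.cuda.memory_reserved()`, and `torch.cuda.max_memory_allocated()`. The helper name `measure_python_heap` is hypothetical.

```python
import tracemalloc


def measure_python_heap(fn, *args, **kwargs):
    """Run fn and report Python-heap bytes still held and the peak during the call."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, {"current_bytes": current, "peak_bytes": peak}


def build_and_discard():
    big = [0] * 1_000_000      # transient list of one million references
    return len(big)            # only a small int survives the call


result, stats = measure_python_heap(build_and_discard)
print(result, stats)
```

A large gap between peak and current, as here, points at transient buffers rather than long-lived data.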

**Interconnect & communication**
- CPU-GPU transfer: frequency, per-transfer size, total volume, bandwidth achieved
- GPU-GPU transfer: P2P bandwidth, NVLink vs PCIe topology impact
- NCCL collectives: operation type, message size distribution, latency
- Communication-to-computation ratio
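
The arithmetic behind these metrics is simple enough to sketch. The `Transfer` record and `summarize` helper below are hypothetical, and the numbers in the usage lines are made up purely to show the calculation; real values would come from instrumentation or tool output.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    kind: str        # illustrative labels, e.g. "h2d", "d2h", "p2p"
    nbytes: int      # bytes moved by this transfer
    seconds: float   # measured duration of this transfer


def summarize(transfers, compute_seconds):
    """Aggregate per-kind frequency, size, volume, and bandwidth, plus comm/compute."""
    by_kind = {}
    for t in transfers:
        s = by_kind.setdefault(t.kind, {"count": 0, "bytes": 0, "seconds": 0.0})
        s["count"] += 1
        s["bytes"] += t.nbytes
        s["seconds"] += t.seconds
    for s in by_kind.values():
        s["avg_bytes"] = s["bytes"] / s["count"]
        s["gb_per_s"] = s["bytes"] / s["seconds"] / 1e9 if s["seconds"] > 0 else 0.0
    comm_seconds = sum(s["seconds"] for s in by_kind.values())
    ratio = comm_seconds / compute_seconds if compute_seconds > 0 else float("inf")
    return by_kind, ratio


# Made-up example: two 1 GB host-to-device copies of 0.5 s each, 4 s of compute.
stats, ratio = summarize(
    [Transfer("h2d", 1_000_000_000, 0.5), Transfer("h2d", 1_000_000_000, 0.5)],
    compute_seconds=4.0,
)
print(stats["h2d"]["gb_per_s"], ratio)
```

Comparing the achieved `gb_per_s` against the link's nominal bandwidth (from `nvidia-smi topo -m` and the hardware specification) shows whether many small transfers are wasting the link.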

**GPU compute**
- SM utilization, kernel launch overhead
- Memory bandwidth utilization vs peak
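
A coarse first look can be taken by polling `nvidia-smi`, as in the sketch below. Note the assumption: `utilization.gpu` and `utilization.memory` report the fraction of time kernels were running or memory was being accessed, which is only a proxy for SM utilization and bandwidth; `nsys` or `torch.profiler` are needed for kernel-level detail. The helper names are hypothetical.

```python
import shutil
import subprocess

QUERY = "index,utilization.gpu,utilization.memory,memory.used,memory.total"


def _to_float(field: str):
    """Convert a CSV field to float; unsupported values such as [N/A] become None."""
    try:
        return float(field)
    except ValueError:
        return None


def parse_gpu_csv(text: str) -> list:
    """Parse output of --format=csv,noheader,nounits for the QUERY fields."""
    rows = []
    for line in text.strip().splitlines():
        idx, util, mem_util, used, total = [f.strip() for f in line.split(",")]
        rows.append({
            "index": int(idx),
            "util_pct": _to_float(util),
            "mem_util_pct": _to_float(mem_util),
            "mem_used_mib": _to_float(used),
            "mem_total_mib": _to_float(total),
        })
    return rows


def sample_gpus() -> list:
    """Return one sample per GPU, or an empty list when nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_csv(out)
```

Calling `sample_gpus()` in a loop while the workload runs gives a time series to set beside the kernel-level traces.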

### Step 4: Instrumentation guidelines

When inserting code into the target:
1. Read and understand the target code first
2. Prefer wrapping (decorator, context manager, standalone runner) over inline edits
3. If inline edits are necessary, mark them clearly (e.g., `# [PROFILE]` comments)
4. Minimize observer effect — don't instrument tight inner loops; sample instead
5. Collect results into a structured log rather than scattering print statements
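
One way these guidelines might combine is a sampling decorator that writes to a structured log, sketched below. The names `profiled`, `PROFILE_LOG`, and `dump_log` are hypothetical, and this is one illustrative shape rather than a template to copy; the right wrapper depends on the target code.

```python
import functools
import json
import time

PROFILE_LOG = []  # [PROFILE] in-memory structured log, one dict per sampled call


def profiled(sample_every: int = 1):
    """Wrap a function and time every Nth call, to limit observer effect."""
    def decorate(fn):
        calls = 0

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % sample_every != 0:
                return fn(*args, **kwargs)        # unsampled call: no timing cost
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                PROFILE_LOG.append({
                    "fn": fn.__qualname__,
                    "call": calls,
                    "wall_s": time.perf_counter() - start,
                })
        return wrapper
    return decorate


def dump_log(path: str) -> None:
    """Write the collected records as JSON Lines for later analysis."""
    with open(path, "w", encoding="utf-8") as f:
        for record in PROFILE_LOG:
            f.write(json.dumps(record) + "\n")


@profiled(sample_every=10)
def step(x):
    return x + 1


for i in range(100):
    step(i)
print(len(PROFILE_LOG))  # 10 sampled calls out of 100
```

Because the wrapper is applied from outside, removing the instrumentation later means deleting one decorator line rather than hunting for inline edits.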

### Step 5: Run profiling

1. Check available tools and hardware topology
2. Run the chosen methods, capture all output
3. Save artifacts (flamegraphs, traces, logs) to `./profile_output/`
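
Steps 2 and 3 could be sketched as a small runner that executes a profiling command and stores its output under the artifact directory. The helper name `run_and_capture` and the file naming scheme are assumptions for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_and_capture(cmd, name, out_dir="./profile_output"):
    """Run cmd, save stdout/stderr under out_dir, and return the exit code."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    proc = subprocess.run(cmd, capture_output=True, text=True)
    (out / f"{name}.stdout.log").write_text(proc.stdout)
    (out / f"{name}.stderr.log").write_text(proc.stderr)
    return proc.returncode


if __name__ == "__main__":
    # Stand-in command; a real run would invoke the profiler on the target instead.
    code = run_and_capture([sys.executable, "-c", "print('hello')"], "smoke_test")
    print("exit code:", code)
```

For a Python script, the command might be `[sys.executable, "-m", "cProfile", "-o", "profile_output/target.prof", "target.py"]`, where `target.py` stands in for the user's script.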

### Step 6: Produce the report

**Part A — Profiling results** (structured tables by dimension, as applicable):
- CPU overhead table
- Memory overhead table (with redundancy column)
- Interconnect table (transfer type / frequency / size / latency / bandwidth)
- Hotspots / bottleneck identification
- Actionable recommendations ranked by expected impact

**Part B — Instrumentation changelog** (MANDATORY):
List every file that was modified or created for profiling purposes:

| File | Change type | What was added/modified | Line(s) |
|------|-------------|------------------------|---------|
| ... | modified | ... | ... |
| ... | created | ... | — |

This allows the user to review and revert all instrumentation changes.
Offer to clean up (remove all instrumentation) when the user is done.
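
If the instrumentation changes are tracked as records while profiling, the Part B table can be generated rather than written by hand. The sketch below assumes a hypothetical record shape (`file`, `change`, `what`, optional `lines`); the file names in the usage lines are invented for illustration.

```python
def render_changelog(entries) -> str:
    """Render instrumentation records as the Part B markdown table."""
    lines = [
        "| File | Change type | What was added/modified | Line(s) |",
        "|------|-------------|------------------------|---------|",
    ]
    for e in entries:
        span = e.get("lines") or "-"     # newly created files have no line span
        lines.append(f"| {e['file']} | {e['change']} | {e['what']} | {span} |")
    return "\n".join(lines)


# Hypothetical records, for illustration only.
table = render_changelog([
    {"file": "train.py", "change": "modified", "what": "timing wrapper", "lines": "10-14"},
    {"file": "prof_runner.py", "change": "created", "what": "standalone runner"},
])
print(table)
```

Keeping the records in one place also gives the cleanup step an exact list of files to revert or delete.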

Related Skills

vast-gpu

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Rent, manage, and destroy GPU instances on vast.ai. Use when user says "rent gpu", "vast.ai", "rent a server", "cloud gpu", or needs on-demand GPU without owning hardware.

training-check

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.

serverless-modal

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Run GPU workloads on Modal — training, fine-tuning, inference, batch processing. Zero-config serverless: no SSH, no Docker, auto scale-to-zero. Use when user says "modal run", "modal training", "modal inference", "deploy to modal", "need a GPU", "run on modal", "serverless GPU", or needs remote GPU compute.

semantic-scholar

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Search published venue papers (IEEE, ACM, Springer, etc.) via Semantic Scholar API. Complements /arxiv (preprints) with citation counts, venue metadata, and TLDR. Use when user says "search semantic scholar", "find IEEE papers", "find journal papers", "venue papers", "citation search", or wants published literature beyond arXiv preprints.

run-experiment

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Deploy and run ML experiments on local, remote, Vast.ai, or Modal serverless GPU. Use when user says "run experiment", "deploy to server", "跑实验" ("run an experiment"), or needs to launch training jobs.

result-to-claim

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Use when experiments complete to judge what claims the results support, what they don't, and what evidence is still missing. Codex MCP evaluates results against intended claims and routes to next action (pivot, supplement, or confirm). Use after experiments finish — before writing the paper or running ablations.

research-review

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Get a deep critical review of research from GPT via Codex MCP. Use when user says "review my research", "help me review", "get external review", or wants critical feedback on research ideas, papers, or experimental results.

research-refine

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Turn a vague research direction into a problem-anchored, elegant, frontier-aware, implementation-oriented method plan via iterative GPT-5.4 review. Use when the user says "refine my approach", "帮我细化方案" ("help me refine the plan"), "decompose this problem", "打磨idea" ("polish the idea"), "refine research plan", "细化研究方案" ("refine the research plan"), or wants a concrete research method that stays simple, focused, and top-venue ready instead of a vague or overbuilt idea.

research-refine-pipeline

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Run an end-to-end workflow that chains `research-refine` and `experiment-plan`. Use when the user wants a one-shot pipeline from vague research direction to focused final proposal plus detailed experiment roadmap, or asks to "串起来" ("chain them together"), build a pipeline, do it end-to-end, or generate both the method and experiment plan together.

research-pipeline

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Full research pipeline: Workflow 1 (idea discovery) → implementation → Workflow 2 (auto review loop). Goes from a broad research direction all the way to a submission-ready paper. Use when user says "全流程" ("full pipeline"), "full pipeline", "从找idea到投稿" ("from finding an idea to submission"), "end-to-end research", or wants the complete autonomous research lifecycle.

research-lit

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Search and analyze research papers, find related work, summarize key ideas. Use when user says "find papers", "related work", "literature review", "what does this paper say", or needs to understand academic papers.

rebuttal

5407

from wanshuiyin/Auto-claude-code-research-in-sleep

Workflow 4: Submission rebuttal pipeline. Parses external reviews, enforces coverage and grounding, drafts a safe text-only rebuttal under venue limits, and manages follow-up rounds. Use when user says "rebuttal", "reply to reviewers", "ICML rebuttal", "OpenReview response", or wants to answer external reviews safely.