backend-hang-debug

Diagnose and fix FastAPI hangs caused by blocking ThreadPoolExecutor shutdown in the news stream route; includes py-spy capture and non-blocking executor pattern.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/backend-hang-debug/SKILL.md --create-dirs "https://raw.githubusercontent.com/aiskillstore/marketplace/main/skills/benderfendor/backend-hang-debug/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/backend-hang-debug/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How backend-hang-debug Compares

| Feature / Agent | backend-hang-debug | Standard Approach |
| --- | --- | --- |
| Platform Support | multi | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

Which AI agents support this skill?

This skill is listed as compatible with multiple agents; the installation instructions name Claude Code, Cursor, and Codex.

Where can I find the source code?

The source code is on GitHub in the `aiskillstore/marketplace` repository, under `skills/benderfendor/backend-hang-debug/`.

SKILL.md Source

# Backend Hang Debug

## Purpose
- Detect and resolve event-loop hangs where the FastAPI app stops responding (e.g., `curl http://localhost:8000/` times out) due to synchronous executor shutdown in the SSE news stream.
- Provide a repeatable triage flow using `py-spy` to capture live stacks and pinpoint blocking code.

## Scope
- Backend: `backend/app/api/routes/stream.py` (news stream), `backend/app/services/rss_ingestion.py` (RSS workers), startup processes.
- Tooling: `py-spy` for live stack dumps; `curl` with timeouts for smoke tests.

## Quick Triage
1. **Reproduce hang**: `curl -m 5 http://localhost:8000/` and `curl -m 5 http://localhost:8000/health`; note timeouts.
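2. **Process check**: `ss -tlnp | grep 8000` to confirm the listener; `ls /proc/$(pgrep -f "uvicorn app.main")/fd | wc -l` to rule out an FD leak. The second command assumes `pgrep` matches a single PID; with several workers, substitute each PID in turn.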
3. **Stack capture** (inside the backend venv): `uv pip install py-spy`, then `sudo /home/bender/classwork/Thesis/backend/.venv/bin/py-spy dump --pid $(pgrep -f "uvicorn app.main")` (adjust the path to your own venv, and repeat for each worker PID if multiprocess). Look for `ThreadPoolExecutor.shutdown`, which py-spy lists as a `shutdown` frame in `concurrent/futures/thread.py`, called from `api/routes/stream.py`.
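
The reproduce step above can also be scripted. The following is a minimal sketch using only the standard library; the helper names `probe` and `check_endpoints` are hypothetical, and the base URL and paths are taken from the `curl` commands in step 1.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0):
    """Return the HTTP status code, or None if the server did not answer in time.

    Mirrors `curl -m 5 <url>`: a hung event loop accepts the TCP connection
    but never sends a response, so the read times out.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, so it is not hung.
        return exc.code
    except OSError:
        # Covers URLError, connection refused, and socket timeouts.
        return None


def check_endpoints(base_url: str = "http://localhost:8000", timeout: float = 5.0):
    """Probe the root and health endpoints; map each path to a status or None."""
    return {path: probe(base_url + path, timeout) for path in ("/", "/health")}
```

Calling `check_endpoints()` while the backend is hung should return `None` for both paths. A `None` result does not distinguish a hang from a missing listener, so the process check in step 2 is still needed.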

## Fix Pattern (non-blocking executor)
- Replace synchronous context manager `with ThreadPoolExecutor(...):` inside `event_generator` with a long-lived executor plus explicit **non-blocking** shutdown:
  - Create executor outside the context manager.
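  - On client disconnect, cancel pending futures instead of waiting for the executor to shut down.
  - In `finally`, call `executor.shutdown(wait=False, cancel_futures=True)` (`cancel_futures` requires Python 3.9+).
- Rationale: the context manager's `__exit__` calls `shutdown(wait=True)`, which joins every worker thread. Because `event_generator` runs on the event loop, that join blocks the whole loop for as long as any RSS worker thread is stuck on network I/O.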
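
To make the difference concrete, here is a small standalone timing sketch. It is not code from the project: `slow_worker` is a hypothetical stand-in for an RSS fetch stuck on network I/O, and the two timing helpers are named only for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_worker(seconds: float) -> str:
    """Hypothetical stand-in for an RSS fetch that is stuck on network I/O."""
    time.sleep(seconds)
    return "done"


def time_context_manager_exit(work_seconds: float = 0.5) -> float:
    """Seconds spent when the executor is used as a context manager."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as executor:
        executor.submit(slow_worker, work_seconds)
        # Leaving the block calls executor.shutdown(wait=True),
        # which joins the worker thread before continuing.
    return time.monotonic() - start


def time_nonblocking_shutdown(work_seconds: float = 0.5) -> float:
    """Seconds spent when shutdown is explicit and non-blocking."""
    start = time.monotonic()
    executor = ThreadPoolExecutor(max_workers=1)
    executor.submit(slow_worker, work_seconds)
    # Returns immediately; queued (not yet started) futures are cancelled.
    executor.shutdown(wait=False, cancel_futures=True)  # Python 3.9+
    return time.monotonic() - start


if __name__ == "__main__":
    print(f"context manager exit: {time_context_manager_exit():.2f}s")
    print(f"non-blocking shutdown: {time_nonblocking_shutdown():.2f}s")
```

Note that `cancel_futures=True` only cancels work that has not started. A thread that is already running keeps going until its call returns, which is why the network timeouts in the workers still matter.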

## Implementation Steps
1. **Update stream executor usage** in `backend/app/api/routes/stream.py`:
   - Instantiate `executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)`.
   - Dispatch work via `loop.run_in_executor(executor, _process_source_with_debug, ...)`.
   - On disconnect, `cancel()` pending futures.
   - In `finally`, `executor.shutdown(wait=False, cancel_futures=True)`.
2. **Keep RSS executor as-is** (`rss_ingestion.py`) since it runs in background threads, but ensure request timeouts remain reasonable (currently 60s per RSS `requests.get`).
3. **Retest**:
   - Restart uvicorn; `curl -m 5 http://localhost:8000/health` should respond.
   - Start a stream request and abort the client; server must stay responsive.
   - Re-run `py-spy dump` to verify that the main thread no longer shows a `ThreadPoolExecutor.shutdown` frame.
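
Step 1 can be sketched as follows. This is an illustration under stated assumptions rather than the actual contents of `stream.py`: `fetch_source` is a hypothetical stand-in for `_process_source_with_debug`, the `sources` mapping and the SSE payload format are invented for the example, and FastAPI is left out so the block runs with the standard library alone.

```python
import asyncio
import concurrent.futures
import time
from collections.abc import AsyncIterator


def fetch_source(name: str, delay: float) -> str:
    """Hypothetical blocking worker (stand-in for _process_source_with_debug)."""
    time.sleep(delay)
    return f"{name}: ok"


async def event_generator(sources: dict[str, float]) -> AsyncIterator[str]:
    """Yield one SSE-style message per source as each worker finishes."""
    loop = asyncio.get_running_loop()
    # Created outside a `with` block so that exit never joins the threads.
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=5)
    futures = [
        loop.run_in_executor(executor, fetch_source, name, delay)
        for name, delay in sources.items()
    ]
    try:
        for next_done in asyncio.as_completed(futures):
            result = await next_done
            yield f"data: {result}\n\n"
    finally:
        # Runs on normal completion, on cancellation, and when the consumer
        # closes the generator early (for example after a client disconnect).
        for future in futures:
            future.cancel()
        executor.shutdown(wait=False, cancel_futures=True)  # Python 3.9+
```

Closing the generator early with `await gen.aclose()` simulates a client disconnect: the `finally` block runs and returns without waiting for workers that are still in flight.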

## Verification Checklist
- [ ] `curl -m 5 http://localhost:8000/` returns a response (no hang).
- [ ] `curl -m 5 http://localhost:8000/health` succeeds.
- [ ] Aborting `/news/stream` does **not** freeze subsequent requests.
- [ ] `py-spy dump` shows event loop not blocked on `ThreadPoolExecutor.shutdown`.
- [ ] Frontend no longer stalls waiting on root/health while backend is busy with streams.

## Notes & Future Hardening
- Consider adding request timeout middleware to fail fast on slow handlers.
- Add per-source network timeouts and shorter retries for RSS feeds to reduce long-lived threads.
- If multi-worker uvicorn is used, run `py-spy` on each worker pid when diagnosing hangs.
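
One possible shape for the request timeout middleware mentioned above is a plain ASGI wrapper. This is a sketch, not part of the skill: the class name `TimeoutMiddleware` and its `timeout` and `exempt_prefixes` parameters are hypothetical, and the default exemption for `/news/stream` is an assumption so that long-lived SSE responses are not cut off.

```python
import asyncio


class TimeoutMiddleware:
    """Plain ASGI middleware: reply 504 if a handler exceeds `timeout` seconds."""

    def __init__(self, app, timeout: float = 10.0, exempt_prefixes=("/news/stream",)):
        self.app = app
        self.timeout = timeout
        self.exempt_prefixes = tuple(exempt_prefixes)

    async def __call__(self, scope, receive, send):
        path = scope.get("path", "")
        if scope["type"] != "http" or path.startswith(self.exempt_prefixes):
            # Non-HTTP scopes and streaming routes pass through untouched.
            await self.app(scope, receive, send)
            return

        started = False

        async def tracking_send(message):
            nonlocal started
            if message["type"] == "http.response.start":
                started = True
            await send(message)

        try:
            await asyncio.wait_for(self.app(scope, receive, tracking_send), self.timeout)
        except asyncio.TimeoutError:
            if started:
                # Headers already went out; a clean 504 is no longer possible.
                raise
            await send({
                "type": "http.response.start",
                "status": 504,
                "headers": [(b"content-type", b"text/plain")],
            })
            await send({"type": "http.response.body", "body": b"Gateway Timeout"})
```

With FastAPI this could be registered via `app.add_middleware(TimeoutMiddleware, timeout=10.0)`. It only helps with handlers that are slow but still awaiting: `asyncio.wait_for` cannot interrupt code that blocks the event loop synchronously, so it complements the executor fix rather than replacing it.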