firecrawl-research-patterns
Programmatic Firecrawl usage, self-hosted operations, academic paper routing, recursive deep research, and raw corpus persistence. TRIGGERS - firecrawl search, firecrawl scrape, academic paper, arxiv, deep research, recursive search, research pattern, corpus persistence, firecrawl, self-hosted scraping, web scrape, scraper wrapper, littleblack, Tailscale scraping.
Best use case
firecrawl-research-patterns is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using firecrawl-research-patterns should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/firecrawl-research-patterns/SKILL.md inside your project
- Restart your AI agent — it will auto-discover the skill
How firecrawl-research-patterns Compares
| Feature / Agent | firecrawl-research-patterns | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Programmatic Firecrawl usage, self-hosted operations, academic paper routing, recursive deep research, and raw corpus persistence. TRIGGERS - firecrawl search, firecrawl scrape, academic paper, arxiv, deep research, recursive search, research pattern, corpus persistence, firecrawl, self-hosted scraping, web scrape, scraper wrapper, littleblack, Tailscale scraping.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Firecrawl Research Patterns
Programmatic patterns for using self-hosted Firecrawl in research workflows — search, scrape, route academic papers, run recursive deep research, and persist raw results for future re-analysis. Also covers self-hosted deployment, health checks, and recovery.
For archiving AI chat conversations (ChatGPT/Gemini shares), see `Skill(gh-tools:research-archival)`.
---
> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
## FIRST — TodoWrite Task Templates
**MANDATORY**: Select and load the appropriate template before any research work.
### Template A — Single Firecrawl Search + Persist
```
1. Health check — GET http://littleblack:3002/v1/health (fallback: test search)
2. Execute search — POST /v1/search with query, limit, scrapeOptions
3. Persist raw results — save each result page to docs/research/corpus/ with frontmatter
4. Update corpus index — append entries to docs/research/corpus-index.jsonl
5. Extract findings — summarize key learnings from raw corpus files
```
### Template B — Academic Paper Retrieval + Persist
```
1. Identify source — classify URL/DOI per academic-paper-routing.md decision tree
2. Route to scraper — arxiv direct HTML, Semantic Scholar API, Firecrawl, or Jina Reader
3. Scrape content — execute fetch with appropriate method and timeout
4. Persist raw result — save to docs/research/corpus/ with academic-specific frontmatter
5. Update corpus index — append entry to corpus-index.jsonl
6. Summarize paper — extract key claims, methods, results from raw corpus file
```
### Template C — Full Recursive Deep Research with Corpus
```
1. Health check — verify Firecrawl reachable at littleblack:3002
2. Initialize parameters — set breadth (default 4), depth (default 2), concurrency (default 2)
3. Generate search queries — LLM generates N queries from topic + prior learnings
4. Execute searches — Firecrawl /v1/search for each query via p-limit(concurrency)
5. Persist raw results — save ALL scraped pages to docs/research/corpus/ with provenance
6. Extract learnings — LLM extracts key findings + follow-up questions per result set
7. Recurse — for each follow-up, recurse with breadth=ceil(breadth/2), depth=depth-1
8. Base case — depth=0, return accumulated learnings
9. Synthesize report — LLM generates final markdown from all learnings
10. Write session report — save to docs/research/sessions/ with corpus file references
11. Update corpus index — append all new entries to corpus-index.jsonl
```
### Template D — Corpus Review / Re-Analysis
```
1. Inventory corpus — read docs/research/corpus-index.jsonl, filter by session/topic/date
2. Read raw files — load matching corpus files from docs/research/corpus/
3. Re-analyze — extract new insights with current context/questions
4. Update session report — amend or create new session report in docs/research/sessions/
```
### Template E — Image-Rich Paper with Inline Figures
Use when paper contains architecture diagrams, result plots, attention maps, or any critical visual content.
```
1. Scrape text — use port 3003 (preferred, preserves absolute image URLs) or Jina fallback
2. Detect figures — scan scraped markdown for ![alt](url) patterns with .png/.jpg/.svg
3. Extract figure URLs — for arXiv: probe https://arxiv.org/html/{id}v{n}/x{N}.png until 404
4. Keep URLs inline — DO NOT rewrite to local relative paths (breaks GitHub rendering)
5. Ensure inline embedding — markdown body must have ![Figure N](absolute-url) for each figure
6. Catalog in frontmatter — add figure_count and figure_urls list (all absolute URLs)
7. Save corpus file — GFM markdown with inline absolute URLs renders on GitHub without hosting
8. Update corpus-index.jsonl — include has_figures: true, figure_count, figure_urls
```
---
## Section 1 — Programmatic Firecrawl Usage
**Instance**: Self-hosted on littleblack. Reachable via Tailscale MagicDNS (`littleblack`, preferred) or legacy ZeroTier IP (`172.25.236.1`, fragile fallback). No API key needed.
### Why `fetch()` Instead of `@mendable/firecrawl-js` SDK
The official SDK uses `jiti` for dynamic imports, which is incompatible with Bun's module resolution. Direct `fetch()` calls are simpler, more reliable, and have zero dependencies.
### Two Endpoints
| Endpoint | Purpose | When to Use |
| ----------------- | --------------------- | ------------------------------------------------- |
| `POST /v1/search` | Search + scrape combo | Research queries — returns multiple scraped pages |
| `POST /v1/scrape` | Single URL scrape | Known URL — extract markdown from one page |
See [api-endpoint-reference.md](./references/api-endpoint-reference.md) for full request/response contracts.
### Quick Examples
**Search** (returns multiple results with markdown):
```typescript
const res = await fetch("http://littleblack:3002/v1/search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    query: "mixture of experts scaling laws",
    limit: 5,
    scrapeOptions: { formats: ["markdown"] },
  }),
});
const { data } = await res.json(); // data: [{ url, markdown, metadata }]
```
**Scrape** (single URL):
```typescript
const res = await fetch("http://littleblack:3002/v1/scrape", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    url: "https://arxiv.org/abs/2401.12345",
    formats: ["markdown"],
    waitFor: 3000, // ms — for JS-heavy pages
  }),
});
const { data } = await res.json(); // data: { markdown, metadata }
```
### Error Handling
```typescript
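// Always set a timeout so an unreachable page cannot hang the session.
// `firecrawlFetch` is an illustrative wrapper name, not part of Firecrawl.
async function firecrawlFetch(url: string, opts: RequestInit = {}): Promise<unknown> {
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), 15_000);
  try {
    const res = await fetch(url, { ...opts, signal: controller.signal });
    if (!res.ok) throw new Error(`Firecrawl: ${res.status} ${res.statusText}`);
    const json = await res.json();
    if (!json.data || (Array.isArray(json.data) && json.data.length === 0)) {
      // Empty results are not an error, but there is no content to process
      return null;
    }
    return json.data;
  } finally {
    // Runs on success, HTTP error, and abort alike
    clearTimeout(timeoutId);
  }
}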
```
### Health Check
```typescript
// Quick health check before starting a research session
const res = await fetch("http://littleblack:3002/v1/health");
if (!res.ok)
  throw new Error(
    "Firecrawl unhealthy — see self-hosted-operations.md and self-hosted-troubleshooting.md references",
  );
```
---
## Section 2 — Academic Paper Routing
Route paper retrieval to the most effective method based on source. Full decision tree in [academic-paper-routing.md](./references/academic-paper-routing.md).
### Quick Reference
| Source | Best Method | Fallback |
| ----------------- | ------------------------------------- | ------------------------- |
| arxiv.org | Direct HTML (`/html/ID`) | Firecrawl `/v1/scrape` |
| Semantic Scholar | API (`api.semanticscholar.org`) | Firecrawl search by title |
| ACL Anthology | Firecrawl `/v1/scrape` | Direct PDF download |
| NeurIPS/ICML/ICLR | Firecrawl `/v1/scrape` with `waitFor` | Search by title |
| IEEE Xplore | Firecrawl with `waitFor: 3000` | Author's website |
| ACM DL | Firecrawl with `waitFor: 3000` | Author's website |
| Author blogs | Jina Reader (`r.jina.ai`) | Firecrawl `/v1/scrape` |
| Google Scholar | Firecrawl `/v1/search` | Direct search query |
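The table above can be read as a host-to-method lookup. The following is a minimal sketch of that lookup, not the decision tree from academic-paper-routing.md: the route labels, the `routePaperUrl` name, the conference host names, and the choice to treat unlisted hosts as author blogs are all assumptions made for illustration.

```typescript
// Route labels are hypothetical; only the host-to-method pairing follows the table.
type Route = { primary: string; fallback: string };

const ROUTES: Array<{ hostPattern: RegExp; route: Route }> = [
  { hostPattern: /(^|\.)arxiv\.org$/, route: { primary: "arxiv-direct-html", fallback: "firecrawl-scrape" } },
  { hostPattern: /(^|\.)semanticscholar\.org$/, route: { primary: "semantic-scholar-api", fallback: "firecrawl-search-by-title" } },
  { hostPattern: /(^|\.)aclanthology\.org$/, route: { primary: "firecrawl-scrape", fallback: "direct-pdf-download" } },
  // Assumed hosts for NeurIPS / ICML / ICLR proceedings.
  { hostPattern: /(^|\.)(proceedings\.neurips\.cc|proceedings\.mlr\.press|openreview\.net)$/, route: { primary: "firecrawl-scrape-waitfor", fallback: "search-by-title" } },
  { hostPattern: /(^|\.)ieeexplore\.ieee\.org$/, route: { primary: "firecrawl-scrape-waitfor", fallback: "author-website" } },
  { hostPattern: /(^|\.)dl\.acm\.org$/, route: { primary: "firecrawl-scrape-waitfor", fallback: "author-website" } },
  { hostPattern: /(^|\.)scholar\.google\.com$/, route: { primary: "firecrawl-search", fallback: "direct-search-query" } },
];

// Assumption: anything not listed is handled like an author blog.
const DEFAULT_ROUTE: Route = { primary: "jina-reader", fallback: "firecrawl-scrape" };

// Extract the lowercase host from an http(s) URL without relying on runtime globals.
function hostOf(url: string): string {
  const match = /^https?:\/\/([^\/:?#]+)/i.exec(url);
  return match?.[1]?.toLowerCase() ?? "";
}

function routePaperUrl(url: string): Route {
  const host = hostOf(url);
  for (const { hostPattern, route } of ROUTES) {
    if (hostPattern.test(host)) return route;
  }
  return DEFAULT_ROUTE;
}
```

A DOI would first be resolved to its publisher URL and then passed to `routePaperUrl`.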
### DOI Resolution
```typescript
// DOI → publisher URL → route to appropriate scraper
const res = await fetch(`https://doi.org/${doi}`, { redirect: "follow" });
const publisherUrl = res.url; // e.g., https://dl.acm.org/doi/10.1145/...
// Then route publisherUrl through the decision tree above
```
---
## Section 3 — Recursive Research Protocol
The iterative search → extract → recurse → synthesize pattern. Full step-by-step protocol in [recursive-research-protocol.md](./references/recursive-research-protocol.md).
### Algorithm Overview
```
deepResearch(topic, breadth=4, depth=2, concurrency=2):
  1. Generate N search queries (N = breadth) from topic + prior learnings
  2. For each query (via p-limit concurrency):
     a. Firecrawl /v1/search → get results
     b. PERSIST each raw result to docs/research/corpus/
     c. Extract learnings + follow-up questions
  3. For each follow-up question:
     → Recurse with breadth=ceil(breadth/2), depth=depth-1
  4. Base case: depth=0 → return accumulated learnings
  5. Synthesize final report from all learnings
  6. Write session report to docs/research/sessions/
```
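The recursion in steps 1 to 4 can be sketched with the search, persistence, and LLM calls injected as dependencies. This is an illustrative skeleton under stated assumptions, not the working implementation: the `ResearchDeps` interface and every name in it are hypothetical, queries run sequentially instead of through `p-limit`, and synthesis and the session report (steps 5 and 6) are left to the caller.

```typescript
interface SearchResult { url: string; markdown: string }
interface Extraction { learnings: string[]; followUps: string[] }

// Hypothetical dependency bundle; real code would wrap Firecrawl and an LLM here.
interface ResearchDeps {
  generateQueries(topic: string, count: number, prior: string[]): Promise<string[]>;
  search(query: string): Promise<SearchResult[]>;
  persist(result: SearchResult, query: string, depthLevel: number): Promise<void>;
  extract(query: string, results: SearchResult[]): Promise<Extraction>;
}

async function deepResearch(
  topic: string,
  breadth: number,
  depth: number,
  deps: ResearchDeps,
  prior: string[] = [],
  level = 1,
): Promise<string[]> {
  // Base case: depth exhausted, return what has been accumulated.
  if (depth <= 0) return prior;

  const queries = await deps.generateQueries(topic, breadth, prior);
  let learnings = [...prior];

  for (const query of queries) {
    try {
      const results = await deps.search(query);
      // Persist every raw result before any extraction happens.
      for (const result of results) await deps.persist(result, query, level);

      const { learnings: found, followUps } = await deps.extract(query, results);
      learnings = Array.from(new Set([...learnings, ...found]));

      // Recurse on each follow-up with halved breadth and one less level.
      for (const followUp of followUps) {
        learnings = await deepResearch(
          followUp, Math.ceil(breadth / 2), depth - 1, deps, learnings, level + 1,
        );
      }
    } catch {
      // A failed query is skipped so the rest of the session can continue.
    }
  }
  return learnings;
}
```

Injecting the dependencies keeps the recursion testable with stubs and leaves the choice of concurrency limiter open.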
### Default Parameters (from working implementation)
| Parameter | Default | Max | Rationale |
| ------------- | ------- | --- | ------------------------------------------------------- |
| `breadth` | 4 | — | Number of parallel search queries per level |
| `depth` | 2 | 5 | Recursion levels (depth > 5 yields diminishing returns) |
| `concurrency` | 2 | — | Parallel Firecrawl requests (self-hosted, be gentle) |
| `limit` | 5 | — | Results per search query |
| `timeout` | 15000ms | — | Per-search timeout |
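To make the depth cap and the halving of breadth concrete, here is a small sketch. `clampDepth` is named elsewhere in this skill, but its body below is an assumed implementation, and `breadthSchedule` is a hypothetical helper added only to show which breadth applies at each level.

```typescript
const MAX_DEPTH = 5; // cap taken from the parameter table

// Assumed implementation: whole numbers only, bounded to the range 0..MAX_DEPTH.
function clampDepth(depth: number): number {
  if (Number.isNaN(depth)) return 0;
  return Math.min(MAX_DEPTH, Math.max(0, Math.floor(depth)));
}

// Breadth used at each recursion level: halved and rounded up per level.
function breadthSchedule(breadth: number, depth: number): number[] {
  const levels: number[] = [];
  let current = Math.max(1, Math.floor(breadth));
  for (let remaining = clampDepth(depth); remaining > 0; remaining--) {
    levels.push(current);
    current = Math.ceil(current / 2);
  }
  return levels;
}

const defaults = breadthSchedule(4, 2); // [4, 2]
```

With the defaults, the first level issues 4 queries and each follow-up at the second level issues 2.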
### Token Budget
Each search returns up to 5 pages. Trim each page to ~25,000 tokens before LLM processing:
```typescript
function trimToTokenLimit(text: string, maxTokens: number): string {
  if (!text) return "";
  const estimatedTokens = Math.ceil(text.length / 3.5);
  if (estimatedTokens <= maxTokens) return text;
  const maxChars = Math.floor(maxTokens * 3.5 * 0.8);
  return text.slice(0, maxChars);
}
```
### Partial Failure Principle
**Partial results are better than total failure.** If a query fails, log it and continue with remaining queries. Never abort the entire research session because one query timed out.
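One way to apply this is to run the queries through `Promise.allSettled` and then split the outcomes. The sketch below covers only the splitting step; `partitionSettled` and the local `Settled` type (which mirrors the shape `Promise.allSettled` produces) are hypothetical names.

```typescript
// Same shape as the entries returned by Promise.allSettled.
type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

interface FailedQuery { query: string; reason: string }

// Keep every successful result and record failures for logging.
function partitionSettled<T>(
  queries: string[],
  results: Settled<T>[],
): { succeeded: T[]; failed: FailedQuery[] } {
  const succeeded: T[] = [];
  const failed: FailedQuery[] = [];
  results.forEach((result, index) => {
    if (result.status === "fulfilled") {
      succeeded.push(result.value);
    } else {
      failed.push({
        query: queries[index] ?? "(unknown query)",
        reason: String(result.reason),
      });
    }
  });
  return { succeeded, failed };
}

// Typical call site (runSearch is a placeholder for the actual search function):
//   const results = await Promise.allSettled(queries.map(runSearch));
//   const { succeeded, failed } = partitionSettled(queries, results);
```

The `failed` list can be written to the session report so skipped queries stay visible.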
---
## Section 4 — Raw Corpus Persistence
**Critical principle**: Every Firecrawl-scraped page must be persisted in its **original raw markdown** with provenance metadata. Synthesized reports reference these originals but never replace them.
Full format specification in [corpus-persistence-format.md](./references/corpus-persistence-format.md).
### Directory Layout
```
{project-root}/
├── docs/research/
│   ├── corpus/                          # Raw scraped pages (committed)
│   │   └── YYYY-MM-DD-{slug}.md         # One file per scraped URL
│   ├── sessions/                        # Research session reports (committed)
│   │   └── YYYY-MM-DD-{topic-slug}.md   # Synthesized report with corpus refs
│   └── corpus-index.jsonl               # Append-only registry (committed)
```
### Corpus File Frontmatter
```yaml
---
source_url: https://arxiv.org/html/2401.12345
scraped_at: "2026-02-25T14:30:00Z"
scraper: firecrawl
firecrawl_endpoint: /v1/search
search_query: "mixture of experts scaling"
result_index: 2
research_session: "2026-02-25-moe-scaling"
depth_level: 1
claude_code_uuid: SESSION_UUID
content_tokens_approx: 4200
---
[RAW MARKDOWN FROM FIRECRAWL — NEVER MODIFIED]
```
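A corpus file in this shape could be assembled as follows. This is a minimal sketch: `renderCorpusFile` and `CorpusMeta` are hypothetical names, string values are always double-quoted (a simplification of the example above), and the raw markdown is appended untouched.

```typescript
type CorpusMeta = Record<string, string | number | boolean>;

// Build "--- frontmatter --- raw body" without altering the raw markdown.
function renderCorpusFile(meta: CorpusMeta, rawMarkdown: string): string {
  const lines = Object.entries(meta).map(([key, value]) =>
    // JSON string syntax is also a valid YAML double-quoted scalar.
    typeof value === "string" ? `${key}: ${JSON.stringify(value)}` : `${key}: ${value}`,
  );
  return ["---", ...lines, "---", rawMarkdown].join("\n");
}

const example = renderCorpusFile(
  { source_url: "https://arxiv.org/html/2401.12345", scraper: "firecrawl", depth_level: 1 },
  "# Title\n\nBody text exactly as returned.",
);
```

Quoting every string avoids YAML surprises with values that contain colons, such as timestamps and search queries.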
### Key Rules
1. Content below `---` is the **exact markdown Firecrawl returned** — no summarization, trimming, or reformatting
2. One file per URL per scrape — if the same URL is scraped in multiple sessions, each gets its own timestamped file
3. File naming: `YYYY-MM-DD-{slug}.md` where slug is kebab-case from page title or URL path (max 60 chars)
4. Session reports in `docs/research/sessions/` reference corpus files by relative path
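The naming rule can be sketched as below. `slugify` and `corpusFileName` are hypothetical helpers, and the sketch assumes the 60-character limit applies to the slug alone, not to the whole file name.

```typescript
// Kebab-case slug from a page title or URL path, capped at maxLength characters.
function slugify(titleOrPath: string, maxLength = 60): string {
  return titleOrPath
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // every non-alphanumeric run becomes one hyphen
    .replace(/^-+|-+$/g, "")     // no leading or trailing hyphens
    .slice(0, maxLength)
    .replace(/-+$/g, "");        // the cut itself may leave a trailing hyphen
}

// scrapedAt is expected to be an ISO 8601 timestamp such as "2026-02-25T14:30:00Z".
function corpusFileName(scrapedAt: string, titleOrPath: string): string {
  const date = scrapedAt.slice(0, 10); // YYYY-MM-DD
  return `${date}-${slugify(titleOrPath)}.md`;
}

const name = corpusFileName("2026-02-25T14:30:00Z", "Mixture of Experts: Scaling Laws!");
// name === "2026-02-25-mixture-of-experts-scaling-laws.md"
```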
### Corpus Index (JSONL)
```json
{
  "url": "https://arxiv.org/html/2401.12345",
  "file": "corpus/2026-02-25-moe-scaling-arxiv-2401-12345.md",
  "scraped_at": "2026-02-25T14:30:00Z",
  "session": "2026-02-25-moe-scaling",
  "tokens": 4200,
  "scraper": "firecrawl"
}
```
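The entry above is shown pretty-printed for readability; in a JSONL file each entry occupies exactly one line. A minimal sketch of writing and reading that form follows, where `toJsonlLine` and `parseJsonl` are hypothetical helpers and the interface simply mirrors the fields in the example.

```typescript
interface CorpusIndexEntry {
  url: string;
  file: string;
  scraped_at: string;
  session: string;
  tokens: number;
  scraper: string;
}

// One entry per line, newline-terminated, ready to append to corpus-index.jsonl.
function toJsonlLine(entry: CorpusIndexEntry): string {
  return JSON.stringify(entry) + "\n";
}

// Parse the whole index, ignoring blank lines.
function parseJsonl(text: string): CorpusIndexEntry[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as CorpusIndexEntry);
}

const entry: CorpusIndexEntry = {
  url: "https://arxiv.org/html/2401.12345",
  file: "corpus/2026-02-25-moe-scaling-arxiv-2401-12345.md",
  scraped_at: "2026-02-25T14:30:00Z",
  session: "2026-02-25-moe-scaling",
  tokens: 4200,
  scraper: "firecrawl",
};
const line = toJsonlLine(entry);
```

Because `JSON.stringify` without an indent argument never emits raw newlines, appending `line` keeps the file valid JSONL.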
### Why This Matters
- **LLM re-analysis**: Future sessions can re-read raw corpus files and extract different insights with better prompts or newer models
- **No information loss**: Synthesis drops details; raw files preserve everything Firecrawl captured
- **Deduplication awareness**: The JSONL index lets agents skip URLs already in the corpus
- **Git-friendly**: Markdown files diff cleanly, JSONL is append-only
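The deduplication point can be illustrated with a small filter over the index contents. This is a sketch under assumptions: `knownUrls` and `filterNewUrls` are hypothetical names, the index text is assumed to be already loaded into a string, and skipping is a caller's choice since a URL may legitimately be scraped again in a later session.

```typescript
// Collect every "url" value found in the JSONL index text.
function knownUrls(indexText: string): Set<string> {
  const urls = new Set<string>();
  for (const line of indexText.split("\n")) {
    if (line.trim() === "") continue;
    try {
      const entry: unknown = JSON.parse(line);
      if (typeof entry === "object" && entry !== null && "url" in entry) {
        const url = (entry as { url: unknown }).url;
        if (typeof url === "string") urls.add(url);
      }
    } catch {
      // A malformed line is ignored so one bad entry does not block the lookup.
    }
  }
  return urls;
}

// Return only the candidates that are not yet in the corpus index.
function filterNewUrls(candidates: string[], indexText: string): string[] {
  const seen = knownUrls(indexText);
  return candidates.filter((url) => !seen.has(url));
}
```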
---
## Section 5 — Self-Hosted Operations
The Firecrawl instance runs on **littleblack** (Tailscale: `littleblack.tail0f299b.ts.net`, legacy ZeroTier: `172.25.236.1`). All 5 containers up 5+ weeks, stable. No API key needed.
| Port | Service | Type | Purpose |
| ---- | ----------------- | ------ | -------------------------------------------------- |
| 3002 | Firecrawl API | Docker | Core scraping engine (direct API) |
| 3003 | Scraper Wrapper | Bun | JS-rendered SPAs, saves to file, returns Caddy URL |
| 3004 | Cloudflare Bypass | Bun | curl-impersonate for Cloudflare-protected sites |
| 8080 | Caddy | Binary | Serves saved markdown from firecrawl-output/ |
**When to use which port:**
| Site Type | Port | Why |
| ---------------------- | ---- | --------------------------------------------- |
| arXiv / standard pages | 3003 | Playwright JS rendering, preserves image URLs |
| Claude artifacts | 3004 | Cloudflare blocks Playwright |
| Gemini/ChatGPT shares | 3003 | Needs JS rendering (SPA) |
| Other Cloudflare sites | 3004 | If 3003 gets a Cloudflare challenge |
```bash
# Standard scrape (port 3003 — JS rendering + save)
curl "http://littleblack:3003/scrape?url=URL&name=NAME"
# Cloudflare bypass (port 3004)
curl "http://littleblack:3004/scrape-cf?url=URL&name=NAME"
# Health checks (no SSH required)
curl -s --max-time 4 http://littleblack:3003/health
curl -s --max-time 4 http://littleblack:3004/health
curl -s --max-time 4 http://littleblack:8080/
```
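When calling the wrappers from code, the target URL needs percent-encoding so its own query string does not collide with the wrapper's parameters. A minimal sketch, assuming the wrappers decode standard percent-encoded query values; `wrapperScrapeUrl` and the `SiteType` labels are hypothetical.

```typescript
type SiteType = "standard" | "cloudflare";

const WRAPPER_HOST = "http://littleblack";

// Build the wrapper request URL: port 3003 by default, port 3004 for Cloudflare-protected sites.
function wrapperScrapeUrl(targetUrl: string, name: string, siteType: SiteType = "standard"): string {
  const base =
    siteType === "cloudflare"
      ? `${WRAPPER_HOST}:3004/scrape-cf`
      : `${WRAPPER_HOST}:3003/scrape`;
  return `${base}?url=${encodeURIComponent(targetUrl)}&name=${encodeURIComponent(name)}`;
}

const standardUrl = wrapperScrapeUrl("https://arxiv.org/html/2312.00752v2", "mamba-ssm");
// "http://littleblack:3003/scrape?url=https%3A%2F%2Farxiv.org%2Fhtml%2F2312.00752v2&name=mamba-ssm"
```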
For architecture diagrams, health checks, recovery commands, and deployment details, see:
- [Self-Hosted Operations](./references/self-hosted-operations.md) — Architecture, health checks, recovery commands
- [Self-Hosted Bootstrap Guide](./references/self-hosted-bootstrap-guide.md) — Fresh installation (7 steps)
- [Self-Hosted Best Practices](./references/self-hosted-best-practices.md) — Docker restart policies, monitoring
- [Self-Hosted Troubleshooting](./references/self-hosted-troubleshooting.md) — Symptom-based diagnosis
---
## Section 6 — Image and Figure Capture
Text-only scrapers (Jina, direct Firecrawl) capture prose but lose architecture diagrams, result plots, and attention maps. For image-rich papers, always capture figures.
### When to Capture Images
Capture figures when the paper contains any of:
- Architecture diagrams (model structure, attention patterns)
- Benchmark/result comparison plots
- Qualitative examples (generated outputs, visualizations)
- Algorithm flowcharts or pseudocode figures
### arXiv HTML Figure URL Discovery
arXiv HTML papers store figures at sequential absolute URLs (`x1.png`, `x2.png`, ...). Probe to discover all figure URLs — do NOT download them locally:
```bash
ARXIV_ID="2312.00752"
ARXIV_VER="v2"
BASE_URL="https://arxiv.org/html/${ARXIV_ID}${ARXIV_VER}"
FIGURE_URLS=()
# Probe sequential URLs until 404 — collect absolute URLs only
for i in $(seq 1 50); do
  url="${BASE_URL}/x${i}.png"
  status=$(curl -s -o /dev/null -w "%{http_code}" "$url")
  if [ "$status" != "200" ]; then
    echo "Stopped at x${i}.png (${status}) — found ${#FIGURE_URLS[@]} figures"
    break
  fi
  FIGURE_URLS+=("$url")
  echo "Found: $url"
done
```
The collected absolute URLs go directly into the markdown body and frontmatter — no local copies needed.
### Inline Figure Embedding (GFM)
Each figure must appear inline in the corpus markdown as an absolute URL so GitHub renders it in-place:
```markdown
## Key Figures
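![Figure 1](https://arxiv.org/html/2312.00752v2/x1.png)
![Figure 2](https://arxiv.org/html/2312.00752v2/x2.png)
![Figure 3](https://arxiv.org/html/2312.00752v2/x3.png)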
```
> **Never rewrite to relative paths** like `./figures/x1.png` — relative paths break on GitHub unless images are committed to the same repo.
### Extracting Existing Inline URLs from Scraped Markdown
When port 3003 (Playwright) already embedded absolute URLs in the scraped markdown, extract them for the frontmatter catalog:
```bash
CORPUS_FILE="docs/research/corpus/2026-03-13-mamba-ssm.md"
# Extract all absolute image URLs already in the markdown
grep -oE 'https://[^)]+\.(png|jpg|svg|gif|webp)' "$CORPUS_FILE" | sort -u
```
These URLs are already inline — just copy them into the frontmatter `figure_urls` list.
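The same extraction can be done in TypeScript when the corpus file is already in memory. This is a rough equivalent of the `grep | sort -u` pipeline, not a replacement for it: `extractFigureUrls` is a hypothetical name, and the pattern also stops at whitespace, which is slightly stricter than the shell version.

```typescript
// Find absolute image URLs, remove duplicates, and sort them lexicographically.
function extractFigureUrls(markdown: string): string[] {
  const pattern = /https:\/\/[^)\s]+\.(?:png|jpg|svg|gif|webp)/g;
  const matches: string[] = markdown.match(pattern) ?? [];
  return Array.from(new Set(matches)).sort();
}

const sample = [
  "![Figure 2](https://arxiv.org/html/2312.00752v2/x2.png)",
  "Some prose between figures.",
  "![Figure 1](https://arxiv.org/html/2312.00752v2/x1.png)",
  "![Figure 1 again](https://arxiv.org/html/2312.00752v2/x1.png)",
].join("\n");

const figureUrls = extractFigureUrls(sample);
// figureUrls has two entries: .../x1.png then .../x2.png
```

As with `sort -u`, the ordering is lexicographic, so `x10.png` sorts before `x2.png`.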
### Frontmatter for Image-Rich Papers
The YAML frontmatter catalogs all figure source URLs for provenance. The markdown body embeds them inline:
```yaml
---
source_url: https://arxiv.org/html/2312.00752v2
scraped_at: "2026-03-13T00:00:00Z"
scraper: firecrawl-port3003
tags: [ssm, state-space-model, mamba, sequence-modeling]
content_tokens_approx: 4200
has_figures: true
figure_count: 12
figure_urls:
- https://arxiv.org/html/2312.00752v2/x1.png
- https://arxiv.org/html/2312.00752v2/x2.png
- https://arxiv.org/html/2312.00752v2/x3.png
- https://arxiv.org/html/2312.00752v2/x4.png
- https://arxiv.org/html/2312.00752v2/x5.png
---
```
### Corpus Index Entry with Figures
```json
{
  "url": "https://arxiv.org/html/2312.00752v2",
  "file": "corpus/2026-03-13-mamba-ssm.md",
  "scraped_at": "2026-03-13T00:00:00Z",
  "session": "2026-03-13-mamba-ssm",
  "scraper": "firecrawl-port3003",
  "has_figures": true,
  "figure_count": 12,
  "figure_urls": [
    "https://arxiv.org/html/2312.00752v2/x1.png",
    "https://arxiv.org/html/2312.00752v2/x2.png"
  ]
}
```
### Port 3003 vs Jina Reader: Empirical Comparison (arXiv)
**Validated on arXiv:2312.00752v2 (Mamba paper) — both scrapers running, same URL:**
| Scraper | Bytes | Lines | Words | Figures (absolute inline) | Math on GitHub |
| ------------------------ | ------ | ----- | ------ | ------------------------- | -------------------------------------- |
| Port 3003 (Firecrawl) | 99,104 | 1,267 | 13,182 | 13 ✅ | ❌ doubled Unicode+LaTeX, no `$...$` |
| Port 3002 (direct API) | 99,104 | 1,267 | 13,182 | 13 ✅ (identical to 3003) | ❌ doubled Unicode+LaTeX, no `$...$` |
| Jina Reader | 84,832 | 596 | 10,761 | 12 ✅ | ❌ doubled Unicode+LaTeX, no `$...$` |
| Pandoc from LaTeX source | — | — | — | via `\includegraphics` | ✅ `$inline$` + ` ```math ``` ` blocks |
**Verdict**: Firecrawl (port 3002/3003) returns **17% more bytes, 2.1× as many lines, 22% more words, and 1 extra figure** compared with Jina. Port 3002 and 3003 produce identical markdown (3003 just wraps 3002 and saves to Caddy). **Both emit absolute inline figure URLs**, so no URL reconstruction is needed with either scraper.
**Note on the earlier session timeout**: The March 2026 session failure was machine downtime (littleblack was offline), not a routing issue. When littleblack is up, port 3003 reaches arxiv.org fine.
**Recommended arXiv workflow**:
1. Port 3003 (preferred) — more complete content, figures inline, saves to Caddy
2. Jina Reader (fallback when littleblack is down) — about 14% fewer bytes than Firecrawl but still gets absolute figure URLs
3. Probe loop to build `figure_urls` frontmatter catalog regardless of scraper used
4. For human-readable math on GitHub: Pandoc from arXiv LaTeX source (see below)
### Math Rendering: Empirically Validated Approaches
**Validated on arXiv:2312.00752v2 (Mamba paper), March 2026.**
#### Firecrawl/Jina Math Output: Unreadable on GitHub
Both Firecrawl (port 3002/3003) and Jina Reader extract math by doubling content — each equation appears as a Unicode render followed immediately by raw LaTeX source, packed into markdown table cells with `\displaystyle` prefixes and `\\bm{}` escaping. Example from the empirical test:
```
| | h′(t)\\displaystyle h^{\\prime}(t) | \=𝑨h(t)+𝑩x(t)\\displaystyle=\\bm{A}h(t)+\\bm{B}x(t) | | (1a) |
```
No `$...$` delimiters — **GitHub cannot render this as math**. The raw LaTeX portion is parseable by an LLM (equations are present), but the output is completely unreadable to humans on GitHub.
**For LLM consumption**: Firecrawl's doubled content is sufficient — the LaTeX source is embedded and an LLM can extract it.
**For human-readable GitHub rendering**: Use Pandoc from the arXiv LaTeX source tarball (see below).
#### Pandoc from arXiv LaTeX Source (Human-Readable Math)
Produces proper `$inline$` and ` ```math ``` ` display blocks that GitHub renders natively (GitHub's math rendering is based on MathJax):
```bash
ARXIV_ID="2312.00752"
# Download arXiv LaTeX source tarball
curl -L "https://arxiv.org/src/${ARXIV_ID}" -o "${ARXIV_ID}-src.tar.gz"
mkdir -p "${ARXIV_ID}-src"
tar xzf "${ARXIV_ID}-src.tar.gz" -C "${ARXIV_ID}-src/"
# Find main .tex entry point and section files
ls "${ARXIV_ID}-src/"*.tex
ls "${ARXIV_ID}-src/src/"*.tex 2>/dev/null # some papers put sections in src/
# Option A: Convert individual section files (safer — avoids macro parse errors)
pandoc "${ARXIV_ID}-src/src/background.tex" \
--to gfm+tex_math_dollars \
--wrap=none \
-o "${ARXIV_ID}-background.md"
# Option B: Convert full main.tex (may fail on custom macros like \iftoggle)
pandoc "${ARXIV_ID}-src/main.tex" \
--to gfm+tex_math_dollars \
--wrap=none \
-o "${ARXIV_ID}-pandoc.md"
```
Install: `brew install pandoc`. Works on any arXiv paper that publishes LaTeX source (most do).
**Pandoc output quality** (empirically validated):
- Inline math: `$x(t) \in \R \mapsto y(t) \in \R$` ✅ GitHub renders
- Display math: ` ```math\n\begin{align}\nh'(t) &= \A h(t) + \B x(t)\n\end{align}\n``` ` ✅ GitHub renders
- Custom macros (`\A`, `\B`, `\R`, `\dt`, `\dA`, `\dB`): ⚠️ **undefined in GitHub's math renderer**. Macros pass through as-is and may partially fail on GitHub without the preamble's `\newcommand` definitions
**Handling custom macros**: Prepend the `\newcommand` block from `main.tex` preamble to the output:
````bash
# Extract custom macro definitions from preamble
grep '\\newcommand\|\\renewcommand\|\\def ' "${ARXIV_ID}-src/main.tex" > macros.tex
# Pandoc does not read preamble macros — include them explicitly in a math block at the top:
echo '```math' > preamble-block.md
cat macros.tex >> preamble-block.md
echo '```' >> preamble-block.md
cat preamble-block.md "${ARXIV_ID}-pandoc.md" > "${ARXIV_ID}-with-macros.md"
````
**Known Pandoc parse errors on arXiv LaTeX**:
| Error trigger | Cause | Workaround |
| -------------------- | ---------------------------------------------- | ----------------------------------------- |
| `\iftoggle{arxiv}` | Undefined toggle macro (etoolbox package) | Convert section files instead of main.tex |
| `\begin{figure*}` | Two-column figure environment breaks structure | Use `head -N` to avoid broken `\end` tags |
| `\bm{}`, `\mathbf{}` | Passes through; may not render on GitHub       | Check paper's macro file for mappings     |
---
## Anti-Patterns
| # | Anti-Pattern | Why It Fails | Correct Approach |
| --- | --------------------------------------------- | ---------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------- |
| 1 | Using `@mendable/firecrawl-js` SDK | `jiti` dynamic imports break in Bun | Direct `fetch()` calls |
| 2 | Searching paywalled sites without `waitFor` | JS SPAs return empty shell | Use `waitFor: 3000` for IEEE, ACM DL |
| 3 | Setting depth > 5 | Exponential query explosion, diminishing returns | Cap at depth 5 (`clampDepth()`) |
| 4 | No timeout on `fetch()` | Hangs indefinitely on unreachable pages | Always use `AbortController` with 15s timeout |
| 5 | Not trimming long page content | Exceeds LLM context window | `trimToTokenLimit(text, 25_000)` per page |
| 6 | Aborting on partial failure | Loses all completed work | Log failures, continue with remaining queries |
| 7 | Not checking Firecrawl health first | Wastes time on queries that all fail | `GET /v1/health` or test search before starting |
| 8 | Saving only synthesis without raw originals | Loses source material, prevents re-analysis | Always persist raw Firecrawl markdown to corpus |
| 9 | Rewriting figure URLs to local relative paths | Relative paths like `./figures/x1.png` break on GitHub, so images don't render | Keep absolute URLs inline in markdown body (`![Figure N](https://...)`); catalog in frontmatter `figure_urls` list (see Section 6) |
---
## References
- [API Endpoint Reference](./references/api-endpoint-reference.md) — `/v1/search` and `/v1/scrape` contracts
- [Academic Paper Routing](./references/academic-paper-routing.md) — Decision tree for paper sources
- [Recursive Research Protocol](./references/recursive-research-protocol.md) — Step-by-step recursive pattern
- [Corpus Persistence Format](./references/corpus-persistence-format.md) — Raw content archival format + directory layout
- [Self-Hosted Operations](./references/self-hosted-operations.md) — Architecture, health checks, recovery
- [Self-Hosted Bootstrap Guide](./references/self-hosted-bootstrap-guide.md) — Fresh installation guide
- [Self-Hosted Best Practices](./references/self-hosted-best-practices.md) — Docker restart policies, monitoring
- [Self-Hosted Troubleshooting](./references/self-hosted-troubleshooting.md) — Symptom-based diagnosis and recovery
## Post-Execution Reflection
After this skill completes, check before closing:
1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure.
2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
3. **Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.
Only update if the issue is real and reproducible — not speculative.
Related Skills
mql5-indicator-patterns
MQL5 indicator development patterns. TRIGGERS - MQL5 indicator, OnCalculate, indicator buffers, MetaTrader 5.
research-archival
Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification. TRIGGERS - scrape research, archive findings, save ChatGPT share, save Gemini research, research to issue.
gemini-deep-research
Run Gemini Deep Research via browser automation. Persistent Chrome on CDP port 9222. Use when user asks to research a topic with Gemini, run deep research, or wants comprehensive AI-powered research reports. TRIGGERS - Gemini research, deep research, research report, Gemini Deep Research
claude-code-proxy-patterns
Claude Code OAuth proxy patterns and anti-patterns for multi-provider model routing. TRIGGERS - proxy Claude Code, OAuth token Keychain, route Haiku to MiniMax, ANTHROPIC_BASE_URL, model routing proxy, claude-code-proxy, proxy-toggle, multi-provider setup, anthropic-beta oauth, proxy auth failure, go proxy, failover proxy, launchd proxy, proxy failover
voice-quality-audition
Audition Kokoro TTS voices to compare quality and grade. TRIGGERS - audition voices, kokoro voices, voice comparison, tts voice, voice quality, compare voices.
settings-and-tuning
Configure TTS voices, speed, timeouts, queue depth, and bot settings. TRIGGERS - configure tts, change voice, tts speed, queue depth, tts timeout, bot config, tune settings, adjust parameters.
full-stack-bootstrap
One-time bootstrap for Kokoro TTS engine, Telegram bot, and BotFather setup. TRIGGERS - setup tts, install kokoro, botfather, bootstrap tts-tg-sync, configure telegram bot, full stack setup.
diagnostic-issue-resolver
Diagnose and resolve TTS and Telegram bot issues. TRIGGERS - tts not working, bot not responding, kokoro error, audio not playing, lock stuck, telegram bot troubleshoot, diagnose issue.
component-version-upgrade
Upgrade Kokoro model, bot dependencies, or TTS components. TRIGGERS - upgrade kokoro, update model, upgrade bot, update dependencies, version bump, component update.
clean-component-removal
Remove TTS and Telegram sync components cleanly. TRIGGERS - uninstall tts, remove telegram bot, uninstall kokoro, clean tts, teardown, component removal.
send-message
Use when user wants to send a text message on Telegram as their personal account via MTProto, text someone, or message a contact by username, phone, or chat ID.
send-media
Use when user wants to send or upload a file, photo, video, voice note, or document on Telegram via their personal account.