research-archival
Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification. TRIGGERS - scrape research, archive findings, save ChatGPT share, save Gemini research, research to issue.
Best use case
research-archival is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification. TRIGGERS - scrape research, archive findings, save ChatGPT share, save Gemini research, research to issue.
Teams using research-archival should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/research-archival/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How research-archival Compares
| Feature / Agent | research-archival | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Scrape AI research URLs, archive with frontmatter, create GitHub Issues with identity verification. TRIGGERS - scrape research, archive findings, save ChatGPT share, save Gemini research, research to issue.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Research Archival
Scrape AI research conversations (ChatGPT, Gemini, Claude) and web pages, archive them as markdown files with YAML frontmatter, and create cross-referenced GitHub Issues — with mandatory identity verification at every step.
> **Self-Evolving Skill**: This skill improves through use. If instructions are wrong, parameters drifted, or a workaround was needed — fix this file immediately, don't defer. Only update for real, reproducible issues.
## FIRST - TodoWrite Task Templates
**MANDATORY**: Select and load the appropriate template before any archival work.
### Template A - Full Archival (scrape + save + issue)
```
1. Identity preflight — verify GH_ACCOUNT or resolve via curl /user
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
4. Survey labels — gh label list, reuse existing, max 3-6
5. Create GitHub Issue — use --body with heredoc or --body-file
6. Update frontmatter — add github_issue_url and github_issue_number
7. Post canonical backlink comment on Issue
```
### Template B - Save Only (no issue)
```
1. Identity preflight (still required for consistency)
2. Scrape URL — route to Firecrawl or Jina per url-routing.md
3. Save to file — YYYY-MM-DD-{slug}-{source_type}.md with frontmatter
```
### Template C - Issue Only (file already exists)
```
1. Identity preflight
2. Read existing file frontmatter
3. Survey labels — gh label list, reuse existing, max 3-6
4. Create GitHub Issue — use --body with heredoc or --body-file
5. Update file frontmatter with issue cross-reference
6. Post canonical backlink comment on Issue
```
---
## Identity Preflight (MANDATORY — Step 0)
**MUST execute before any `gh` write command. Non-negotiable.**
The `gh-repo-identity-guard.mjs` PreToolUse hook provides a safety net, but this skill performs its own check as defense-in-depth.
### Resolution Order
1. **Fast-path** — `GH_ACCOUNT` env var (set by mise per-directory)
2. **Token filename** — scan `~/.claude/.secrets/gh-token-*` for single base match
3. **API call** — `curl -sH "Authorization: token $GH_TOKEN" https://api.github.com/user`
### Verification
```bash
/usr/bin/env bash << 'IDENTITY_EOF'
# Resolve authenticated user
if [ -n "$GH_ACCOUNT" ]; then
AUTH_USER="$GH_ACCOUNT"
AUTH_SOURCE="GH_ACCOUNT"
else
AUTH_USER=$(curl -sf --max-time 5 -H "Authorization: token $GH_TOKEN" \
https://api.github.com/user 2>/dev/null | grep -o '"login":"[^"]*"' | cut -d'"' -f4)
AUTH_SOURCE="API /user"
fi
# Resolve target repo owner
REPO_OWNER=$(git remote get-url origin 2>/dev/null | sed -n 's|.*github\.com[:/]\([^/]*\)/.*|\1|p')
echo "Authenticated as: $AUTH_USER (via $AUTH_SOURCE)"
echo "Target repo owner: $REPO_OWNER"
if [ "$AUTH_USER" != "$REPO_OWNER" ]; then
echo ""
echo "MISMATCH — do NOT proceed with gh write commands"
echo "Fix: export GH_TOKEN=\$(cat ~/.claude/.secrets/gh-token-$REPO_OWNER)"
exit 1
fi
echo "Identity verified — safe to proceed"
IDENTITY_EOF
```
**BLOCK if mismatch** — display diagnostic and do NOT continue to any `gh` write operation.
---
## Scraping Workflow
Route scrape requests based on URL pattern. See [url-routing.md](./references/url-routing.md) for full details.
### Decision Tree
```
URL contains chatgpt.com/share/
→ Jina Reader (https://r.jina.ai/{URL})
→ Use curl (not WebFetch — it summarizes instead of returning raw)
URL contains gemini.google.com/share/
→ Firecrawl (JS-heavy SPA)
→ Preflight: ping -c1 -W2 littleblack
URL contains claude.ai/artifacts/ or is a static web page
→ Jina Reader (https://r.jina.ai/{URL})
→ Use WebFetch or curl
```
### Firecrawl Scrape (with Health Check + Auto-Revival)
**CRITICAL**: Firecrawl containers can show "Up" in `docker ps` while internal processes are dead (RAM/CPU overload crashes the worker inside the container). Always perform a deep health check before scraping.
```bash
/usr/bin/env bash << 'SCRAPE_EOF'
set -euo pipefail
# Step 1: Check Tailscale connectivity (littleblack primary, ZeroTier legacy at 172.25.236.1)
if ! ping -c1 -W2 littleblack >/dev/null 2>&1; then
echo "ERROR: Firecrawl host unreachable. Check Tailscale: tailscale status"
exit 1
fi
# Step 2: Deep health check — test actual API response, not just container status
# Port 3003 (wrapper) may accept TCP but return empty if Firecrawl API (3002) is dead inside
HTTP_CODE=$(ssh littleblack 'curl -sf -o /dev/null -w "%{http_code}" --max-time 10 \
-X POST http://localhost:3002/v1/scrape \
-H "Content-Type: application/json" \
-d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || echo "000")
if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
echo "WARNING: Firecrawl API unhealthy (HTTP $HTTP_CODE). Attempting revival..."
# Step 2a: Check docker logs for WORKER STALLED (RAM/CPU overload)
ssh littleblack 'docker logs firecrawl-api-1 --tail 20 2>&1 | grep -i "stalled\|error\|exit" || true'
# Step 2b: Restart the critical containers
ssh littleblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1' 2>/dev/null
echo "Containers restarted. Waiting 20s for API to initialize..."
sleep 20
# Step 2c: Verify recovery
HTTP_CODE=$(ssh littleblack 'curl -sf -o /dev/null -w "%{http_code}" --max-time 10 \
-X POST http://localhost:3002/v1/scrape \
-H "Content-Type: application/json" \
-d "{\"url\":\"https://example.com\",\"formats\":[\"markdown\"]}"' 2>/dev/null || echo "000")
if [ "$HTTP_CODE" = "000" ] || [ "$HTTP_CODE" = "502" ] || [ "$HTTP_CODE" = "503" ]; then
echo "ERROR: Firecrawl still unhealthy after restart (HTTP $HTTP_CODE)."
echo "Manual intervention needed. Try: ssh littleblack 'cd ~/firecrawl && docker compose up -d --force-recreate'"
echo "Falling back to Jina Reader: https://r.jina.ai/${URL}"
exit 1
fi
echo "Firecrawl recovered successfully."
fi
# Step 3: Scrape via wrapper
CONTENT=$(curl -s --max-time 120 "http://littleblack:3003/scrape?url=${URL}&name=${SLUG}")
if [ -z "$CONTENT" ]; then
echo "ERROR: Scrape returned empty. Try Jina fallback: https://r.jina.ai/${URL}"
exit 1
fi
echo "$CONTENT"
SCRAPE_EOF
```
### Known Failure Mode: Container "Up" But Processes Dead
**Symptom**: `docker ps` shows containers with status "Up 4 days" but `curl localhost:3002` returns connection reset.
**Root cause**: Firecrawl worker exhausts RAM/CPU (observed: `cpuUsage=0.998, memoryUsage=0.858`). Internal Node.js processes exit but Docker container stays alive because the entrypoint shell is still running.
**Diagnosis**:
```bash
ssh bigblack 'docker logs firecrawl-api-1 --tail 50 2>&1 | grep -E "STALLED|cpuUsage|exit"'
# Look for: WORKER STALLED {"cpuUsage":0.998,"memoryUsage":0.858}
```
**Fix**: `docker restart` (not `docker compose restart` — may require permissions to compose directory):
```bash
ssh bigblack 'docker restart firecrawl-api-1 firecrawl-playwright-service-1'
sleep 20 # Wait for API initialization
# Verify:
ssh bigblack 'curl -s -o /dev/null -w "%{http_code}" http://localhost:3002/v1/scrape'
```
---
## File Saving
### Naming Convention
```
YYYY-MM-DD-{slug}-{source_type}.md
```
- `slug` — kebab-case summary (max 50 chars)
- `source_type` — from enum: `chatgpt`, `gemini`, `claude`, `web`
**Default location**: `docs/research/` in the current project.
### YAML Frontmatter
See [frontmatter-schema.md](./references/frontmatter-schema.md) for the full field contract.
```yaml
---
source_url: https://chatgpt.com/share/...
source_type: chatgpt-share
scraped_at: "2026-02-09T18:30:00Z"
model_name: gpt-4o
custom_gpt_name: Cosmo
claude_code_uuid: SESSION_UUID
github_issue_url: ""
github_issue_number: ""
---
```
Leave `github_issue_url` and `github_issue_number` empty — update after Issue creation.
---
## GitHub Issue Creation
### Label Survey
Survey existing labels first — reuse preferred, create only when concept is genuinely novel.
```bash
gh label list --repo owner/repo --limit 100
```
**Policy**: Max 3-6 labels per issue. Common labels: `research`, `ai-output`, `chatgpt`, `gemini`, `archival`.
### Create Issue
Use `--body` with heredoc for inline composition, or `--body-file` for very large content.
```bash
/usr/bin/env bash << 'ISSUE_EOF'
# Write body to temp file
cat > "/tmp/issue-body-${SLUG}.md" << 'BODY_EOF'
## Summary
Brief description of the archived research content.
## Source
- **URL**: SOURCE_URL
- **Type**: source_type
- **Model**: model_name
- **Scraped**: scraped_at
## Key Findings
- Finding 1
- Finding 2
## Archived File
`docs/research/FILENAME.md`
BODY_EOF
# Create issue
gh issue create \
--repo owner/repo \
--title "Research: descriptive title here" \
--body-file "/tmp/issue-body-${SLUG}.md" \
--label "research,ai-output"
# Clean up
rm -f "/tmp/issue-body-${SLUG}.md"
ISSUE_EOF
```
### Update Frontmatter
After issue creation, update the archived file's frontmatter with the issue URL and number.
---
## Canonical Backlink Comment
Post a comment on the Issue linking back to the archived file:
```
**Archived**: `docs/research/YYYY-MM-DD-slug-source_type.md`
Scraped: 2026-02-09T18:30:00Z
Source: [chatgpt-share](https://chatgpt.com/share/...)
Session: SESSION_UUID
```
---
## Post-Change Checklist
After modifying THIS skill:
1. [ ] YAML frontmatter valid (no colons in description)
2. [ ] Trigger keywords current in description
3. [ ] All `./references/` links resolve
4. [ ] Identity preflight section remains FIRST in workflow
5. [ ] Append changes to [evolution-log.md](./references/evolution-log.md)
6. [ ] Validate: `uv run plugins/plugin-dev/scripts/skill-creator/quick_validate.py plugins/gh-tools/skills/research-archival`
7. [ ] Validate links: `bun run plugins/plugin-dev/scripts/validate-links.ts plugins/gh-tools/skills/research-archival`
---
## Troubleshooting
| Issue | Cause | Fix |
| ----------------------------- | ---------------------------------- | ------------------------------------------------------------------------- |
| Wrong account posting | GH_TOKEN mismatch | Check `mise env \| grep GH_TOKEN`, verify `GH_ACCOUNT` |
| Body exceeds 65536 chars | GitHub API limit | Split across issue body + first comment |
| Firecrawl unreachable | Tailscale down | `tailscale ping bigblack`, check `tailscale status` |
| Firecrawl "Up" but dead | Container alive, processes crashed | `docker restart firecrawl-api-1 firecrawl-playwright-service-1`, wait 20s |
| Firecrawl WORKER STALLED | RAM/CPU overload (>85% mem) | Same as above; check `docker logs firecrawl-api-1 --tail 50` |
| Scrape returns empty | JS-heavy page timeout | Increase Firecrawl timeout, try Jina fallback |
| Jina returns login page shell | Gemini login wall (not rendered) | Must use Firecrawl for `gemini.google.com/share/*` URLs |
| mise parse error | Stale .mise.toml syntax | Run `mise doctor`, check `[hooks.enter]` syntax |
| Identity guard blocks | Non-owner account | `export GH_TOKEN=$(cat ~/.claude/.secrets/gh-token-OWNER)` |
## References
- [Frontmatter Schema](./references/frontmatter-schema.md) — YAML field contract
- [URL Routing](./references/url-routing.md) — Scraper routing table
- [Evolution Log](./references/evolution-log.md) — Change history
## Post-Execution Reflection
After this skill completes, check before closing:
1. **Did the command succeed?** — If not, fix the instruction or error table that caused the failure.
2. **Did parameters or output change?** — If the underlying tool's interface drifted, update Usage examples and Parameters table to match.
3. **Was a workaround needed?** — If you had to improvise (different flags, extra steps), update this SKILL.md so the next invocation doesn't need the same workaround.
Only update if the issue is real and reproducible — not speculative.Related Skills
gemini-deep-research
Run Gemini Deep Research via browser automation. Persistent Chrome on CDP port 9222. Use when user asks to research a topic with Gemini, run deep research, or wants comprehensive AI-powered research reports. TRIGGERS - Gemini research, deep research, research report, Gemini Deep Research
firecrawl-research-patterns
Programmatic Firecrawl usage, self-hosted operations, academic paper routing, recursive deep research, and raw corpus persistence. TRIGGERS - firecrawl search, firecrawl scrape, academic paper, arxiv, deep research, recursive search, research pattern, corpus persistence, firecrawl, self-hosted scraping, web scrape, scraper wrapper, littleblack, Tailscale scraping.
voice-quality-audition
Audition Kokoro TTS voices to compare quality and grade. TRIGGERS - audition voices, kokoro voices, voice comparison, tts voice, voice quality, compare voices.
settings-and-tuning
Configure TTS voices, speed, timeouts, queue depth, and bot settings. TRIGGERS - configure tts, change voice, tts speed, queue depth, tts timeout, bot config, tune settings, adjust parameters.
full-stack-bootstrap
One-time bootstrap for Kokoro TTS engine, Telegram bot, and BotFather setup. TRIGGERS - setup tts, install kokoro, botfather, bootstrap tts-tg-sync, configure telegram bot, full stack setup.
diagnostic-issue-resolver
Diagnose and resolve TTS and Telegram bot issues. TRIGGERS - tts not working, bot not responding, kokoro error, audio not playing, lock stuck, telegram bot troubleshoot, diagnose issue.
component-version-upgrade
Upgrade Kokoro model, bot dependencies, or TTS components. TRIGGERS - upgrade kokoro, update model, upgrade bot, update dependencies, version bump, component update.
clean-component-removal
Remove TTS and Telegram sync components cleanly. TRIGGERS - uninstall tts, remove telegram bot, uninstall kokoro, clean tts, teardown, component removal.
send-message
Use when user wants to send a text message on Telegram as their personal account via MTProto, text someone, or message a contact by username, phone, or chat ID.
send-media
Use when user wants to send or upload a file, photo, video, voice note, or document on Telegram via their personal account.
search-messages
Use when user wants to search for messages across all Telegram chats or within a specific chat, find old messages by text, or look up Telegram message history filtered by sender.
pin-message
Use when user wants to pin or unpin a message in a Telegram chat, group, or channel, or manage pinned messages.