corpus-export
Package corpus subsets as distribution archives. Select papers by cluster, topic, REF range, or custom filter; bundle PDFs, analysis docs, citation sidecars, web sources, and BibTeX into a tar.gz with manifest.
Best use case
corpus-export is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
It is a strong fit for teams already working in Codex.
Teams using corpus-export can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/corpus-export/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How corpus-export Compares
| Feature / Agent | corpus-export | Standard Approach |
|---|---|---|
| Platform Support | Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Package corpus subsets as distribution archives. Select papers by cluster, topic, REF range, or custom filter; bundle PDFs, analysis docs, citation sidecars, web sources, and BibTeX into a tar.gz with manifest.
Which AI agents support this skill?
This skill is designed for Codex.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Corpus Export

Package corpus subsets as distribution archives. Selects papers by cluster, topic, REF range, or custom filter and bundles all artifacts (PDF, analysis doc, citation sidecar, web source, BibTeX) into a portable archive with manifest.

## Triggers

- "export the corpus"
- "package papers for distribution"
- "create a distribution archive"
- "export agentic canon"
- "corpus export"
- `/corpus-export`

## Parameters

### Selection (one required)

#### `--cluster <name>`

Select all papers in a named cluster (from `/research-gap-detect`).

```bash
/corpus-export --cluster "Agentic Canon"
```

#### `--refs <range>`

Explicit REF range or list. Supports ranges, multi-ranges, and individual IDs.

```bash
/corpus-export --refs REF-016:REF-024,REF-121
/corpus-export --refs REF-016,REF-018,REF-024
```

#### `--topic <name>`

Select all papers tagged with a specific topic.

```bash
/corpus-export --topic "GUI Agents"
```

#### `--filter <expr>`

Custom filter expression (frontmatter field comparisons).

```bash
/corpus-export --filter "year>=2023 AND incoming>=10"
/corpus-export --filter "grade=High AND tag:reproducibility"
```

### Options

#### `--output <path>` (optional)

Output archive path. Default: `.aiwg/research/exports/corpus-<selector>-<date>.tar.gz`.

#### `--format tar.gz|zip` (optional)

Archive format. Default: `tar.gz`.

#### `--include` (optional, repeatable)

Artifact types to include. Defaults: `pdf,analysis,citations,bibtex`. Available: `pdf`, `text`, `web`, `analysis`, `citations`, `bibtex`, `metadata`, `provenance`.

#### `--dry-run` (optional)

List what would be included without creating the archive.
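The `--refs` syntax described under Selection can be illustrated with a small parser. This is only a sketch, not the skill's actual implementation: the function names `parse_ref` and `expand_refs` are hypothetical, and it assumes that ranges are inclusive, that `:` separates range endpoints, and that IDs are zero-padded to three digits as in the examples above.

```python
import re

# Assumption: identifiers look like REF-016 (prefix plus digits), as in the examples.
REF_PATTERN = re.compile(r"^REF-(\d+)$")


def parse_ref(token: str) -> int:
    """Return the numeric part of a single REF identifier."""
    match = REF_PATTERN.match(token.strip())
    if match is None:
        raise ValueError(f"not a REF identifier: {token!r}")
    return int(match.group(1))


def expand_refs(expr: str) -> list[str]:
    """Expand an expression such as 'REF-016:REF-018,REF-121' into a REF list.

    Ranges are treated as inclusive (an assumption); duplicates are dropped
    while preserving first-seen order.
    """
    seen: set[str] = set()
    result: list[str] = []
    for part in expr.split(","):
        if ":" in part:
            start_token, end_token = part.split(":", 1)
            start, end = parse_ref(start_token), parse_ref(end_token)
            if start > end:
                raise ValueError(f"range start exceeds end: {part!r}")
            numbers = range(start, end + 1)
        else:
            numbers = [parse_ref(part)]
        for number in numbers:
            ref = f"REF-{number:03d}"
            if ref not in seen:
                seen.add(ref)
                result.append(ref)
    return result
```

Under these assumptions, `expand_refs("REF-016:REF-018,REF-121")` yields the three IDs in the range followed by `REF-121`.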
## Execution Flow

### Phase 1: Selection

Resolve the selection criteria to a list of REF-XXX identifiers:

- `--cluster`: look up cluster in citation-network index, return member REFs
- `--refs`: parse range expression
- `--topic`: scan findings frontmatter for matching `tags`
- `--filter`: evaluate expression against frontmatter

Report resolved selection:

```
Selection: "Agentic Canon" cluster
Papers: 17 (REF-001, REF-016, REF-018, REF-024, ...)
```

### Phase 2: Artifact Gathering

For each selected REF, gather the configured artifact types from canonical locations:

```
REF-016:
  ✓ PDF: sources/pdfs/full/REF-016-autogen.pdf (2.4 MB)
  ✓ Analysis: findings/REF-016-autogen.md (287 lines)
  ✓ Citations: documentation/citations/REF-016.md (43 outgoing, 12 incoming)
  ✓ BibTeX: citations/bibtex/REF-016.bib
  ✗ Web: no web source (PDF primary)
  ✓ Metadata: sources/metadata/REF-016.yaml
```

Flag missing artifacts:

```
REF-299:
  ✗ PDF: MISSING (acquisition failed)
  ✓ Analysis: findings/REF-299-stub.md (22 lines - STUB)
  ...
```

### Phase 3: Manifest Generation

Write a `MANIFEST.md` to the archive root describing the export:

```markdown
# Corpus Export Manifest

**Date**: 2026-04-13
**Selector**: --cluster "Agentic Canon"
**Papers**: 17
**Total size**: 48.3 MB

## Contents

| REF | Title | Year | GRADE | PDF | Analysis | Citations |
|-----|-------|------|-------|-----|----------|-----------|
| REF-016 | AutoGen | 2023 | High | ✓ | 287 lines | 43/12 |
| REF-018 | Multi-Agent Debate | 2024 | High | ✓ | 312 lines | 28/17 |
...

## Missing Artifacts

- REF-299: PDF missing (acquisition failed)
- REF-312: Analysis doc is a skeleton (<40 lines)

## Provenance

Generated by `corpus-export` v1.0 from corpus at:
- Fixity manifest: .aiwg/research/fixity-manifest.json (checksum: abc123...)
- Citation graph: indices/citation-network.md (generated 2026-04-13T10:00Z)
```

### Phase 4: Archive Creation

Create the archive with structure:

```
corpus-agentic-canon-2026-04-13.tar.gz
├── MANIFEST.md
├── pdfs/
│   ├── REF-016-autogen.pdf
│   ├── REF-018-multi-agent-debate.pdf
│   └── ...
├── findings/
│   ├── REF-016-autogen.md
│   └── ...
├── citations/
│   ├── REF-016.md
│   └── ...
├── bibtex/
│   ├── REF-016.bib
│   └── all.bib          # concatenated bibliography
└── README.md            # extraction + usage instructions
```

### Phase 5: Report

```
Corpus Export Complete
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Selector: --cluster "Agentic Canon"
Papers selected: 17
Artifacts bundled: 68 files
Missing artifacts: 2 (reported in MANIFEST.md)

Archive: .aiwg/research/exports/corpus-agentic-canon-2026-04-13.tar.gz
Size: 48.3 MB
SHA-256: abc123def456...

Contents:
  17 PDFs (45.1 MB)
  17 analysis docs (1.2 MB)
  17 citation sidecars (0.8 MB)
  17 BibTeX entries + all.bib (50 KB)
  1 MANIFEST.md (4 KB)
  1 README.md (2 KB)
```

## Archive Use Cases

### Research sharing

Share a cluster with collaborators without sharing the entire corpus.

```bash
/corpus-export --cluster "Agentic Canon"
```

### Snapshot for publication

Package the corpus state referenced by a paper for reproducibility.

```bash
/corpus-export --refs REF-016:REF-024 --include pdf,analysis,citations,provenance
```

### Topic digest

Export everything on a specific topic for a focused review.

```bash
/corpus-export --topic "Evaluation" --filter "year>=2023"
```

### Quality subset

Export only high-quality sources.

```bash
/corpus-export --filter "grade=High"
```

## Integration Points

| Component | Relationship |
|-----------|-------------|
| `research-gap-detect` | Provides `--cluster` names |
| `corpus-index-build` | Provides topic and metadata for selection |
| `research-quality-audit` | Flags missing/skeleton artifacts in manifest |
| `research-cite` | Generates BibTeX entries bundled in export |
| Media curator `/acquire` | Source of PDF files packaged into export |

## Examples

```bash
# Export a named cluster
/corpus-export --cluster "Agentic Canon"

# Export a REF range
/corpus-export --refs REF-016:REF-024,REF-121

# Export by topic
/corpus-export --topic "GUI Agents"

# Filter: recent high-grade papers with many citations
/corpus-export --filter "year>=2023 AND grade=High AND incoming>=10"

# Preview without creating archive
/corpus-export --cluster "Agentic Canon" --dry-run

# Minimal export (analysis docs and citation sidecars only)
/corpus-export --topic "Reproducibility" --include analysis,citations

# Custom output path
/corpus-export --refs REF-001:REF-100 --output /tmp/first-100.tar.gz
```

## References

- @$AIWG_ROOT/agentic/code/frameworks/research-complete/skills/research-gap-detect/SKILL.md: Provides cluster names
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/skills/corpus-index-build/SKILL.md: Provides topic/metadata indices
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/skills/research-quality-audit/SKILL.md: Flags missing artifacts
- @$AIWG_ROOT/agentic/code/frameworks/research-complete/skills/research-cite/SKILL.md: Generates BibTeX
- @$AIWG_ROOT/docs/integrations/media-curator-to-research-handoff.md: Source acquisition contract
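
## Sketch: archive creation and checksum

Phases 4 and 5 of the execution flow (bundling artifacts into per-type directories, then reporting a SHA-256) could look roughly like the following. This is a hedged sketch using only the Python standard library, not the skill's actual code: `build_archive` and the `ARCHIVE_DIRS` mapping are hypothetical names, and the directory names are taken from the Phase 4 tree.

```python
import hashlib
import tarfile
from pathlib import Path

# Hypothetical mapping from --include artifact type to archive subdirectory,
# mirroring the layout shown in Phase 4.
ARCHIVE_DIRS = {
    "pdf": "pdfs",
    "analysis": "findings",
    "citations": "citations",
    "bibtex": "bibtex",
}


def build_archive(artifacts: dict[str, list[Path]], manifest: Path, output: Path) -> str:
    """Write a tar.gz with MANIFEST.md at the root and return its SHA-256 hex digest."""
    output.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(output, "w:gz") as tar:
        tar.add(manifest, arcname="MANIFEST.md")
        for kind, paths in artifacts.items():
            for path in paths:
                # Flatten each artifact into its per-type directory.
                tar.add(path, arcname=f"{ARCHIVE_DIRS[kind]}/{path.name}")

    # Hash the finished archive in 1 MiB chunks to keep memory use flat.
    digest = hashlib.sha256()
    with output.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A real export would also write `README.md` and `all.bib` and handle the `zip` format; those are left out here to keep the sketch short.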