infrastructure-search
Discovery utilities for academic literature. Currently exposes the `literature` submodule — Paperclip-style multi-source search across arXiv, Crossref, local JSON corpora, and (opt-in) the Paperclip API, with deterministic JSON caching, a `LiteratureClient` aggregator, normalised `Paper` records, and a CLI. Use when the user wants to find papers, build reading lists, populate references.bib from a query, or replay a prior search reproducibly. Designed to host additional discovery workflows without breaking the public API.
Best use case
infrastructure-search is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using infrastructure-search should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/search/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How infrastructure-search Compares
| Feature / Agent | infrastructure-search | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# Search Module
Discovery utilities for academic literature, modelled after the
agent-native abstractions of [Paperclip](https://paperclip.gxl.ai): every
backend produces normalised `Paper` records that downstream consumers
(citation export, manuscript synthesis, agent loops) can treat uniformly.
## `literature` — Multi-source literature search
```python
from infrastructure.search.literature import (
    Paper, SearchQuery, SearchResult, merge_papers,
    SearchBackend, LocalBackend, CrossrefBackend, ArxivBackend, PaperclipBackend,
    LiteratureClient, SearchCache,
    HttpClient, UrllibHttpClient, HttpResponse, BackendError,
)
```
### Search across arXiv + Crossref
```python
client = LiteratureClient([ArxivBackend(), CrossrefBackend(mailto="you@example.org")])
result = client.search(SearchQuery(text="protein language model fitness", max_results=20))
print(f"{len(result)} unique papers from {len(result.per_source_counts)} backends")
for paper in result.papers[:5]:
    print(f"  [{paper.score:.2f}] {paper.title} ({paper.year}) {paper.doi or paper.url}")
```
### Search a local JSON corpus (offline-friendly)
```python
backend = LocalBackend("data/curated_corpus.json")
result = LiteratureClient([backend]).search(SearchQuery(text="convex"))
```
Corpus format — either a list of `Paper` dicts or `{"papers": [...]}`:
```json
[
  {
    "id": "doi:10.1126/science.1213847",
    "title": "Reproducible research in computational science",
    "authors": ["Roger D Peng"],
    "year": 2011,
    "doi": "10.1126/science.1213847",
    "venue": "Science",
    "venue_type": "journal"
  }
]
```
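Accepting both corpus shapes comes down to unwrapping the optional `"papers"` key. The following is a minimal sketch of that idea, not the `LocalBackend` implementation; `parse_corpus` is a hypothetical helper name introduced here for illustration.

```python
import json
from typing import Any


def parse_corpus(text: str) -> list[dict[str, Any]]:
    """Return the paper dicts from either accepted corpus shape.

    Hypothetical helper for illustration; not part of the module's API.
    """
    data = json.loads(text)
    if isinstance(data, dict):
        # Wrapped form: {"papers": [...]}
        data = data.get("papers")
    if not isinstance(data, list):
        raise ValueError("corpus must be a list of papers or {'papers': [...]}")
    return data
```

Both `parse_corpus('[{"id": "a"}]')` and `parse_corpus('{"papers": [{"id": "a"}]}')` yield the same list, so downstream code only has to handle one shape.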
### Search Paperclip (API key required)
```python
import os
backend = PaperclipBackend(api_key=os.environ["PAPERCLIP_API_KEY"])
result = LiteratureClient([backend]).search(
    SearchQuery(text="GRPO hyperparameters", sources=["arxiv"], max_results=50)
)
```
### Cache results for reproducibility
```python
cache = SearchCache("output/search_cache", ttl_seconds=3600 * 24)
client = LiteratureClient([ArxivBackend(), CrossrefBackend()], cache=cache)
# First call hits the network and writes search_<hash>.json.
client.search(SearchQuery(text="adam optimizer"))
# Re-running the identical query is a deterministic file read.
client.search(SearchQuery(text="adam optimizer"))
```
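One plausible way to obtain a stable `search_<hash>.json` name is to hash a canonical serialisation of the query. The sketch below assumes SHA-256 over sorted-key JSON, a lowercased and whitespace-collapsed query text, and a default of 20 results; how `SearchCache` actually derives its key is not documented here, and `cache_filename` is a hypothetical name.

```python
import hashlib
import json
from typing import Sequence


def cache_filename(text: str, sources: Sequence[str] = (), max_results: int = 20) -> str:
    """Derive a stable file name from the fields that identify a query.

    Illustrative only: the field set and normalisation are assumptions.
    """
    identity = {
        "text": " ".join(text.lower().split()),  # collapse case and whitespace
        "sources": sorted(sources),              # source order must not change the key
        "max_results": max_results,
    }
    # sort_keys plus fixed separators gives one byte string per logical query.
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"search_{digest}.json"
```

With this scheme, two queries that differ only in letter case, spacing, or source order map to the same file, while changing `max_results` produces a different one.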
### Merge / deduplicate results manually
```python
from infrastructure.search.literature import merge_papers
unique = merge_papers(result_a.papers + result_b.papers)
```
Deduplication priority: DOI → arXiv id → normalised (title, year). Higher
`score` wins; missing fields on the winner are filled from the loser
("union of evidence").
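The priority rule above can be sketched over plain dicts. This is a simplified illustration, not the code behind `merge_papers`: the `arxiv_id` field name is an assumption, `dedup_key` and `merge_sketch` are hypothetical names, and each record is keyed only by its strongest identifier, so a record with a DOI is not matched against a copy that carries only an arXiv id.

```python
import re
from typing import Any

PaperDict = dict[str, Any]


def dedup_key(paper: PaperDict) -> tuple[str, str]:
    """Pick the strongest identifier: DOI, then arXiv id, then (title, year)."""
    if paper.get("doi"):
        return ("doi", paper["doi"].lower())
    if paper.get("arxiv_id"):  # assumed field name
        return ("arxiv", paper["arxiv_id"].lower())
    title = re.sub(r"[^a-z0-9]+", " ", (paper.get("title") or "").lower()).strip()
    return ("title", f"{title}|{paper.get('year')}")


def merge_sketch(papers: list[PaperDict]) -> list[PaperDict]:
    """Deduplicate records, keeping the higher-scored one and filling its gaps."""
    merged: dict[tuple[str, str], PaperDict] = {}
    for paper in papers:
        key = dedup_key(paper)
        current = merged.get(key)
        if current is None:
            merged[key] = dict(paper)
            continue
        # Higher score wins; the loser only fills fields the winner lacks.
        if (paper.get("score") or 0.0) > (current.get("score") or 0.0):
            winner, loser = dict(paper), current
        else:
            winner, loser = current, paper
        for field, value in loser.items():
            if winner.get(field) in (None, "", []):
                winner[field] = value
        merged[key] = winner
    return list(merged.values())
```

Copying each record with `dict(paper)` before merging keeps the caller's inputs unmodified.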
### CLI
```bash
# JSON to stdout
uv run python -m infrastructure.search.literature.cli search \
"scaling laws" --source arxiv,crossref --max-results 10
# Direct BibTeX to a file
uv run python -m infrastructure.search.literature.cli to-bibtex \
"GRPO hyperparameters" \
--source arxiv \
--output output/grpo_refs.bib
# Cached, offline-only over a local corpus
uv run python -m infrastructure.search.literature.cli search \
"convex" --source local --corpus data/corpus.json \
--cache-dir output/cache
```
## End-to-End: search → BibTeX → PDF
```python
from infrastructure.search.literature import (
    LiteratureClient, SearchQuery, ArxivBackend, CrossrefBackend,
)
from infrastructure.reference.citation import paper_to_bibentry, write_bibfile
from infrastructure.reference.citation.models import BibDatabase
client = LiteratureClient([ArxivBackend(), CrossrefBackend()])
result = client.search(SearchQuery(text="reproducible research", max_results=15))
db = BibDatabase()
for paper in result.papers:
    db.add(paper_to_bibentry(paper))
write_bibfile("projects/my_project/manuscript/references.bib", db)
```
## Reliability Properties
* **Per-backend failure isolation**: a network outage in one backend records
an entry in `result.errors[name]` and leaves the rest of the search intact.
* **Deterministic caching**: `SearchCache` keys on the canonical query
identity; cached files are pretty-printed JSON, version-control friendly.
* **Year filters re-applied defensively** by the aggregator even when a
backend ignores them — protects downstream code.
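The first and third properties can be illustrated with a small aggregator loop. This is a sketch under stated assumptions rather than the `LiteratureClient` implementation: backends are modelled as plain callables, `search_all`, `year_from`, and `year_to` are hypothetical names, and records with an unknown year are dropped whenever a year bound is set.

```python
from typing import Any, Callable, Optional

Backend = Callable[[str], list[dict[str, Any]]]


def search_all(
    backends: dict[str, Backend],
    text: str,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> tuple[list[dict[str, Any]], dict[str, str]]:
    """Query every backend, recording failures instead of raising."""
    papers: list[dict[str, Any]] = []
    errors: dict[str, str] = {}
    for name, backend in backends.items():
        try:
            found = backend(text)
        except Exception as exc:  # one failing backend must not sink the others
            errors[name] = f"{type(exc).__name__}: {exc}"
            continue
        for paper in found:
            year = paper.get("year")
            # Re-apply the year window even if the backend ignored it.
            if year_from is not None and (year is None or year < year_from):
                continue
            if year_to is not None and (year is None or year > year_to):
                continue
            papers.append(paper)
    return papers, errors
```

A caller can then inspect the returned `errors` mapping to decide whether a partial result is acceptable, which mirrors the role `result.errors[name]` plays in the description above.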
## Related Modules
* [`infrastructure.reference.citation`](../reference/SKILL.md) — export side
of the literature workflow (BibTeX writer, parser, converter).
* [`infrastructure.publishing`](../publishing/SKILL.md) — APA / MLA / DOI
utilities for the resulting publications.
Related Skills
infrastructure-validation
Skill for the validation infrastructure module providing PDF validation, markdown validation, output integrity checks, link verification, documentation audits, issue categorization, and repository scanning. Use when validating research outputs, checking document quality, running audits, or verifying cross-references.
infrastructure-steganography
Skill for the steganography infrastructure module providing QR code generation with dynamic mailto links, hash manifests, metadata payloads, and document-wide overlay processing. Use this module to insert opt-in cryptographic and steganographic provenance data onto PDFs.
infrastructure-skills
Programmatic discovery of first-party agent SKILL.md files under configured public repo roots (infrastructure, projects, docs/prompts, and .cursor/skills). Use when enumerating skills, validating .cursor/skill_manifest.json, writing docs/_generated/skills_index.md, checking docs/prompts workflow contracts, or wiring editor automation. Exposes discover_skills, write_skill_manifest, manifest_matches_discovery, and check_skill_contracts.
infrastructure-search-literature
Paperclip-style multi-source literature search across arXiv, Crossref, local JSON corpora, and (opt-in) the Paperclip API. Provides Paper/SearchQuery/SearchResult data models, a LiteratureClient aggregator with per-backend failure isolation, DOI/arXiv-aware deduplication via merge_papers, deterministic JSON caching via SearchCache, an HttpClient protocol for test injection, and a CLI (search/to-bibtex). Use when finding papers by topic, building reading lists, populating references.bib from a query, or replaying a prior search reproducibly.
infrastructure-scientific
Skill for the scientific infrastructure module providing numerical stability checks, performance benchmarking, scientific documentation generation, implementation validation, and module/workflow templates. Use when benchmarking functions, checking numerical stability, validating scientific implementations, or creating scientific module scaffolds.
infrastructure-reporting
Skill for the reporting infrastructure module providing pipeline reporting, error aggregation, executive summaries, dashboard generation, test reporting, and multi-project reports. Use when generating build reports, aggregating errors, creating visual dashboards, or producing executive summaries across projects.
infrastructure-rendering
Skill for the rendering infrastructure module providing multi-format output generation including PDF manuscripts, HTML web pages, Beamer/Reveal.js slides, and posters. Use when rendering research outputs, converting markdown to PDF, generating slides, or configuring LaTeX rendering.
infrastructure-reference-citation
BibTeX read/write/convert that matches the syntax/semantics of projects/template_code_project/manuscript/references.bib (consumed by Pandoc with --natbib -- see infrastructure/rendering/_pdf_combined_renderer.py). Provides BibEntry/BibDatabase models, parse_bibfile/render_database functions, paper_to_bibentry conversion from literature search results, generate_citation_key in the project's house style (firstauthorlastname+year+firsttitleword), LaTeX-special-character escape helpers, and a CLI (validate/format/convert). Use when reading or writing .bib files, exporting search results to BibTeX, or generating citation keys.
infrastructure-reference
Bibliographic-reference utilities for research projects. Read, write, and convert BibTeX entries that match the syntax/semantics of projects/template_code_project/manuscript/references.bib (consumed by Pandoc with --natbib during PDF render -- see infrastructure/rendering/_pdf_combined_renderer.py). Currently exposes the `citation` submodule (BibTeX I/O + Paper→BibEntry conversion); designed to host additional reference workflows (e.g. CSL-JSON export, ORCID lookups) without breaking the public API.
infrastructure-publishing
Skill for the publishing infrastructure module providing academic publishing workflows including BibTeX CLI citation generation, APA/MLA citation helper functions, DOI management, Zenodo publication, arXiv submission preparation, GitHub releases, and publication readiness validation. Use when publishing research, generating citations, minting DOIs, or preparing submissions.
infrastructure-prose
Prose analysis utilities for research manuscripts and prose-focused projects. Provides readability metrics (Flesch, Flesch-Kincaid, Gunning Fog), heading-outline structural analysis, editorial quality flags (passive voice, hedge words, citation density, long sentences), aggregate ManuscriptReport across a manuscript directory, and a CLI (metrics/outline/quality/report). Use when analyzing manuscripts for readability, building editorial dashboards, validating heading structure, extracting citation keys from prose, or wiring prose-quality gates into the pipeline.
infrastructure-project
Skill for the project management infrastructure module providing multi-project discovery, structure validation, and metadata extraction. Use when discovering active projects, validating project directory structure, or extracting project configuration metadata.