repository-harvesting-guide
Harvest metadata from open repositories using the OAI-PMH protocol
Best use case
repository-harvesting-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using repository-harvesting-guide should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/repository-harvesting-guide/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
How repository-harvesting-guide Compares
| Feature / Agent | repository-harvesting-guide | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
It harvests metadata from open repositories using the OAI-PMH protocol.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Repository Harvesting Guide
A skill for harvesting metadata from open access repositories using the OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) protocol. Covers protocol fundamentals, building harvesters in Python, handling resumption tokens for large collections, metadata format parsing (Dublin Core, MARC, METS), selective harvesting by date and set, and integrating harvested data into research workflows.
## OAI-PMH Protocol Fundamentals
### What Is OAI-PMH
OAI-PMH is a standardized protocol that allows metadata to be harvested from repository systems. It is a long-standing basis for library interoperability and is supported by the major institutional repository platforms (such as DSpace and EPrints) as well as many preprint servers and digital libraries.
```
OAI-PMH Architecture:
Data Providers (repositories):
- Expose metadata through a standardized HTTP interface
- Must support Dublin Core as minimum metadata format
- May support additional formats (MARC, MODS, DataCite, etc.)
- Examples: arXiv, PubMed Central, DSpace repositories,
EPrints, institutional repositories
Service Providers (harvesters):
- Send HTTP requests to data providers
- Collect, aggregate, and index metadata
- Build search services, union catalogs, analytics
- Examples: BASE (Bielefeld), CORE, OpenAIRE
Protocol Version: 2.0 (current, since 2002)
Transport: HTTP GET or POST
Response format: XML
Base URL example: https://export.arxiv.org/oai2
```
### Six OAI-PMH Verbs
```
OAI-PMH defines exactly six request types (verbs):
1. Identify
Purpose: Describe the repository
URL: baseURL?verb=Identify
Returns: repository name, admin email, earliest datestamp,
granularity, compression support
2. ListMetadataFormats
Purpose: List available metadata formats
URL: baseURL?verb=ListMetadataFormats
Returns: format prefixes (oai_dc, marc21, datacite, etc.)
Optional: identifier parameter to check formats for one record
3. ListSets
Purpose: List available sets (collections/categories)
URL: baseURL?verb=ListSets
Returns: set names and specs for selective harvesting
Example sets: physics:hep-th, cs:AI, math:AG
4. ListIdentifiers
Purpose: List record identifiers (headers only, no metadata)
URL: baseURL?verb=ListIdentifiers&metadataPrefix=oai_dc
Optional: from, until, set parameters
Returns: identifiers, datestamps, set memberships
5. ListRecords
Purpose: Harvest full metadata records
URL: baseURL?verb=ListRecords&metadataPrefix=oai_dc
Optional: from, until, set parameters
Returns: complete metadata records in requested format
6. GetRecord
Purpose: Retrieve a single record by identifier
URL: baseURL?verb=GetRecord&identifier=oai:arXiv.org:2301.00001
&metadataPrefix=oai_dc
Returns: one complete metadata record
```
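The verbs above map directly onto query strings. A minimal sketch, assuming a hypothetical helper named `build_oai_url` (not part of any library) and a placeholder base URL, shows how each request could be assembled with the standard library:

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_oai_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL (hypothetical helper, for illustration)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"Unknown OAI-PMH verb: {verb}")
    # "from" is a Python keyword, so callers pass from_ instead
    if "from_" in arguments:
        arguments["from"] = arguments.pop("from_")
    # Drop unset arguments so they do not appear in the query string
    query = {"verb": verb}
    query.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


# https://example.org/oai is a placeholder, not a real endpoint
print(build_oai_url("https://example.org/oai", "Identify"))
print(build_oai_url("https://example.org/oai", "ListRecords",
                    metadataPrefix="oai_dc", from_="2024-01-01", set="cs"))
```

Rejecting unknown verbs locally avoids a round trip that would only return a `badVerb` error from the repository.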
## Building a Harvester in Python
### Basic Harvester
```python
import time
import xml.etree.ElementTree as ET

import requests

# XML namespaces used in OAI-PMH responses and Dublin Core payloads
OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def harvest_records(base_url, metadata_prefix="oai_dc",
                    from_date=None, until_date=None,
                    set_spec=None):
    """
    Harvest all records from an OAI-PMH endpoint.
    Handles resumption tokens for paginated results.

    Args:
        base_url: OAI-PMH base URL
        metadata_prefix: metadata format (default: oai_dc)
        from_date: selective harvest start (YYYY-MM-DD)
        until_date: selective harvest end (YYYY-MM-DD)
        set_spec: restrict to a specific set
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
    }
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    if set_spec:
        params["set"] = set_spec

    all_records = []
    request_count = 0

    while True:
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        request_count += 1
        root = ET.fromstring(response.content)

        # Parse records from this page
        records = root.findall(f".//{{{OAI_NS}}}record")
        for record in records:
            parsed = parse_dublin_core(record)
            if parsed:
                all_records.append(parsed)

        # Check for resumption token
        token_elem = root.find(f".//{{{OAI_NS}}}resumptionToken")
        if token_elem is not None and token_elem.text:
            # resumptionToken is an exclusive argument: send it with
            # the verb only, without the original request parameters
            params = {
                "verb": "ListRecords",
                "resumptionToken": token_elem.text.strip(),
            }
            # Polite delay between requests
            time.sleep(2)
        else:
            # Missing or empty token marks the last page
            break

    print(f"Harvested {len(all_records)} records "
          f"in {request_count} requests")
    return all_records


def parse_dublin_core(record_element):
    """
    Parse a Dublin Core metadata record into a dictionary.
    Returns None for deleted or malformed records.
    """
    header = record_element.find(f"{{{OAI_NS}}}header")
    metadata = record_element.find(f"{{{OAI_NS}}}metadata")
    if header is None or metadata is None:
        return None

    # Check if record is deleted
    status = header.get("status", "")
    if status == "deleted":
        return None

    identifier = header.findtext(f"{{{OAI_NS}}}identifier", "")
    datestamp = header.findtext(f"{{{OAI_NS}}}datestamp", "")

    result = {
        "oai_identifier": identifier,
        "datestamp": datestamp,
        "title": find_dc_text(metadata, "title"),
        "creator": find_dc_all(metadata, "creator"),
        "subject": find_dc_all(metadata, "subject"),
        "description": find_dc_text(metadata, "description"),
        "date": find_dc_text(metadata, "date"),
        "type": find_dc_text(metadata, "type"),
        "identifier": find_dc_all(metadata, "identifier"),
        "language": find_dc_text(metadata, "language"),
        "rights": find_dc_text(metadata, "rights"),
    }
    return result


def find_dc_text(metadata, element_name):
    """Find first Dublin Core element text ("" if absent or empty)."""
    elem = metadata.find(f".//{{{DC_NS}}}{element_name}")
    return (elem.text or "") if elem is not None else ""


def find_dc_all(metadata, element_name):
    """Find all non-empty values of a Dublin Core element."""
    elems = metadata.findall(f".//{{{DC_NS}}}{element_name}")
    return [e.text for e in elems if e.text]
```
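OAI-PMH reports protocol-level problems (for example `badResumptionToken` or `noRecordsMatch`) as `<error>` elements inside the XML body, so checking the HTTP status alone may not catch them. A minimal sketch, assuming a hypothetical `check_oai_errors` helper that could be called on `root` inside the harvesting loop above. Treating `noRecordsMatch` as an empty result rather than a failure is a design assumption, and the XML is a hand-written sample:

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


class OAIError(Exception):
    """Raised when a response carries an OAI-PMH <error> element."""


def check_oai_errors(root):
    """
    Inspect a parsed OAI-PMH response for <error> elements.

    Returns True when there is no error, False when the only error is
    noRecordsMatch (an empty harvest), and raises OAIError otherwise.
    """
    # <error> elements are direct children of the <OAI-PMH> root
    errors = root.findall(f"{{{OAI_NS}}}error")
    if not errors:
        return True
    codes = [e.get("code", "unknown") for e in errors]
    if codes == ["noRecordsMatch"]:
        return False
    details = "; ".join(
        f"{e.get('code', 'unknown')}: {(e.text or '').strip()}"
        for e in errors
    )
    raise OAIError(details)


# Hand-written sample response, for illustration only
SAMPLE_ERROR = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-01-01T00:00:00Z</responseDate>
  <request verb="ListRecords">https://example.org/oai</request>
  <error code="badResumptionToken">Token has expired</error>
</OAI-PMH>"""

try:
    check_oai_errors(ET.fromstring(SAMPLE_ERROR))
except OAIError as exc:
    print(f"Harvest failed: {exc}")
```

An expired token usually means the harvest has to restart from its original `from`/`until` arguments, so a harvester may want to log how far it got before raising.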
## Selective Harvesting
### By Date Range
```
Incremental harvesting strategy:
First harvest: Get everything
from_date = None (or repository's earliestDatestamp)
until_date = today
Subsequent harvests: Get only new/modified records
from_date = last_harvest_date
until_date = today
Date granularity:
- Day-level: YYYY-MM-DD (every repository must support this)
- Second-level: YYYY-MM-DDThh:mm:ssZ (optional; some repositories)
- Check the Identify response for supported granularity
Important: OAI-PMH datestamps reflect the date the METADATA
was last modified, not the publication date. A record edited
yesterday to fix a typo will appear in a harvest with
from=yesterday, even if the paper was published in 2015.
```
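The incremental strategy above can be sketched as a small helper that formats the `from` and `until` arguments at the granularity the repository reports. The names `format_datestamp` and `harvest_window` are hypothetical, and the sketch assumes timezone-aware datetimes:

```python
from datetime import datetime, timezone

# Granularity strings as they appear in an Identify response
DAY = "YYYY-MM-DD"
SECONDS = "YYYY-MM-DDThh:mm:ssZ"


def format_datestamp(moment, granularity=DAY):
    """Format an aware datetime as a UTC OAI-PMH datestamp."""
    moment = moment.astimezone(timezone.utc)
    if granularity == SECONDS:
        return moment.strftime("%Y-%m-%dT%H:%M:%SZ")
    if granularity == DAY:
        return moment.strftime("%Y-%m-%d")
    raise ValueError(f"Unsupported granularity: {granularity}")


def harvest_window(last_harvest, now, granularity=DAY):
    """
    Return the (from, until) arguments for the next harvest.
    last_harvest is None for the first, full harvest.
    """
    from_date = (format_datestamp(last_harvest, granularity)
                 if last_harvest else None)
    return from_date, format_datestamp(now, granularity)


now = datetime(2024, 3, 15, 9, 30, tzinfo=timezone.utc)
last = datetime(2024, 3, 1, 22, 0, tzinfo=timezone.utc)

print(harvest_window(None, now))           # (None, '2024-03-15')
print(harvest_window(last, now))           # ('2024-03-01', '2024-03-15')
print(harvest_window(last, now, SECONDS))  # ('2024-03-01T22:00:00Z', '2024-03-15T09:30:00Z')
```

The returned pair can be passed as `from_date` and `until_date` to `harvest_records`. Storing the `until` value of each successful run gives the `last_harvest` for the next one.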
### By Set (Collection)
```
Common set structures by repository type:
arXiv:
physics, physics:hep-th, cs, cs:AI, math, math:AG, etc.
DSpace repositories:
com_12345_1 (community), col_12345_2 (collection)
Hierarchical: department -> collection
PubMed Central:
By journal: pmc-journal-name
By funder: pmc-funder-name
Strategy:
1. Call ListSets to see available sets
2. Identify sets relevant to your research topic
3. Harvest only those sets to reduce data volume
4. Store the set membership for each record
```
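Steps 1 and 2 of the strategy above can be sketched by parsing a `ListSets` response and filtering on the `setSpec` prefix. The XML below is a hand-written sample, not output from a real repository, and `list_sets` and `filter_sets` are hypothetical helper names:

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hand-written sample response, for illustration only
SAMPLE_LIST_SETS = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>physics</setSpec><setName>Physics</setName></set>
    <set><setSpec>physics:hep-th</setSpec><setName>High Energy Physics - Theory</setName></set>
    <set><setSpec>cs</setSpec><setName>Computer Science</setName></set>
  </ListSets>
</OAI-PMH>"""


def list_sets(xml_text):
    """Return {setSpec: setName} from one page of a ListSets response."""
    root = ET.fromstring(xml_text)
    sets = {}
    for set_elem in root.iter(f"{{{OAI_NS}}}set"):
        spec = set_elem.findtext(f"{{{OAI_NS}}}setSpec", "")
        name = set_elem.findtext(f"{{{OAI_NS}}}setName", "")
        if spec:
            sets[spec] = name
    return sets


def filter_sets(sets, prefix):
    """Keep a set and its colon-separated descendants."""
    return {spec: name for spec, name in sets.items()
            if spec == prefix or spec.startswith(prefix + ":")}


available = list_sets(SAMPLE_LIST_SETS)
print(filter_sets(available, "physics"))
# {'physics': 'Physics', 'physics:hep-th': 'High Energy Physics - Theory'}
```

Each selected `setSpec` can then be passed as `set_spec` to `harvest_records`. Large repositories may paginate `ListSets` with a resumption token, which this sketch does not follow.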
## Data Quality and Deduplication
### Common Quality Issues
```
Quality problems in harvested metadata:
1. Duplicate records:
- Same paper in multiple repositories
- Same paper in multiple sets within one repository
- Solution: Deduplicate by DOI, then by title similarity
2. Incomplete metadata:
- Missing abstracts (very common)
- Missing author identifiers
- Missing dates or using inconsistent date formats
- Solution: Enrich with Crossref or OpenAlex lookups
3. Encoding issues:
- Non-UTF-8 characters in older repositories
- HTML entities in text fields
- Solution: Normalize encoding, strip HTML tags
4. Inconsistent formats:
- Dates as "2023", "2023-01", "2023-01-15", "January 2023"
- Author names as "Smith, John" vs "John Smith" vs "J. Smith"
- Solution: Parse and normalize to canonical formats
```
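The first fix listed above, deduplicating by DOI and then by title similarity, can be sketched with the standard library. The record layout follows the dictionaries produced by `parse_dublin_core`. The DOI regular expression, the 0.95 similarity threshold, and the helper names are assumptions for illustration and would need tuning on real data:

```python
import re
from difflib import SequenceMatcher

# Loose DOI pattern (an assumption, not an official grammar)
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+", re.IGNORECASE)


def extract_doi(record):
    """Return the first DOI in the record's identifier list, lowercased."""
    for value in record.get("identifier", []):
        match = DOI_PATTERN.search(value)
        if match:
            return match.group(0).lower()
    return None


def normalize_title(title):
    """Lowercase and collapse punctuation and whitespace (ASCII only)."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records, threshold=0.95):
    """Drop records whose DOI or near-identical title was already seen."""
    seen_dois = set()
    kept = []
    kept_titles = []
    for record in records:
        doi = extract_doi(record)
        if doi and doi in seen_dois:
            continue
        title = normalize_title(record.get("title", ""))
        # Pairwise comparison is quadratic: fine for a sketch, slow at scale
        if title and any(
            SequenceMatcher(None, title, other).ratio() >= threshold
            for other in kept_titles
        ):
            continue
        if doi:
            seen_dois.add(doi)
        kept.append(record)
        kept_titles.append(title)
    return kept


records = [
    {"title": "Deep Learning for Metadata",
     "identifier": ["https://doi.org/10.1234/abc.1"]},
    {"title": "Deep learning for metadata.",
     "identifier": ["doi:10.1234/ABC.1"]},
    {"title": "Deep Learning for  Metadata", "identifier": []},
    {"title": "A Different Paper", "identifier": []},
]
print([r["title"] for r in deduplicate(records)])
# ['Deep Learning for Metadata', 'A Different Paper']
```

Keeping the first record seen is arbitrary. A real pipeline might instead prefer the record with the most complete metadata, and would need a title normalizer that handles non-Latin scripts.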
## Notable OAI-PMH Endpoints
```
Major repositories with OAI-PMH support:
arXiv: https://export.arxiv.org/oai2
PubMed Central: https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi
Europeana: https://oai.europeana.eu/oai
HAL (France): https://api.archives-ouvertes.fr/oai/hal
DBLP: https://dblp.org/oai
CiteSeerX: https://citeseerx.ist.psu.edu/oai2
To find more endpoints:
- OpenDOAR directory: https://v2.sherpa.ac.uk/opendoar/
- ROAR (Registry of Open Access Repositories)
- BASE (Bielefeld Academic Search Engine) source list
```
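Before harvesting from an endpoint, its `Identify` response shows which datestamp granularity and earliest datestamp to use. A minimal sketch that parses such a response. The XML is a hand-written sample (not captured from any repository listed above) and `parse_identify` is a hypothetical helper name:

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"

# Hand-written sample response, for illustration only
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2024-01-01T00:00:00Z</responseDate>
  <request verb="Identify">https://example.org/oai</request>
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <baseURL>https://example.org/oai</baseURL>
    <protocolVersion>2.0</protocolVersion>
    <adminEmail>admin@example.org</adminEmail>
    <earliestDatestamp>2005-01-01</earliestDatestamp>
    <deletedRecord>persistent</deletedRecord>
    <granularity>YYYY-MM-DD</granularity>
  </Identify>
</OAI-PMH>"""

# Fields a harvester typically needs before its first request
FIELDS = ("repositoryName", "baseURL", "protocolVersion",
          "earliestDatestamp", "deletedRecord", "granularity")


def parse_identify(xml_text):
    """Return selected Identify fields as a dictionary."""
    root = ET.fromstring(xml_text)
    identify = root.find(f"{{{OAI_NS}}}Identify")
    if identify is None:
        raise ValueError("Response has no Identify element")
    return {name: identify.findtext(f"{{{OAI_NS}}}{name}", "")
            for name in FIELDS}


info = parse_identify(SAMPLE_IDENTIFY)
print(info["granularity"], info["earliestDatestamp"])
# YYYY-MM-DD 2005-01-01
```

For a live endpoint, the body of a `requests.get(base_url, params={"verb": "Identify"}, timeout=30)` response could be passed to `parse_identify` in place of the sample.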
OAI-PMH harvesting remains a dependable, widely supported method for building comprehensive metadata collections from open repositories. While newer approaches such as ResourceSync and Signposting offer richer functionality, OAI-PMH's broad adoption and simplicity make it the practical choice for most academic metadata collection tasks.
Related Skills
thuthesis-guide
Write Tsinghua University theses using the ThuThesis LaTeX template
thesis-writing-guide
Templates, formatting rules, and strategies for thesis and dissertation writing
thesis-template-guide
Set up LaTeX templates for PhD and Master's thesis documents
sjtuthesis-guide
Write SJTU theses using the SJTUThesis LaTeX template with full compliance
novathesis-guide
LaTeX thesis template supporting multiple universities and formats
graphical-abstract-guide
Create SVG graphical abstracts for journal paper submissions
beamer-presentation-guide
Guide to creating academic presentations with LaTeX Beamer
plagiarism-detection-guide
Use plagiarism detection tools and ensure manuscript originality
paper-polish-guide
Review and polish LaTeX research papers for clarity and style
grammar-checker-guide
Use grammar and style checking tools to polish academic manuscripts
conciseness-editing-guide
Eliminate wordiness and redundancy in academic prose for clarity
academic-translation-guide
Academic translation, post-editing, and Chinglish correction guide