institutional-repository-guide
Access papers from institutional and subject repositories at scale
Best use case
institutional-repository-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Access papers from institutional and subject repositories at scale
Teams using institutional-repository-guide should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/institutional-repository-guide/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How institutional-repository-guide Compares
| Feature / Agent | institutional-repository-guide | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Access papers from institutional and subject repositories at scale
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Institutional Repository Guide
Institutional repositories (IRs) are university-run digital archives that store and provide open access to their researchers' scholarly output — dissertations, journal articles, conference papers, datasets, and technical reports. Subject repositories like arXiv, bioRxiv, SSRN, and RePEc serve similar functions for specific disciplines. Together, they form a distributed network of open scholarship that complements commercial databases.
This guide covers how to discover, access, and systematically harvest content from institutional and subject repositories for literature reviews, meta-analyses, and research data collection.
## Repository Landscape
### Types of Repositories
```
Institutional Repositories (IR):
- Run by universities to archive their researchers' output
- Examples: DSpace, EPrints, Fedora-based systems
- Discovery: OpenDOAR directory (v2.sherpa.ac.uk/opendoar)
Subject Repositories:
- Discipline-specific archives
- arXiv (physics, CS, math), bioRxiv, SSRN, RePEc, EarthArXiv
Aggregators:
- Harvest from many repositories into a single search interface
- BASE (Bielefeld Academic Search Engine)
- CORE (core.ac.uk, 200M+ open access articles)
- OpenAIRE (European research output)
```
### Discovering Repositories
OpenDOAR (Directory of Open Access Repositories) is the primary registry for finding institutional repositories:
```python
import urllib.request
import json
def search_opendoar(subject: str = None, country: str = None) -> list:
"""
Search the OpenDOAR registry for institutional repositories.
Args:
subject: Filter by subject area (e.g., "Biology", "Computer Science")
country: ISO country code (e.g., "US", "GB", "CN")
"""
base_url = "https://v2.sherpa.ac.uk/cgi/retrieve"
params = "?item-type=repository&format=Json"
if subject:
params += f"&filter=[[\"{subject}\",\"subject\"]]"
if country:
params += f"&filter=[[\"{country}\",\"country\"]]"
req = urllib.request.Request(base_url + params)
response = urllib.request.urlopen(req)
data = json.loads(response.read())
repositories = []
for item in data.get("items", []):
repo_info = {
"name": item.get("repository_metadata", {}).get("name", [{}])[0].get("name", ""),
"url": item.get("repository_metadata", {}).get("url", ""),
"oai_url": item.get("repository_metadata", {}).get("oai_url", ""),
"software": item.get("repository_metadata", {}).get("software", {}).get("name", ""),
"type": item.get("repository_metadata", {}).get("type", "")
}
repositories.append(repo_info)
return repositories
```
## OAI-PMH Harvesting from Repositories
Most institutional repositories support OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), the standard protocol for metadata exchange:
```python
import xml.etree.ElementTree as ET
import urllib.request
def harvest_repository(base_url: str, metadata_prefix: str = "oai_dc",
set_spec: str = None, from_date: str = None) -> list:
"""
Harvest metadata records from a repository's OAI-PMH endpoint.
Args:
base_url: The OAI-PMH base URL
metadata_prefix: Metadata format (oai_dc, datacite, mets)
set_spec: Optional set/collection to restrict harvesting
from_date: Harvest only records added after this date (YYYY-MM-DD)
"""
params = f"?verb=ListRecords&metadataPrefix={metadata_prefix}"
if set_spec:
params += f"&set={set_spec}"
if from_date:
params += f"&from={from_date}"
url = base_url + params
records = []
while url:
response = urllib.request.urlopen(url)
tree = ET.parse(response)
root = tree.getroot()
ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in root.findall(".//oai:record", ns):
header = record.find("oai:header", ns)
identifier = header.find("oai:identifier", ns).text
datestamp = header.find("oai:datestamp", ns).text
records.append({"identifier": identifier, "datestamp": datestamp})
token_elem = root.find(".//oai:resumptionToken", ns)
if token_elem is not None and token_elem.text:
url = f"{base_url}?verb=ListRecords&resumptionToken={token_elem.text}"
else:
url = None
return records
```
### Key OAI-PMH Verbs
| Verb | Purpose |
|------|---------|
| `Identify` | Get repository name, admin email, policies |
| `ListSets` | List available collections/sets |
| `ListMetadataFormats` | List supported metadata schemas |
| `ListIdentifiers` | Lightweight listing of record headers |
| `ListRecords` | Full metadata records with pagination |
| `GetRecord` | Retrieve a single record by identifier |
## Major Repository Platforms
### DSpace
The most widely deployed open-source repository platform (used by ~40% of repositories worldwide):
- OAI-PMH endpoint: `{base-url}/oai/request`
- REST API: `{base-url}/server/api`
- Supports Dublin Core, METS, and custom metadata schemas
- Examples: MIT DSpace, University of Cambridge Repository
### EPrints
Popular in the UK and Europe:
- OAI-PMH endpoint: `{base-url}/cgi/oai2`
- REST API: `{base-url}/cgi/export/{id}/{format}`
- Strong support for research output types (articles, theses, conference items)
- Examples: University of Southampton EPrints
### Fedora / Islandora
Used by larger institutions with complex digital collections:
- Typically paired with a discovery layer (Solr/Blacklight)
- Strong support for digital preservation workflows
- Examples: University of Toronto, Smithsonian Institution
## Building a Harvesting Pipeline
### Systematic Collection Workflow
```
1. Identify target repositories
- Use OpenDOAR to find IRs by subject or country
- List subject repositories relevant to your discipline
2. Test endpoints
- Send Identify request to verify the endpoint is active
- Check ListMetadataFormats for available schemas
3. Harvest incrementally
- Use "from" parameter to harvest only new records
- Store last harvest date for each repository
- Respect rate limits (typically 1 request per second)
4. Deduplicate
- Match records by DOI when available
- Use title + author fuzzy matching for records without DOIs
- Flag duplicates rather than deleting (keep provenance)
5. Store and index
- Save metadata in structured format (JSON, SQLite, CSV)
- Build a local search index for efficient retrieval
```
## Ethical Considerations
- Always respect `robots.txt` and repository rate limits
- Metadata harvesting is generally permitted; bulk full-text download may require permission
- Check each repository's terms of use before harvesting
- Use harvested data for research purposes, not commercial redistribution
- Attribute the source repository in publications using harvested data
- Consider reaching out to repository administrators for large-scale harvesting projects
## References
- OpenDOAR: https://v2.sherpa.ac.uk/opendoar/
- OAI-PMH specification: http://www.openarchives.org/OAI/openarchivesprotocol.html
- CORE: https://core.ac.uk
- BASE: https://www.base-search.net
- DSpace documentation: https://wiki.lyrasis.org/display/DSPACERelated Skills
thuthesis-guide
Write Tsinghua University theses using the ThuThesis LaTeX template
thesis-writing-guide
Templates, formatting rules, and strategies for thesis and dissertation writing
thesis-template-guide
Set up LaTeX templates for PhD and Master's thesis documents
sjtuthesis-guide
Write SJTU theses using the SJTUThesis LaTeX template with full compliance
novathesis-guide
LaTeX thesis template supporting multiple universities and formats
graphical-abstract-guide
Create SVG graphical abstracts for journal paper submissions
beamer-presentation-guide
Guide to creating academic presentations with LaTeX Beamer
plagiarism-detection-guide
Use plagiarism detection tools and ensure manuscript originality
paper-polish-guide
Review and polish LaTeX research papers for clarity and style
grammar-checker-guide
Use grammar and style checking tools to polish academic manuscripts
conciseness-editing-guide
Eliminate wordiness and redundancy in academic prose for clarity
academic-translation-guide
Academic translation, post-editing, and Chinglish correction guide