literature-statistics

Generate statistics for publication-year and journal distributions from local references or PDFs; use when you need standardized Year/Journal tables and a summary without any network access.

53 stars

by aipoch

Best use case

literature-statistics is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using literature-statistics should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/literature-statistics/SKILL.md --create-dirs "https://raw.githubusercontent.com/aipoch/medical-research-skills/main/scientific-skills/Other/literature-statistics/SKILL.md"

Manual Installation

1. Download SKILL.md from GitHub.
2. Place it in .claude/skills/literature-statistics/SKILL.md inside your project.
3. Restart your AI agent; it will auto-discover the skill.

How literature-statistics Compares

Feature / Agent	literature-statistics	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Generate statistics for publication-year and journal distributions from local references or PDFs; use when you need standardized Year/Journal tables and a summary without any network access.

Where can I find the source code?

The source code is in the aipoch/medical-research-skills repository on GitHub: https://github.com/aipoch/medical-research-skills

SKILL.md Source

> **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills)

## When to Use
- You have a batch of references and need a **publication year distribution** table (counts and percentages).
- You need a **journal distribution** table (Top N optional) for a literature review or report appendix.
- Your input is **pasted citations** (BibTeX/RIS/EndNote/plain text/mixed) and you want quick aggregation.
- Your input is **local reference files** (`.bib/.ris/.txt/.csv`) and you want consistent, standardized output.
- You have a **local PDF folder** and want to extract year/journal signals (best-effort) and summarize them.

## Key Features
- Supports multiple input types: pasted text, local reference files, and local PDF directories (via script).
- Extracts **Year** and **Journal** using format-specific parsing rules (BibTeX/RIS/plain text/PDF).
- Produces two standardized tables:
  - **Year distribution**: `year, count, percent`
  - **Journal distribution**: `journal title, count, percent`
- Provides a summary including totals and unknown-field counts (unknown year / unknown journal).
- Conservative extraction: does **not** guess when metadata is unclear; ambiguous items are counted as `unknown`.
- Local-only operation: no network calls, no external APIs, no credential usage.

## Dependencies
- Python **3.9+**
- Python packages pinned in `scripts/requirements.txt`; install them with `pip install -r scripts/requirements.txt`

## Example Usage
### 1) Process a local PDF directory
```bash
python scripts/process_pdfs.py --input-dir "./pdfs" --output "./literature_stats.md"
```

### 2) Process a local reference file (example pattern)
If your repository provides a CLI entry or script for reference files, run it similarly to the PDF script. For example:
```bash
python scripts/process_references.py --input "./refs/library.bib" --output "./literature_stats.md"
```

### 3) Expected output format (Markdown)
```md
## Summary
- Total processed: 120
- Unknown year: 7
- Unknown journal: 15

## Year Distribution
| Year | Count | Percent |
|------|-------|---------|
| 2022 | 22 | 18.3% |
| 2023 | 18 | 15.0% |
| ... | ... | ... |

## Journal Distribution
| Journal | Count | Percent |
|---------|-------|---------|
| Journal of X | 9 | 7.5% |
| ... | ... | ... |
```

For additional examples, see: `references/examples.md`.
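As a rough illustration of how rows could be turned into the Markdown tables shown above, here is a minimal sketch. The function name `render_table` and the `(name, count, percent)` row shape are assumptions made for this example; they are not taken from the repository's scripts.

```python
def render_table(header: str, rows: list[tuple[str, int, str]]) -> str:
    """Render (name, count, percent) rows as a Markdown table.

    Hypothetical helper: the row shape is an assumption for illustration.
    """
    lines = [f"| {header} | Count | Percent |", "|------|-------|---------|"]
    for name, count, percent in rows:
        # Escape stray pipes so a journal title cannot break the table layout.
        safe_name = name.replace("|", "\\|")
        lines.append(f"| {safe_name} | {count} | {percent} |")
    return "\n".join(lines)


print(render_table("Year", [("2022", 22, "18.3%"), ("2023", 18, "15.0%")]))
```

The same helper can be called with `"Journal"` as the header to produce the second table.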

## Implementation Details
### Processing Pipeline
1. Detect input type: pasted text / file path / PDF directory.
2. Read content from pasted text or local files.
3. Split into individual citations using format cues:
   - BibTeX entries
   - RIS records
   - blank-line separation for plain text/mixed inputs
4. Extract `year` and `journal` using the parsing rules below.
5. Normalize journal names using the normalization rules below.
6. Aggregate counts and compute percentages.
7. Output:
   - Table 1: Year distribution
   - Table 2: Journal distribution
   - Summary: totals + unknown counts
8. For PDF directories, use:
```bash
python scripts/process_pdfs.py --input-dir "<pdf_dir>" --output "<output_md>"
```
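Step 3 of the pipeline (splitting by format cues) might look like the sketch below. It assumes BibTeX entries start with `@type{` at the beginning of a line and RIS records end with an `ER  -` line; the function name `split_citations` is hypothetical, and the actual scripts may split differently.

```python
import re


def split_citations(text: str) -> list[str]:
    """Split pasted or file content into individual citation strings."""
    text = text.strip()
    if not text:
        return []
    if re.search(r"^[ \t]*@\w+[ \t]*\{", text, flags=re.MULTILINE):
        # BibTeX cue: split just before every line that opens an entry.
        parts = re.split(r"(?m)^(?=[ \t]*@\w+[ \t]*\{)", text)
    elif re.search(r"^ER[ \t]*-", text, flags=re.MULTILINE):
        # RIS cue: every record is terminated by an "ER  -" line.
        parts = re.split(r"(?m)^ER[ \t]*-.*$", text)
    else:
        # Plain text or mixed input: one citation per blank-line-separated block.
        parts = re.split(r"\n\s*\n", text)
    return [part.strip() for part in parts if part.strip()]
```

Checking the cues in this order means a file containing both BibTeX and plain-text citations is treated as BibTeX; truly mixed input would need per-block detection, which this sketch does not attempt.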

### Parsing Rules
#### BibTeX
- **Year**: `year` field
- **Journal**: `journal` field
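A minimal sketch of the BibTeX rule follows, assuming one field per line and values written as `{...}`, `"..."`, or a bare token. The helper names `bibtex_field` and `bibtex_year` are hypothetical; a full BibTeX parser would handle multi-line values and string macros, which this example does not.

```python
import re


def bibtex_field(entry: str, name: str) -> str:
    """Return the value of a single-line BibTeX field, or "unknown"."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(name)
        + r'[ \t]*=[ \t]*(?:\{(.*)\}|"(.*)"|(\w+))[ \t]*,?[ \t]*$',
        flags=re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(entry)
    if not match:
        return "unknown"
    # Exactly one of the three alternatives matched; take that group.
    value = next(group for group in match.groups() if group is not None).strip()
    return value or "unknown"


def bibtex_year(entry: str) -> str:
    """Accept the year only when it is exactly four digits; never guess."""
    raw = bibtex_field(entry, "year")
    return raw if re.fullmatch(r"\d{4}", raw) else "unknown"
```

For example, an entry containing `journal = {Journal of X},` and `year = 2022,` yields `"Journal of X"` and `"2022"`, while an entry with `year = {in press}` yields `"unknown"`.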

#### RIS
- **Year**: `PY` or `Y1` (use the first 4-digit year)
- **Journal**: first non-empty value among `JO` / `JF` / `T2`
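The RIS rule could be sketched as below. It assumes the conventional `TAG  - value` line layout and checks `PY` before `Y1` and `JO` before `JF` before `T2`; the function name `parse_ris_record` and the returned dictionary shape are illustrative assumptions.

```python
import re

# Two-character tag, two spaces, a hyphen, then the value.
_RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$", re.MULTILINE)
_FOUR_DIGITS = re.compile(r"(?<!\d)\d{4}(?!\d)")


def parse_ris_record(record: str) -> dict[str, str]:
    """Extract year and journal from one RIS record, defaulting to "unknown"."""
    tags: dict[str, list[str]] = {}
    for tag, value in _RIS_LINE.findall(record):
        tags.setdefault(tag, []).append(value.strip())

    # Year: first 4-digit run found in PY, then Y1.
    year_values = tags.get("PY", []) + tags.get("Y1", [])
    year_match = next(
        (m for m in (_FOUR_DIGITS.search(v) for v in year_values) if m), None
    )

    # Journal: first non-empty value among JO, JF, T2 (in that order).
    journal_values = tags.get("JO", []) + tags.get("JF", []) + tags.get("T2", [])
    journal = next((v for v in journal_values if v), "unknown")

    return {
        "year": year_match.group(0) if year_match else "unknown",
        "journal": journal,
    }
```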

#### Plain Text / Mixed Citations
- **Year**: first 4-digit year in the range **1900-2099** found near the end of the citation
- **Journal**: infer only when patterns are unambiguous (e.g., `Journal Name. 2022;` or `Journal Name, 2022`); otherwise set to `unknown`
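One way to read the plain-text rule is sketched here. Two assumptions are made that the source does not specify: "near the end" is taken to mean the last 80 characters (the hypothetical `tail_chars` parameter), and a journal is considered unambiguous only when exactly one candidate matches the `Journal Name. 2022;` or `Journal Name, 2022` patterns.

```python
import re

_YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")  # 1900-2099
_JOURNAL = re.compile(
    r"\.\s+([A-Z][^.,;]*?)(?:\.\s*(?:19|20)\d{2};|,\s*(?:19|20)\d{2}(?!\d))"
)


def plain_text_year(citation: str, tail_chars: int = 80) -> str:
    """First in-range 4-digit year that starts within the citation's tail."""
    cutoff = max(0, len(citation) - tail_chars)
    for match in _YEAR.finditer(citation):
        if match.start() >= cutoff:
            return match.group(0)
    return "unknown"


def plain_text_journal(citation: str) -> str:
    """Return the journal only when a single candidate matches; never guess."""
    candidates = {m.group(1).strip() for m in _JOURNAL.finditer(citation)}
    return candidates.pop() if len(candidates) == 1 else "unknown"
```

With this reading, `Smith J, Doe A. A study of things. Journal of X. 2022;15(3):100-110.` gives year `2022` and journal `Journal of X`, whereas a citation with two competing candidates falls back to `unknown`.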

#### PDF Directory (Script-Based)
- **Year**: prefer PDF metadata; otherwise use the first 4-digit year found on the first page
- **Journal**: prefer PDF metadata; otherwise scan first-page lines containing keywords such as `Journal`, `Proceedings`, or `Transactions`. If unclear, set to `unknown`.
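The PDF fallback logic can be sketched independently of any PDF library. The example below assumes the caller has already extracted a metadata dictionary and the first page's text; the keys `"year"` and `"journal"`, the 1900-2099 range for the page scan, and the "exactly one keyword line" test for clarity are all assumptions of this sketch, not documented behavior of `scripts/process_pdfs.py`.

```python
import re

KEYWORDS = ("Journal", "Proceedings", "Transactions")
_YEAR = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")


def pdf_year(metadata: dict, first_page_text: str) -> str:
    """Prefer a year from metadata; otherwise the first year on page one."""
    meta_match = _YEAR.search(str(metadata.get("year") or ""))
    if meta_match:
        return meta_match.group(0)
    page_match = _YEAR.search(first_page_text)
    return page_match.group(0) if page_match else "unknown"


def pdf_journal(metadata: dict, first_page_text: str) -> str:
    """Prefer metadata; otherwise accept a single keyword-bearing line."""
    meta_value = str(metadata.get("journal") or "").strip()
    if meta_value:
        return meta_value
    candidates = {
        line.strip()
        for line in first_page_text.splitlines()
        if any(keyword in line for keyword in KEYWORDS)
    }
    # More than one distinct candidate line is treated as unclear.
    return candidates.pop() if len(candidates) == 1 else "unknown"
```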

### Journal Normalization Rules
- Trim leading/trailing whitespace.
- Collapse multiple spaces into a single space.
- Remove trailing periods and commas.
- If casing is inconsistent, convert to **Title Case**; otherwise keep original casing.
- Do **not** expand abbreviations or infer aliases.
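The normalization rules above could be applied as in the following sketch. "Inconsistent casing" is interpreted here as the same title appearing with more than one casing across the batch, which is an assumption; the source does not define it. The function names are hypothetical, and `str.title()` is a simple Title Case that also capitalizes short words such as "of".

```python
import re
from collections import defaultdict


def clean_journal(name: str) -> str:
    """Trim, collapse whitespace, and drop trailing periods and commas."""
    cleaned = re.sub(r"\s+", " ", name.strip()).rstrip(" .,")
    return cleaned or "unknown"


def normalize_journals(names: list[str]) -> list[str]:
    """Clean each title; Title Case only titles whose casing varies in the batch."""
    cleaned = [clean_journal(name) for name in names]
    variants: dict[str, set[str]] = defaultdict(set)
    for title in cleaned:
        variants[title.casefold()].add(title)
    # No abbreviation expansion or alias inference is attempted.
    return [
        title.title() if len(variants[title.casefold()]) > 1 else title
        for title in cleaned
    ]
```

Under these assumptions, `["  Journal   of X. ", "JOURNAL OF X", "Nature,"]` becomes `["Journal Of X", "Journal Of X", "Nature"]`.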

### Failure Handling and Safety Constraints
- Do not guess missing/unclear year or journal values.
- Count ambiguous entries as `unknown` and report the totals in the summary.
- No network access; no external APIs; no credentials.
- Do not read files outside the user-provided paths.
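The constraint on staying inside user-provided paths can be enforced with a small containment check, sketched below using only `pathlib`. The helper names `is_within` and `list_pdfs` are hypothetical and are not part of the repository's scripts.

```python
from pathlib import Path


def is_within(base_dir: str, candidate: str) -> bool:
    """True when candidate resolves to base_dir or to a path beneath it."""
    base = Path(base_dir).resolve()
    target = Path(candidate).resolve()  # collapses ".." and follows symlinks
    return target == base or base in target.parents


def list_pdfs(base_dir: str) -> list[Path]:
    """List PDFs under base_dir, skipping entries that resolve outside it."""
    base = Path(base_dir).resolve()
    return sorted(
        path for path in base.rglob("*.pdf") if is_within(str(base), str(path))
    )
```

Resolving both paths before comparing them means that `..` segments and symlinks pointing elsewhere are rejected rather than followed.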

### Sorting and Reporting Requirements
- Tables are sorted by:
  1) `count` descending
  2) then by `name` ascending (year or journal title)
- Always report:
  - total processed count
  - unknown year count
  - unknown journal count
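The sorting and reporting requirements above can be sketched with `collections.Counter`. This example assumes percentages are computed against the total number of processed items (consistent with 18 of 120 being 15.0% in the sample output) and that `unknown` appears as an ordinary row; the function names are illustrative.

```python
from collections import Counter


def distribution(values: list[str]) -> list[tuple[str, int, str]]:
    """Count values, then sort by count descending and name ascending."""
    total = len(values)
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (name, count, f"{100 * count / total:.1f}%") for name, count in ordered
    ]


def summarize(years: list[str], journals: list[str]) -> dict[str, int]:
    """Totals that must always be reported alongside the tables."""
    return {
        "total": len(years),
        "unknown_year": years.count("unknown"),
        "unknown_journal": journals.count("unknown"),
    }
```

Negating the count in the sort key gives descending counts while keeping names in ascending order for ties, so the ordering is deterministic across runs.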

## When Not to Use

- Do not use this skill when the required source data, identifiers, files, or credentials are missing.
- Do not use this skill when the user asks for fabricated results, unsupported claims, or out-of-scope conclusions.
- Do not use this skill when a simpler direct answer is more appropriate than the documented workflow.

## Required Inputs

- A clearly specified task goal aligned with the documented scope.
- All required files, identifiers, parameters, or environment variables before execution.
- Any domain constraints, formatting requirements, and expected output destination if applicable.

## Recommended Workflow

1. Validate the request against the skill boundary and confirm all required inputs are present.
2. Select the documented execution path and prefer the simplest supported command or procedure.
3. Produce the expected output using the documented file format, schema, or narrative structure.
4. Run a final validation pass for completeness, consistency, and safety before returning the result.

## Output Contract

- Return a structured deliverable that is directly usable without reformatting.
- If a file is produced, prefer a deterministic output name such as `literature_statistics_result.md` unless the skill documentation defines a better convention.
- Include a short validation summary describing what was checked, what assumptions were made, and any remaining limitations.

## Validation and Safety Rules

- Validate required inputs before execution and stop early when mandatory fields or files are missing.
- Do not fabricate measurements, references, findings, or conclusions that are not supported by the provided source material.
- Emit a clear warning when credentials, privacy constraints, safety boundaries, or unsupported requests affect the result.
- Keep the output safe, reproducible, and within the documented scope at all times.

## Failure Handling

- If validation fails, explain the exact missing field, file, or parameter and show the minimum fix required.
- If an external dependency or script fails, surface the command path, likely cause, and the next recovery step.
- If partial output is returned, label it clearly and identify which checks could not be completed.

## Quick Validation

Run this minimal verification path before full execution when possible:

```bash
python scripts/process_pdfs.py --help
```

Expected output format of a completed run:

```text
Result file: literature_statistics_result.md
Validation summary: PASS/FAIL with brief notes
Assumptions: explicit list if any
```

Related Skills

literatureimages-interpretation

from aipoch/medical-research-skills

Interpret figures in academic papers and their captions when the input is a PDF-to-Markdown document with page markers and image links, producing a structured Markdown report for extracting variables, trends, and conclusions.

literature-management

from aipoch/medical-research-skills

Import local literature into a managed library; trigger when you need offline deduplication, tagging, and a searchable index.

literature-filtering

from aipoch/medical-research-skills

Filter literature by publication year, journal, and predefined screening rules to produce inclusion/exclusion lists; use when conducting preliminary screening or systematic review screening to narrow the literature scope.

literature-extensive-read

from aipoch/medical-research-skills

Rapidly skim and summarize academic papers (default: PDF-to-Markdown full text with `## Page XX` pagination and image references) and output a structured extensive-reading summary in Markdown when you need to quickly understand research questions, methods, key results, conclusions, and decide whether intensive reading is worthwhile.

literature-experiment-extract

from aipoch/medical-research-skills

Extract experimental models, experimental methods, and biomarker information from paper Markdown (typically produced by PDF-to-Markdown tools) when a user provides paper Markdown and needs a structured, evidence-backed summary (1 Markdown + 3 CSVs).

literature-close-read

from aipoch/medical-research-skills

Produce a structured close-reading report from a paper's full PDF-to-Markdown text (with `## Page XX` pagination and image references) when you need to systematically extract background, research questions, methods, results, limitations, and reproducible experimental details.

multi-database-literature-collector

from aipoch/medical-research-skills

Collects candidate biomedical literature across multiple databases, adapts search logic by database, preserves source metadata, and organizes results into a structured, screening-ready candidate pool. Always use this skill when a user wants cross-database literature collection, search strategy construction, candidate paper aggregation, or first-pass evidence organization before deduplication, screening, layered reading, or review planning. Requires real and verifiable literature records only. Every formal literature item must include a real link and DOI when available; never fabricate citations, titles, authors, years, journals, abstracts, PMIDs, or DOIs. If a DOI is unavailable or cannot be verified, state that explicitly rather than inventing one.

medical-research-literature-reader-pro

from aipoch/medical-research-skills

A medical-research-native literature reading skill for users with clinical, bioinformatics, translational, and basic experimental backgrounds. Use this skill whenever a user wants to read, analyze, critique, or interpret a medical or scientific paper — whether they provide a PDF, abstract, DOI, PMID, or just a title. Triggers include requests like "analyze this paper", "critique this study", "is this a strong paper?", "give me similar studies", "prepare me for journal club", "help me understand this bioinformatics paper", "what are the weaknesses here?", or "turn this into a mind map". Also activate for any downstream deliverables such as journal club kits, comparison tables, PI decision briefs, replication starters, or follow-up experiment designs. Do NOT treat as a generic summarizer — this skill performs structured evidence-type classification, track-specific critical appraisal, interpretation-boundary judgment, and research-grade follow-up generation.

skill-auditor

from aipoch/medical-research-skills

A comprehensive auditor for any agent skill — including Manus, OpenClaw/ClawHub, Claude, LobeHub, or custom SKILL.md-based skills. Use this skill whenever a user wants to evaluate, audit, review, score, or quality-check an agent skill before publishing, updating, or deploying. Covers two hard veto gates (structural redlines + research integrity redlines), static quality scoring across 25 criteria (ISO 25010 + OpenSSF + Agent), dynamic test input generation, multi-mode execution testing, multi-layer output evaluation with five specialized category rubrics (Evidence Insight / Protocol Design / Data Analysis / Academic Writing / Other), a Research Veto that applies to all four research categories, human eval viewer generation, actionable P0/P1/P2 optimization recommendations, and automatic skill improvement that outputs a polished, production-ready SKILL.md. Also use whenever a user says "audit my skill", "evaluate my skill", "improve my skill", or wants a corrected version after evaluation.

two-sample-mr-research-planner

from aipoch/medical-research-skills

Generates complete two-sample Mendelian randomization (MR) research designs from a user-provided research direction. Use when users want to design, plan, or build a study using two-sample MR to test causal relationships. Triggers: "design a two-sample MR study", "build a publishable MR paper", "test whether this biomarker causally affects this disease", "generate Lite/Standard/Advanced MR plans", "screen multiple exposures with MR", "bidirectional MR design", "causal inference using GWAS summary statistics", or "I want to study X and Y using MR". Always outputs four workload configurations (Lite / Standard / Advanced / Publication+) with a recommended primary plan, step-by-step workflow, figure plan, validation strategy, minimal executable version, and publication upgrade path.

research-proposal-generator

from aipoch/medical-research-skills

Generates a comprehensive research proposal design based on input literature, including hypothesis, mechanism verification, and budget. Use when the user wants to design a research project from a paper.

research-grants

from aipoch/medical-research-skills

Write competitive research proposals for NSF, NIH, DOE, DARPA, and Taiwan's NSTC when you need agency-compliant narratives, budgets, and review-criteria alignment for a specific solicitation/FOA/BAA.