pdf-analyzer
Extract text, tables, metadata, and structured data from PDF files. Use when a user asks to read a PDF, parse a PDF, extract data from a PDF, summarize a PDF document, pull tables from a PDF, or convert PDF content to structured formats like JSON or CSV. Handles single and multi-page documents, scanned PDFs, and PDFs with complex table layouts.
Best use case
pdf-analyzer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using pdf-analyzer should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/pdf-analyzer/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
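From the command line, the manual steps above amount to something like the following (the download location is illustrative; adjust it to wherever you saved SKILL.md):

```shell
# From your project root: create the directory the agent scans for skills
mkdir -p .claude/skills/pdf-analyzer
# Then move the downloaded file into place, e.g.:
# mv ~/Downloads/SKILL.md .claude/skills/pdf-analyzer/SKILL.md
```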
How pdf-analyzer Compares
| Feature / Agent | pdf-analyzer | Standard Approach |
|---|---|---|
| Platform Support | Claude Code, Cursor, Codex | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Low (single SKILL.md file) | N/A |
Frequently Asked Questions
What does this skill do?
It extracts text, tables, metadata, and structured data from PDF files and converts the content to structured formats such as JSON or CSV. It handles single and multi-page documents, scanned PDFs, and PDFs with complex table layouts.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# PDF Analyzer
## Overview
Extract text, tables, and structured data from PDF files and convert them into usable formats. This skill handles text extraction, table detection, metadata reading, and output formatting for single or multi-page PDFs.
## Instructions
When a user asks you to analyze, read, parse, or extract data from a PDF file, follow these steps:
### Step 1: Identify the PDF and goal
Determine the file path and what the user wants extracted:
- **Full text**: All readable text from every page
- **Tables**: Structured tabular data
- **Metadata**: Title, author, creation date, page count
- **Specific sections**: Targeted content from certain pages
- **Summary**: A condensed version of the document contents
### Step 2: Choose the extraction method
Write a Python script using one of these libraries (prefer pdfplumber for tables, PyMuPDF for speed):
**For text extraction:**
```python
import pdfplumber

def extract_text(pdf_path):
    text_by_page = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text()
            if text:
                text_by_page.append({"page": i + 1, "text": text.strip()})
    return text_by_page
```
**For table extraction:**
```python
import csv
import pdfplumber

def extract_tables(pdf_path, output_csv=None):
    all_tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            for table in page.extract_tables():
                if not table:  # skip empty detections
                    continue
                all_tables.append({
                    "page": i + 1,
                    "headers": table[0],  # first row assumed to be the header
                    "rows": table[1:],
                })
    if output_csv and all_tables:
        # Writes one combined CSV; assumes every table shares the
        # first table's headers. Write separate files if they differ.
        with open(output_csv, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(all_tables[0]["headers"])
            for table in all_tables:
                writer.writerows(table["rows"])
    return all_tables
```
**For metadata:**
```python
import pdfplumber

def extract_metadata(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        return {
            "pages": len(pdf.pages),
            "metadata": pdf.metadata,
        }
```
### Step 3: Run the script and format output
Execute the script, then present results in the format the user needs (plain text, JSON, CSV, markdown table, or summary).
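As a sketch of the formatting step, the page-level records returned by `extract_text` can be rendered with the standard library alone (the function name and the `fmt` values here are illustrative, not part of the skill's API):

```python
import json

def format_output(pages, fmt="json"):
    """Render [{"page": n, "text": ...}] records as JSON, markdown, or plain text."""
    if fmt == "json":
        return json.dumps(pages, indent=2, ensure_ascii=False)
    if fmt == "markdown":
        return "\n\n".join(f"## Page {p['page']}\n\n{p['text']}" for p in pages)
    # Default: plain text with visible page separators
    return "\n\n--- page break ---\n\n".join(p["text"] for p in pages)

sample = [{"page": 1, "text": "Hello"}, {"page": 2, "text": "World"}]
print(format_output(sample, "markdown"))
```

CSV output is covered by `extract_tables` above; this helper handles the text-oriented formats.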
### Step 4: Handle errors gracefully
If extraction fails, try these fallback approaches:
1. Switch from pdfplumber to PyMuPDF (fitz)
2. For scanned PDFs, suggest OCR with pytesseract
3. For encrypted PDFs, inform the user a password is needed
## Examples
### Example 1: Extract all tables from an invoice PDF
**User request:** "Pull the line items table from invoice.pdf and save as CSV"
**Actions taken:**
1. Open `invoice.pdf` with pdfplumber
2. Detect tables on each page
3. Extract headers and rows
4. Write to `invoice_items.csv`
**Output:**
```
Found 1 table on page 1 with 5 columns: Item, Quantity, Unit Price, Tax, Total
Extracted 12 rows. Saved to invoice_items.csv.
Preview:
| Item | Quantity | Unit Price | Tax | Total |
|-----------------|----------|------------|-------|---------|
| Widget A | 100 | $2.50 | $25.00| $275.00 |
| Widget B | 50 | $4.00 | $20.00| $220.00 |
```
### Example 2: Summarize a research paper
**User request:** "Summarize the key findings from paper.pdf"
**Actions taken:**
1. Extract full text from all pages
2. Identify abstract, introduction, results, and conclusion sections
3. Produce a structured summary
**Output:**
```
Document: paper.pdf (14 pages)
Title: Effects of Sleep Duration on Cognitive Performance
Authors: Smith et al., 2024
Key Findings:
- Participants sleeping < 6 hours showed 23% slower reaction times
- Optimal cognitive performance observed at 7-8 hours of sleep
- No significant benefit observed beyond 9 hours
Methods: Randomized controlled trial, n=240, over 12 weeks
Conclusion: 7-8 hours of sleep optimizes cognitive performance in adults
```
## Guidelines
- Always check if the PDF is readable before attempting extraction. Some PDFs are image-only and require OCR.
- For large PDFs (100+ pages), process in batches and show progress.
- When extracting tables, validate that column counts are consistent across rows. Merged cells often cause misalignment.
- Preserve the original page numbers in output so the user can cross-reference.
- If a PDF has both text and scanned pages, extract text where available and flag scanned pages for OCR.
- Never assume table headers. Always use the first row unless the user specifies otherwise.
- For multi-column layouts (academic papers), extract text in reading order, not left-to-right across columns.
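The column-count guideline above can be enforced with a small check run on each extracted table before writing CSV (a sketch; the `table` argument mirrors the `headers`/`rows` dict produced by `extract_tables`):

```python
def validate_table(table):
    """Return indices of rows whose column count differs from the header row.

    Merged cells in the source PDF typically surface here as short or long rows.
    """
    expected = len(table["headers"])
    return [i for i, row in enumerate(table["rows"]) if len(row) != expected]

# Example: row 1 lost a cell to a merged-cell artifact
bad = validate_table({"headers": ["Item", "Qty", "Total"],
                      "rows": [["A", "1", "$2"], ["B", "$4"]]})
```

Flag the returned rows to the user rather than silently writing a misaligned CSV.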