Best use case
Table Extractor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
## Overview
Teams using Table Extractor should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/table-extractor/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How Table Extractor Compares
| Feature / Agent | Table Extractor | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
## Overview
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Table Extractor
## Overview
Extract tables from PDF documents with high accuracy using camelot-py. Handles complex table structures including merged cells, multi-line rows, spanning headers, and borderless tables. Outputs clean DataFrames that can be exported to CSV, Excel, or JSON.
## Instructions
When a user asks you to extract tables from a PDF, follow this process:
### Step 1: Install and verify dependencies
```bash
# Install camelot and its dependencies
pip install "camelot-py[base]" ghostscript opencv-python-headless pandas
# Verify ghostscript is available (required by camelot)
gs --version 2>/dev/null || echo "Install ghostscript: sudo apt install ghostscript"
```
If ghostscript is not available, fall back to pdfplumber:
```bash
pip install pdfplumber pandas
```
### Step 2: Inspect the PDF to locate tables
```python
import camelot
# Quick scan: how many tables are in the document?
tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")
print(f"Found {len(tables)} tables using lattice detection")
# If no tables found, try stream detection (for borderless tables)
if len(tables) == 0:
tables = camelot.read_pdf("document.pdf", pages="all", flavor="stream")
print(f"Found {len(tables)} tables using stream detection")
# Summary of each table
for i, table in enumerate(tables):
print(f"\nTable {i}: {table.shape[0]} rows x {table.shape[1]} cols (page {table.page})")
print(f"Accuracy: {table.accuracy:.1f}%")
print(table.df.head(3))
```
### Step 3: Choose the right extraction flavor
**Lattice flavor** (for tables with visible borders/gridlines):
```python
tables = camelot.read_pdf(
"document.pdf",
pages="1,2,3", # Specific pages
flavor="lattice",
line_scale=40, # Adjust line detection sensitivity
process_background=True # Detect lines on colored backgrounds
)
```
**Stream flavor** (for borderless tables, whitespace-separated):
```python
tables = camelot.read_pdf(
"document.pdf",
pages="1",
flavor="stream",
edge_tol=50, # Tolerance for edge detection
row_tol=10, # Tolerance for grouping text into rows
columns=["72,200,350,500"] # Manual column boundaries if auto-detect fails
)
```
### Step 4: Clean and process extracted tables
```python
import pandas as pd
for i, table in enumerate(tables):
df = table.df
# Promote first row to header if it contains column names
if df.iloc[0].str.match(r'^[A-Za-z]').all():
df.columns = df.iloc[0]
df = df[1:].reset_index(drop=True)
# Clean whitespace and newlines within cells
df = df.apply(lambda col: col.str.strip().str.replace(r'\n', ' ', regex=True))
# Remove completely empty rows
df = df.dropna(how='all').replace('', pd.NA).dropna(how='all')
# Convert numeric columns
for col in df.columns:
try:
df[col] = pd.to_numeric(df[col].str.replace(',', '').str.replace('$', ''))
except (ValueError, AttributeError):
pass # Keep as string
print(f"\nCleaned Table {i}:")
print(df.head())
```
### Step 5: Handle complex table structures
**Merged cells and spanning headers:**
```python
# Forward-fill merged cells (common in row headers)
df.iloc[:, 0] = df.iloc[:, 0].replace('', pd.NA).ffill()
# Handle multi-level column headers
if df.iloc[0:2].apply(lambda x: x.str.len().mean()).mean() < 20:
# Combine first two rows as multi-level header
new_cols = df.iloc[0] + " - " + df.iloc[1]
df.columns = new_cols.str.strip(" - ")
df = df[2:].reset_index(drop=True)
```
**Tables spanning multiple pages:**
```python
# Extract from all pages and concatenate
all_tables = camelot.read_pdf("document.pdf", pages="all", flavor="lattice")
# Group tables that are continuations (same column count)
groups = {}
for t in all_tables:
key = t.shape[1]
groups.setdefault(key, []).append(t.df)
for col_count, dfs in groups.items():
combined = pd.concat(dfs, ignore_index=True)
# Remove duplicate header rows that appear at page breaks
combined = combined[~combined.duplicated(keep='first')]
```
### Step 6: Export the results
```python
# CSV (one file per table)
for i, table in enumerate(tables):
table.df.to_csv(f"table_{i+1}.csv", index=False)
# Excel (all tables as separate sheets)
with pd.ExcelWriter("extracted_tables.xlsx") as writer:
for i, table in enumerate(tables):
table.df.to_excel(writer, sheet_name=f"Table_{i+1}", index=False)
# JSON
for i, table in enumerate(tables):
table.df.to_json(f"table_{i+1}.json", orient="records", indent=2)
print(f"Exported {len(tables)} tables")
```
## Examples
### Example 1: Extract financial tables from an annual report
**User request:** "Extract all tables from this annual report PDF"
**Actions:**
1. Scan all pages with lattice flavor (financial reports typically have bordered tables)
2. Identify income statement, balance sheet, and cash flow tables by column headers
3. Clean numeric values (remove $, commas, parentheses for negatives)
4. Export each table to a separate CSV and combine into one Excel workbook
**Output:** "Extracted 7 tables across 42 pages. Exported to extracted_tables.xlsx with sheets: Income_Statement, Balance_Sheet, Cash_Flow, Revenue_Breakdown, Expenses, Quarterly_Summary, KPIs."
### Example 2: Extract a specific table from a research paper
**User request:** "Get the results table from page 8 of this paper"
**Actions:**
1. Target page 8 specifically: `camelot.read_pdf("paper.pdf", pages="8")`
2. If multiple tables on the page, show summaries and let the user pick
3. Clean the extracted table and handle any multi-line cells
4. Export as CSV
**Output:** A single CSV file with the results table, plus a preview of the first few rows printed to the console.
### Example 3: Batch process multiple PDFs
**User request:** "Extract the summary table from each of these 20 monthly reports"
**Actions:**
```python
import glob
results = []
for pdf_path in sorted(glob.glob("reports/*.pdf")):
tables = camelot.read_pdf(pdf_path, pages="1", flavor="lattice")
if tables:
df = tables[0].df # First table on first page
df["source_file"] = pdf_path
results.append(df)
combined = pd.concat(results, ignore_index=True)
combined.to_csv("all_summaries.csv", index=False)
```
**Output:** A single CSV combining the summary table from all 20 reports with a source_file column for traceability.
## Guidelines
- Always try `lattice` flavor first (bordered tables). Fall back to `stream` for borderless tables.
- Check the `accuracy` score on each table. Below 80% indicates extraction issues that need manual review.
- For scanned PDFs, run OCR first (e.g., `ocrmypdf`) before table extraction.
- When camelot struggles, try pdfplumber as an alternative: `page.extract_table(table_settings={...})`.
- Clean numeric data aggressively: remove currency symbols, commas, and handle parenthesized negatives.
- For tables with merged cells, use forward-fill on the appropriate columns.
- When extracting from multiple pages, watch for repeated header rows at page breaks.
- Always preview the extracted data before exporting to catch alignment or parsing issues.
- Report extraction quality metrics (accuracy, row/column count) so the user can verify correctness.Related Skills
predictable-revenue
Build a scalable outbound B2B sales process with specialized roles (SDR, AE, CSM). Use when the user mentions "outbound sales", "Cold Calling 2.0", "prospecting emails", "sales pipeline", "SDR process", or "B2B SaaS sales". Covers lead generation, qualification frameworks, and separating prospecting from closing. For offer design, see hundred-million-offers. For persuasion science, see influence-psychology. Trigger with 'predictable', 'revenue'.
pivot-table-creator
Pivot Table Creator - Auto-activating skill for Data Analytics. Triggers on: pivot table creator, pivot table creator Part of the Data Analytics skill category.
metadata-extractor
Metadata Extractor - Auto-activating skill for Data Pipelines. Triggers on: metadata extractor, metadata extractor Part of the Data Pipelines skill category.
dynamodb-table-designer
Dynamodb Table Designer - Auto-activating skill for AWS Skills. Triggers on: dynamodb table designer, dynamodb table designer Part of the AWS Skills skill category.
bigquery-table-creator
Bigquery Table Creator - Auto-activating skill for GCP Skills. Triggers on: bigquery table creator, bigquery table creator Part of the GCP Skills skill category.
create-adaptable-composable
Create a library-grade Vue composable that accepts maybe-reactive inputs (MaybeRef / MaybeRefOrGetter) so callers can pass a plain value, ref, or getter. Normalize inputs with toValue()/toRef() inside reactive effects (watch/watchEffect) to keep behavior predictable and reactive. Use this skill when user asks for creating adaptable or reusable composables.
azure-data-tables-py
Azure Tables SDK for Python (Storage and Cosmos DB). Use for NoSQL key-value storage, entity CRUD, and batch operations. Triggers: "table storage", "TableServiceClient", "TableClient", "entities", "PartitionKey", "RowKey".
azure-data-tables-java
Build table storage applications with Azure Tables SDK for Java. Use when working with Azure Table Storage or Cosmos DB Table API for NoSQL key-value data, schemaless storage, or structured data at scale.
airtable-automation
Automate Airtable tasks via Rube MCP (Composio): records, bases, tables, fields, views. Always search tools first for current schemas.
epub-chapter-extractor
Extract all chapters from an EPUB file into separate markdown files. Use when the user wants to split an EPUB into individual chapter files, extract EPUB chapters, or convert an ebook to separate markdown documents.
TanStack Table
## Overview
NocoDB — Open-Source Airtable Alternative
## Overview