pdf-analyzer
Analyze PDF, DOCX, and spreadsheet documents using vision models. Converts documents to images and extracts insights with layout preservation. Uses VT Code's native document processor (no container skills required).
Best use case
pdf-analyzer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using pdf-analyzer should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/pdf-analyzer/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How pdf-analyzer Compares
| Feature / Agent | pdf-analyzer | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Analyze PDF, DOCX, and spreadsheet documents using vision models. Converts documents to images and extracts insights with layout preservation. Uses VT Code's native document processor (no container skills required).
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# PDF/DOCX/Spreadsheet Analyzer
## Overview
This skill provides OpenAI-style document processing capabilities, converting PDFs, Word documents, and spreadsheets into rendered images for vision model analysis. This approach preserves layout, formatting, tables, and visual elements that would be lost in traditional text extraction.
**Inspired by:** OpenAI's implementation in ChatGPT's Code Interpreter (per Simon Willison's blog post, Dec 2025)
## Supported Formats
- **PDF**: Multi-page documents, converted page-by-page to PNGs
- **DOCX/DOC**: Word documents with full formatting preservation
- **XLSX/XLS/CSV**: Spreadsheets rendered as visual tables
- **Images**: Direct analysis (PNG, JPG, JPEG, GIF, BMP, TIFF)
## Usage in VT Code
### Basic Document Analysis
```bash
# List available skills
vtcode list-skills
# Load the skill
vtcode skills load pdf-analyzer
# In VT Code chat:
/skills load pdf-analyzer
```
### Analyzing Documents
```
User: Analyze this PDF report and summarize the key findings
Assistant: I'll use the PDF analyzer skill to process this document with vision-based analysis.
[Skill processes document, converts to images, analyzes with vision model]
The document contains:
1. Executive summary on page 1 highlighting Q3 revenue growth
2. Detailed financial tables on pages 3-4 showing 15% YoY growth
3. Charts illustrating market expansion opportunities
4. Risk assessment section on page 7
Key insights:
- Revenue increased from $2.1M to $2.4M (14.3% growth)
- New customer acquisition up 23%
- Challenges noted in supply chain efficiency
```
### Processing Spreadsheets
```
User: Review this Excel file and identify any anomalies in the sales data
Assistant: Let me analyze the spreadsheet using vision-based processing to preserve the table structure and formatting.
[Skill renders spreadsheet as image, detects structure, analyzes values]
Findings:
- Row 42: Sales amount of $50,000 with no corresponding customer ID (potential data entry error)
- Column E: Several negative values that may indicate returns or refunds
- Week of 2025-03-15: Unusual 300% spike in orders (verify with sales team)
- Conditional formatting shows 12 overdue payments highlighted in red
```
### Document Comparison
```
User: Compare these two contract versions and highlight the changes
Assistant: I'll process both documents using vision analysis to detect textual and formatting changes.
[Skill processes both documents, compares rendered pages]
Differences detected:
- Section 3.2: Payment terms changed from "Net 30" to "Net 15"
- Clause 7.1: Liability cap increased from $100k to $250k
- Page 5: New confidentiality provision added
- Header formatting: Company logo position adjusted
```
## Technical Implementation
### Architecture
```
Document → Renderer → PNG Images → Vision Model → Analysis → Insights
```
1. **Document Loading**: Supports PDF, DOCX, XLSX, CSV, and image formats
2. **Rendering**: Converts pages/sheets to high-quality PNG images (150 DPI default)
3. **Vision Processing**: Sends images to vision-enabled LLM for analysis
4. **Text Extraction**: Optionally runs OCR for searchable text layer
5. **Insight Generation**: Combines visual analysis with extracted content
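The five steps above can be sketched as a small pipeline. This is a minimal illustration, not VT Code's actual implementation: `analyze_document`, `PageResult`, and the three callables (`render_pages`, `analyze_image`, `run_ocr`) are hypothetical names standing in for the renderer, the vision model call, and the optional OCR pass.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PageResult:
    page_number: int                 # 1-based, so analysis can cite pages
    analysis: str                    # what the vision model said about the page
    ocr_text: Optional[str] = None   # filled only when OCR is enabled


def analyze_document(
    path: str,
    render_pages: Callable[[str, int], List[bytes]],  # hypothetical renderer: (path, dpi) -> PNG bytes per page
    analyze_image: Callable[[bytes], str],            # hypothetical vision model call
    run_ocr: Optional[Callable[[bytes], str]] = None, # hypothetical OCR step
    dpi: int = 150,
) -> List[PageResult]:
    """Render a document to page images, then analyze each image in order."""
    results = []
    for number, png in enumerate(render_pages(path, dpi), start=1):
        text = run_ocr(png) if run_ocr is not None else None
        results.append(PageResult(number, analyze_image(png), text))
    return results
```

Passing the renderer and model in as callables keeps the sketch independent of any particular rendering library or provider SDK.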
### Key Features
**Layout Preservation:**
- Tables maintain row/column structure
- Charts and graphs remain visually intact
- Formatting (bold, italic, colors) is preserved
- Page structure and flow are maintained
**Multi-page Processing:**
- Automatic pagination for long documents
- Cross-page context awareness
- Page number references in analysis
**Vision Model Integration:**
- Sends rendered images to Gemini/Claude/GPT-4 vision models
- Maintains spatial relationships in analysis
- Extracts both textual and visual information
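As an illustration of handing a rendered page to a vision model: many vision APIs accept images as base64 strings or data URLs, but the exact request shape is provider-specific and is not described in this skill. The helper name `png_to_data_url` is hypothetical.

```python
import base64


def png_to_data_url(png_bytes: bytes) -> str:
    """Encode PNG bytes as a data URL (RFC 2397 style)."""
    encoded = base64.b64encode(png_bytes).decode("ascii")
    return f"data:image/png;base64,{encoded}"
```

Base64 inflates the payload by roughly a third, which is one reason image size matters for request cost.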
### Token Efficiency
- Images compressed to optimal size for vision models
- Smart page selection for long documents
- Caching of rendered pages
- Configurable DPI (75/150/300) based on detail requirements
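To see why the DPI setting matters, here is a quick calculation of rendered page size. Pixel dimensions are page size in inches multiplied by DPI; the US Letter page (8.5 x 11 in) and the function name `rendered_size` are assumptions chosen for the example.

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Return (width, height) in pixels for a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (75, 150, 300):
    width, height = rendered_size(8.5, 11.0, dpi)
    print(f"{dpi} DPI -> {width} x {height} px ({width * height / 1e6:.1f} MP)")
```

At 150 DPI a Letter page is 1275 x 1650 px. Doubling the DPI quadruples the pixel count, so 300 DPI is best reserved for documents with fine print or dense tables.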
## Configuration
### Environment Variables
```bash
# Document processing
export VT_CODE_DOC_DPI=150 # Rendering DPI (default: 150)
export VT_CODE_DOC_MAX_PAGES=50 # Max pages per document (default: 50)
export VT_CODE_DOC_ENABLE_OCR=true # Enable OCR fallback (default: true)
# Vision model selection
export VT_CODE_VISION_MODEL=gemini-2.5-pro # Model for document analysis
```
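For scripts that wrap the skill, the same variables can be read with matching defaults. This is a sketch only: `DocSettings` and `load_settings` are hypothetical names, and the boolean parsing (only the string `true` enables OCR) is an assumption, since this document does not say how VT Code itself parses these values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class DocSettings:
    dpi: int = 150
    max_pages: int = 50
    enable_ocr: bool = True
    vision_model: Optional[str] = None  # None means "use the tool's default"


def load_settings(env: Mapping[str, str] = os.environ) -> DocSettings:
    """Read the documented variables, falling back to the documented defaults."""
    return DocSettings(
        dpi=int(env.get("VT_CODE_DOC_DPI", "150")),
        max_pages=int(env.get("VT_CODE_DOC_MAX_PAGES", "50")),
        # Assumption: case-insensitive "true" enables OCR, anything else disables it.
        enable_ocr=env.get("VT_CODE_DOC_ENABLE_OCR", "true").strip().lower() == "true",
        vision_model=env.get("VT_CODE_VISION_MODEL"),
    )
```

Accepting the environment as a mapping makes the function easy to exercise with a plain dictionary.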
### VTCode Configuration
```toml
[skills.pdf-analyzer]
enabled = true
dpi = 150
max_pages = 50
enable_ocr = true
vision_model = "gemini-2.5-pro"
```
## Examples
### Example 1: Financial Report Analysis
```
User: Analyze Q3_Financial_Report.pdf and give me the key metrics
Assistant: Processing the financial report using vision-based analysis...
Key Metrics Extracted:
• Total Revenue: $2.4M (Q3 2025)
• Gross Margin: 68%
• Operating Expenses: $1.2M
• Net Income: $432K (18% margin)
• Cash Flow: +$156K from operations
Trends:
• Revenue up 14.3% vs Q2
• Margin improvement of 2.1 percentage points
• Customer acquisition cost decreased 8%
[Tables and charts referenced by page number]
```
### Example 2: Contract Review
```
User: Review this NDA and highlight any unusual clauses
Assistant: Analyzing the Non-Disclosure Agreement...
Contract Review Summary:
✓ Standard confidentiality provisions
✓ Reasonable duration (2 years)
⚠️ Unusual clause found: Section 5.3 - "Perpetual confidentiality for trade secrets"
⚠️ Geographic scope: Worldwide (typical, but confirm if appropriate)
✓ Standard exclusions for public information
Recommendation: The perpetual trade secret clause is enforceable but consider:
- Defining "trade secrets" more specifically
- Adding a severability provision
```
### Example 3: Research Paper Analysis
```
User: Summarize this research paper and extract the methodology
Assistant: Processing the research paper with vision analysis...
Paper Summary:
"Advanced Techniques in Machine Learning Optimization"
Authors: Smith et al. (2025)
Abstract Summary:
The paper presents a novel optimization algorithm that improves convergence
speed by 40% compared to Adam optimizer...
Methodology (from pages 3-5):
1. Dataset: CIFAR-10, ImageNet, and custom benchmarks
2. Architecture: ResNet-50, ViT-B/16
3. Hyperparameters: Learning rate 1e-3, batch size 256
4. Novel contribution: Adaptive momentum scheduling
Results (pages 6-7):
• 40% faster convergence vs baseline
• 2.3% accuracy improvement on ImageNet
• Robust to hyperparameter variations
[Referenced figures: Fig 2 (convergence curve), Fig 4 (accuracy comparison)]
```
## Advanced Usage
### Batch Processing
```bash
# Process multiple documents
for doc in *.pdf; do
  vtcode ask "Analyze $doc and save summary to ${doc%.pdf}_summary.txt"
done
```
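The shell loop above runs one document at a time. A parallel variant might look like the sketch below. It reuses the `vtcode ask` invocation exactly as shown in this document; `build_command`, `run_batch`, and the worker count of 4 are hypothetical, and whether concurrent `vtcode` processes are safe or rate-limited is not covered here.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def build_command(doc: Path) -> List[str]:
    """Build the same `vtcode ask` call as the shell loop, for one document."""
    summary = doc.with_name(doc.stem + "_summary.txt")
    return ["vtcode", "ask", f"Analyze {doc} and save summary to {summary}"]


def run_batch(folder: str, workers: int = 4) -> List[int]:
    """Run one vtcode process per PDF in `folder`; return the exit codes in order."""
    docs = sorted(Path(folder).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        codes = pool.map(lambda d: subprocess.run(build_command(d)).returncode, docs)
        return list(codes)
```

Threads are sufficient here because each worker only waits on an external process. `run_batch` raises `FileNotFoundError` if `vtcode` is not on the `PATH`.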
### Integration with CLI
```bash
# Process document and extract specific information
vtcode skills load pdf-analyzer
vtcode ask "Extract all email addresses from contract.pdf"
```
### Custom Vision Prompts
```
User: Analyze this document focusing only on financial data and risks
Assistant: I'll use a custom vision prompt to focus specifically on financial information and risk factors...
[Custom prompt guides model to focus on specific elements]
```
## Troubleshooting
### Common Issues
**Document too large:**
```
Error: Document exceeds maximum page limit (50 pages)
Solution: Split document or increase VT_CODE_DOC_MAX_PAGES
```
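If splitting is the chosen route, the page ranges can be planned ahead of time. The helper below is a hypothetical sketch that only computes 1-based inclusive ranges; the actual splitting would need a separate PDF tool.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, max_pages: int = 50) -> List[Tuple[int, int]]:
    """Split pages 1..total_pages into inclusive ranges of at most max_pages."""
    return [
        (start, min(start + max_pages - 1, total_pages))
        for start in range(1, total_pages + 1, max_pages)
    ]


print(page_chunks(120))  # [(1, 50), (51, 100), (101, 120)]
```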
**Vision model unavailable:**
```
Error: Vision model not configured
Solution: Set VT_CODE_VISION_MODEL or use default Gemini vision model
```
**Rendering fails:**
```
Error: Unable to render document
Solution: Verify file format is supported and not corrupted
```
### Debug Mode
```bash
export VT_CODE_DEBUG_DOCUMENTS=true
vtcode skills load pdf-analyzer
vtcode ask "Analyze document.pdf"
# Will show detailed rendering and processing logs
```
## Performance Tips
1. **Use appropriate DPI**: 150 DPI is good for most documents
2. **Limit page range**: Process only relevant pages for large documents
3. **Batch processing**: Process multiple documents in parallel
4. **Cache results**: Reuse processed documents when possible
5. **Choose efficient vision models**: Gemini 2.5 Flash for speed, Pro for detail
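For tip 4, one way to decide whether a rendered page can be reused is to key the cache on the document's content and the rendering DPI. This is a sketch of the idea only; `render_cache_key` is a hypothetical name, and how VT Code caches rendered pages internally is not described in this document.

```python
import hashlib


def render_cache_key(document_bytes: bytes, dpi: int) -> str:
    """Derive a cache key that changes when the content or the DPI changes."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"{digest}-{dpi}dpi"
```

Hashing the content rather than the filename means a renamed file still hits the cache, while an edited file does not.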
## Security Considerations
- Documents are rendered locally; the rendering step does not send files to external services
- Rendered page images are sent to the configured vision model through standard LLM APIs with encryption
- Temporary files cleaned up after processing
- No document content stored permanently
## License
MIT License - See VTCode main repository for details.
Related Skills
excel-field-analyzer
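Analyzes Excel/CSV field structure, uses AI to generate Chinese-English field mappings automatically, validates translation quality, and outputs statistical reports. Use for spreadsheet analysis, data dictionary creation, and field mapping.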
ab-testing-analyzer
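A comprehensive A/B testing analysis tool that supports experiment design, statistical testing, user segmentation analysis, and visual report generation. Use it to analyze A/B test results for product redesigns, marketing campaigns, and feature optimizations, with statistical significance testing and in-depth insights.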
video-analyzer
Intelligently analyzes Bilibili/YouTube/local videos and generates transcripts, evaluations, and summaries. Supports automatic embedding of keyframe screenshots.
edu-video-analyzer
Analyze educational YouTube channels for classroom adoption potential, curriculum alignment, and pedagogical effectiveness. Use when comparing educational video content (like MRU vs Crash Course), evaluating teaching methodologies, identifying content gaps for course design, or developing educational video strategy focused on student learning outcomes rather than monetization.
blog-voice-analyzer
Run the AI Voice Analyzer on blog content to detect AI-sounding patterns and get actionable rewrite suggestions. Use when reviewing or improving blog articles before publishing.
ecommerce-competitor-analyzer
Multi-platform e-commerce competitor analysis skill that automatically scrapes product data from Amazon, Temu, Shopee and generates comprehensive analysis reports using AI. Use when you need to analyze competitor products, extract product insights, or batch analyze multiple product listings. Supports bulk processing with structured outputs including title, price, rating, reviews, and strategic analysis.
api-spec-analyzer
Analyzes API documentation from OpenAPI specs to provide TypeScript interfaces, request/response formats, and implementation guidance. Use when implementing API integrations, debugging API errors (400, 401, 404), replacing mock APIs, verifying data types, or when user mentions endpoints, API calls, or backend integration.
api-schema-analyzer
Analyze OpenAPI and Postman schemas for MCP tool generation. Use when analyzing API specifications, extracting endpoint information, generating tool signatures, or when user mentions OpenAPI, Swagger, API schema, endpoint analysis.
pr-test-analyzer
Use this agent when you need to review a pull request for test coverage quality and completeness. This agent should be invoked after a PR is created or updated to ensure tests adequately cover new functionality and edge cases. Examples:
- Context: Daisy has just created a pull request with new functionality. User: "I've created the PR. Can you check if the tests are thorough?" Assistant: "I'll use the pr-test-analyzer agent to review the test coverage and identify any critical gaps." Commentary: Since Daisy is asking about test thoroughness in a PR, use the Task tool to launch the pr-test-analyzer agent.
- Context: A pull request has been updated with new code changes. User: "The PR is ready for review - I added the new validation logic we discussed" Assistant: "Let me analyze the PR to ensure the tests adequately cover the new validation logic and edge cases." Commentary: The PR has new functionality that needs test coverage analysis, so use the pr-test-analyzer agent.
- Context: Reviewing PR feedback before marking as ready. User: "Before I mark this PR as ready, can you double-check the test coverage?" Assistant: "I'll use the pr-test-analyzer agent to thoroughly review the test coverage and identify any critical gaps before you mark it ready." Commentary: Daisy wants a final test coverage check before marking PR ready, use the pr-test-analyzer agent.
analyzer-agent
Static analysis, code quality checks, and security scanning agent
ai-transcript-analyzer
Analyze transcript files using OpenAI API (gpt-5-mini) to extract insights, summaries, key topics, quotes, and action items. This skill should be used when users have transcript files (from WhisperKit, YouTube, podcasts, meetings, etc.) and want AI-powered analysis, summaries, or custom insights extracted from the content. Supports both default comprehensive analysis and custom prompts for specific information extraction.
agent-chain-analyzer
Detects and analyzes agent chain anti-patterns where agents invoke other agents sequentially causing massive context load