deepseek-ocr
Expert skill for using DeepSeek-OCR, a vision-language model for optical character recognition with context optical compression. It supports documents, PDFs, and images.
Best use case
deepseek-ocr is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using deepseek-ocr should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in `.claude/skills/deepseek-ocr/SKILL.md` inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
# DeepSeek-OCR
> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.
DeepSeek-OCR is a vision-language model for Optical Character Recognition with "Contexts Optical Compression." It supports native and dynamic resolutions, multiple prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and can be run via vLLM (high-throughput) or HuggingFace Transformers. It processes images and PDFs, outputting structured text or markdown.
---
## Installation
### Prerequisites
- CUDA 11.8 with PyTorch 2.6.0
- Python 3.12.9 (via conda recommended)
### Setup
```bash
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr
# Install PyTorch with CUDA 11.8
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
--index-url https://download.pytorch.org/whl/cu118
# Download vllm-0.8.5 whl from https://github.com/vllm-project/vllm/releases/tag/v0.8.5
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
```
### Alternative: upstream vLLM (nightly)
```bash
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
```
---
## Model Download
Model is available on HuggingFace: `deepseek-ai/DeepSeek-OCR`
```python
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")
```
---
## Inference: vLLM (Recommended for Production)
### Single Image
```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},  # <td>, </td> for table support
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```
### Batch Images
```python
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)

image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "

# One request per image, all sharing the same prompt
model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")},
    }
    for p in image_paths
]

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)

outputs = llm.generate(model_input, sampling_params)
for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)
```
### PDF Processing (via vLLM scripts)
```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
# Edit config.py: set INPUT_PATH, OUTPUT_PATH, model path, etc.
python run_dpsk_ocr_pdf.py # ~2500 tokens/s on A100-40G
```
### Benchmark Evaluation
```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_eval_batch.py
```
---
## Inference: HuggingFace Transformers
```python
import os

import torch
from transformers import AutoModel, AutoTokenizer

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name = "deepseek-ai/DeepSeek-OCR"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
model = model.eval().cuda().to(torch.bfloat16)

# Document to markdown
res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown. ",
    image_file="document.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
print(res)
```
### Transformers Script
```bash
cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py
```
---
## Prompt Reference
| Use Case | Prompt |
|---|---|
| Document → Markdown | `<image>\n<\|grounding\|>Convert the document to markdown. ` |
| General OCR | `<image>\n<\|grounding\|>OCR this image. ` |
| Free OCR (no layout) | `<image>\nFree OCR. ` |
| Parse figure/chart | `<image>\nParse the figure. ` |
| General description | `<image>\nDescribe this image in detail. ` |
| Grounded REC | `<image>\nLocate <\|ref\|>TARGET_TEXT<\|/ref\|> in the image. ` |
```python
PROMPTS = {
    "document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
    "ocr_image": "<image>\n<|grounding|>OCR this image. ",
    "free_ocr": "<image>\nFree OCR. ",
    "parse_figure": "<image>\nParse the figure. ",
    "describe": "<image>\nDescribe this image in detail. ",
    "rec": "<image>\nLocate <|ref|>{target}<|/ref|> in the image. ",
}
```
---
## Supported Resolutions
| Mode | Resolution | Vision Tokens |
|---|---|---|
| Tiny | 512×512 | 64 |
| Small | 640×640 | 100 |
| Base | 1024×1024 | 256 |
| Large | 1280×1280 | 400 |
| Gundam (dynamic) | n×640×640 + 1×1024×1024 | variable |
```python
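# Transformers: control resolution via infer() params.
# Assumes `model` and `tokenizer` are loaded as shown earlier and `prompt`
# holds one of the prompts from the Prompt Reference.
res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="image.jpg",
    base_size=1024,   # global view size: 512, 640, 1024, or 1280
    image_size=640,   # tile size for the n×640×640 crops in Gundam mode
    crop_mode=True,   # True = Gundam dynamic resolution
)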
```
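The fixed-resolution rows of the table are consistent with one vision token per 64×64 pixel area, that is (side / 64)². The sketch below checks that arithmetic and collects `infer()` keyword arguments per mode. The names `RESOLUTION_PRESETS` and `native_vision_tokens` are hypothetical, and setting `image_size` equal to `base_size` for the native modes is an assumption; only the Gundam values appear in the `infer()` call above.

```python
# Presets keyed by mode name. The native-mode image_size values are an ASSUMPTION
# (set equal to base_size); the Gundam values match the infer() call in this guide.
RESOLUTION_PRESETS = {
    "tiny": {"base_size": 512, "image_size": 512, "crop_mode": False},
    "small": {"base_size": 640, "image_size": 640, "crop_mode": False},
    "base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large": {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    "gundam": {"base_size": 1024, "image_size": 640, "crop_mode": True},
}


def native_vision_tokens(side: int) -> int:
    """Vision tokens for a square native-resolution input, per the table: (side / 64) ** 2."""
    if side <= 0 or side % 64:
        raise ValueError("side must be a positive multiple of 64")
    return (side // 64) ** 2


for mode in ("tiny", "small", "base", "large"):
    side = RESOLUTION_PRESETS[mode]["base_size"]
    print(f"{mode}: {side}x{side} -> {native_vision_tokens(side)} vision tokens")
```

A preset can then be unpacked into the call, for example `model.infer(tokenizer, prompt=prompt, image_file="image.jpg", **RESOLUTION_PRESETS["small"])`.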
---
## Configuration (vLLM)
Edit `DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py`:
```python
# Key config fields (example)
MODEL_PATH = "deepseek-ai/DeepSeek-OCR" # or local path
INPUT_PATH = "/data/input_images/"
OUTPUT_PATH = "/data/output/"
TENSOR_PARALLEL_SIZE = 1 # GPUs for tensor parallelism
MAX_TOKENS = 8192
TEMPERATURE = 0.0
NGRAM_SIZE = 30
WINDOW_SIZE = 90
```
---
## Common Patterns
### Process a Directory of Images
```python
from pathlib import Path

from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor


def batch_ocr(image_dir: str, output_dir: str, prompt: str = "<image>\nFree OCR."):
    """OCR every .png/.jpg in image_dir and write one .txt file per image."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )

    # Sort so the processing order is stable across runs
    image_files = sorted(Path(image_dir).glob("*.png")) + sorted(Path(image_dir).glob("*.jpg"))
    inputs = [
        {"prompt": prompt, "multi_modal_data": {"image": Image.open(f).convert("RGB")}}
        for f in image_files
    ]

    # Outputs come back in input order, so zip pairs each file with its text
    outputs = llm.generate(inputs, sampling_params)
    for img_path, output in zip(image_files, outputs):
        out_file = Path(output_dir) / (img_path.stem + ".txt")
        out_file.write_text(output.outputs[0].text, encoding="utf-8")
        print(f"Saved: {out_file}")


batch_ocr("/data/scans/", "/data/results/")
```
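The directory pattern above matches only lowercase `.png` and `.jpg` names. If scans arrive with mixed-case or `.jpeg` extensions, a case-insensitive listing is one option. This is a minimal sketch: `list_images` and `IMAGE_SUFFIXES` are hypothetical names, and the suffix set is an assumption to adjust for your data.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}  # assumed set; extend as needed


def list_images(image_dir: str) -> list[Path]:
    """Return image files in image_dir, matched case-insensitively and sorted by path."""
    return sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


# Demo on a throwaway directory so the example runs without real scans.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.PNG", "a.jpg", "notes.txt"):
        (Path(tmp) / name).touch()
    found = [p.name for p in list_images(tmp)]

print(found)
```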
### Convert PDF Pages to Markdown
```python
from io import BytesIO

import fitz  # PyMuPDF
from PIL import Image
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor


def pdf_to_markdown(pdf_path: str) -> list[str]:
    """Render each PDF page to an image and return one markdown string per page."""
    doc = fitz.open(pdf_path)

    llm = LLM(
        model="deepseek-ai/DeepSeek-OCR",
        enable_prefix_caching=False,
        mm_processor_cache_gb=0,
        logits_processors=[NGramPerReqLogitsProcessor],
    )
    sampling_params = SamplingParams(
        temperature=0.0,
        max_tokens=8192,
        extra_args=dict(ngram_size=30, window_size=90, whitelist_token_ids={128821, 128822}),
        skip_special_tokens=False,
    )

    prompt = "<image>\n<|grounding|>Convert the document to markdown. "
    inputs = []
    for page in doc:
        # Rasterize the page, then hand it to the model as an RGB PIL image
        pix = page.get_pixmap(dpi=150)
        img = Image.open(BytesIO(pix.tobytes("png"))).convert("RGB")
        inputs.append({"prompt": prompt, "multi_modal_data": {"image": img}})

    outputs = llm.generate(inputs, sampling_params)
    return [o.outputs[0].text for o in outputs]


pages = pdf_to_markdown("report.pdf")
full_markdown = "\n\n---\n\n".join(pages)
print(full_markdown)
```
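The `dpi=150` setting decides how many pixels each page becomes before it reaches the model. PDF page sizes are measured in points (72 per inch), so the rendered size is approximately points × dpi / 72. The helper below is a hypothetical illustration of that arithmetic; the renderer's exact rounding may differ by a pixel.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Approximate pixel size of a page rendered at `dpi` (PDF points are 1/72 inch)."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


# US Letter is 612 x 792 points.
print(rendered_size(612, 792))           # 150 dpi
print(rendered_size(612, 792, dpi=300))  # 300 dpi
```

Doubling the DPI quadruples the pixel count, which is worth keeping in mind when choosing between rendering time, memory, and legibility of small print.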
### Grounded Text Location (REC)
```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
).eval().cuda().to(torch.bfloat16)

target = "Total Amount"
prompt = f"<image>\nLocate <|ref|>{target}<|/ref|> in the image. "

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=False,
    save_results=True,
)
print(res)  # Returns bounding box / location info
```
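Grounded output usually needs post-processing before the boxes are usable. The sketch below assumes the model emits spans of the form `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>` with coordinates normalized to the range 0 to 999. Both the span format and the normalization are assumptions here, so check them against real output first. `parse_grounding` is a hypothetical name.

```python
import ast
import re

# ASSUMED output format: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
GROUNDING_PATTERN = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_grounding(text: str, image_width: int, image_height: int):
    """Return (label, (x1, y1, x2, y2)) tuples scaled to pixel coordinates."""
    results = []
    for label, raw_boxes in GROUNDING_PATTERN.findall(text):
        try:
            boxes = ast.literal_eval(raw_boxes)
            for x1, y1, x2, y2 in boxes:
                # ASSUMPTION: coordinates are normalized to 0..999
                results.append((label, (
                    int(x1 * image_width // 999),
                    int(y1 * image_height // 999),
                    int(x2 * image_width // 999),
                    int(y2 * image_height // 999),
                )))
        except (ValueError, SyntaxError, TypeError):
            continue  # skip spans whose coordinates do not parse
    return results


sample = "<|ref|>Total Amount<|/ref|><|det|>[[100, 200, 500, 250]]<|/det|>"
print(parse_grounding(sample, image_width=1998, image_height=999))
```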
---
## Troubleshooting
### `transformers` version conflict with vLLM
After installing `requirements.txt`, pip may report that vLLM 0.8.5 requires `transformers>=4.51.1`. Per the project docs, this message is safe to ignore when vLLM and Transformers run in the same environment.
### Flash Attention build errors
```bash
# Ensure torch is installed before flash-attn
pip install flash-attn==2.7.3 --no-build-isolation
```
### CUDA out of memory
- Use a smaller resolution: `base_size=512` or `base_size=640`
- Set `crop_mode=False` to disable multi-crop dynamic resolution
- Reduce batch size in vLLM inputs
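One way to reduce the batch size is to slice the input list and call `llm.generate` once per slice. `chunked` below is a hypothetical helper, and any particular chunk size is an arbitrary starting point to tune. Since vLLM schedules requests internally, chunking mainly bounds how many decoded images are held in memory at once.

```python
from collections.abc import Iterator, Sequence
from typing import TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


pages = [f"page{i}.png" for i in range(1, 6)]
print([list(chunk) for chunk in chunked(pages, 2)])
```

In the batch examples this would look like `for batch in chunked(model_input, 8): outputs.extend(llm.generate(batch, sampling_params))`, with `outputs` initialized to an empty list.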
### Model output is garbled / repetitive
Ensure `NGramPerReqLogitsProcessor` is passed to `LLM`, and that `ngram_size` and `window_size` are set in the `extra_args` of `SamplingParams`. This is required for proper decoding:
```python
from vllm import LLM
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    logits_processors=[NGramPerReqLogitsProcessor],
)
```
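To make the roles of `ngram_size`, `window_size`, and `whitelist_token_ids` concrete, here is a plain-Python sketch of windowed no-repeat n-gram blocking. It illustrates the general technique only and is not vLLM's implementation; `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(history, ngram_size, window_size, whitelist=frozenset()):
    """Tokens that would complete an n-gram already present in the recent window.

    history: token ids generated so far
    ngram_size: length of the n-grams that must not repeat
    window_size: how far back (in tokens) to look for earlier n-grams
    whitelist: token ids that are never banned (for example table tags)
    """
    if ngram_size < 1 or len(history) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(history[len(history) - (ngram_size - 1):])
    start = max(0, len(history) - window_size)
    banned = set()
    for i in range(start, len(history) - ngram_size + 1):
        ngram = tuple(history[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned - set(whitelist)


# After "1 2 3 1 2", emitting 3 again would repeat the trigram (1, 2, 3).
print(banned_next_tokens([1, 2, 3, 1, 2], ngram_size=3, window_size=90))
```

The whitelist matters for tables: cell delimiters legitimately repeat many times, so exempting them keeps the blocker from cutting a table short.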
### Tables not rendering correctly
Add table token IDs to the whitelist:
```python
whitelist_token_ids={128821, 128822} # <td> and </td>
```
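The `<td>` tokens suggest that tables come back as HTML. Assuming output of the form `<table><tr><td>...</td></tr></table>` (an assumption to verify against real output), the standard-library `html.parser` module can turn it into rows of cell text. `TableExtractor` and `extract_rows` are hypothetical names.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html: str) -> list[list[str]]:
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = "<table><tr><td>Item</td><td>Price</td></tr><tr><td>Pen</td><td>1.50</td></tr></table>"
print(extract_rows(sample))
```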
### Multi-GPU inference
```python
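from vllm import LLM
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor

llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    tensor_parallel_size=4,  # number of GPUs
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)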
```
---
## Key Files
```
DeepSeek-OCR-master/
├── DeepSeek-OCR-vllm/
│ ├── config.py # vLLM configuration
│ ├── run_dpsk_ocr_image.py # Single image inference
│ ├── run_dpsk_ocr_pdf.py # PDF batch inference
│ └── run_dpsk_ocr_eval_batch.py # Benchmark evaluation
└── DeepSeek-OCR-hf/
└── run_dpsk_ocr.py # HuggingFace Transformers inference
```