kordoc-korean-document-parser

Parse HWP, HWPX, and PDF Korean documents to Markdown using kordoc — supports CLI, programmatic API, and MCP server integration.

22 stars

by Aradotso

Best use case

kordoc-korean-document-parser is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using kordoc-korean-document-parser should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/kordoc-korean-document-parser/SKILL.md --create-dirs "https://raw.githubusercontent.com/Aradotso/trending-skills/main/skills/kordoc-korean-document-parser/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/kordoc-korean-document-parser/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How kordoc-korean-document-parser Compares

Feature / Agent	kordoc-korean-document-parser	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Parse HWP, HWPX, and PDF Korean documents to Markdown using kordoc — supports CLI, programmatic API, and MCP server integration.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# kordoc Korean Document Parser

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

kordoc is a TypeScript library and CLI for parsing Korean government documents (HWP 5.x, HWPX, PDF) into Markdown and structured `IRBlock[]` data. It handles proprietary HWP binary formats, table extraction, form field recognition, document diffing, and reverse Markdown→HWPX generation.

---

## Installation

```bash
# Core library
npm install kordoc

# PDF support (optional peer dependency)
npm install pdfjs-dist

# CLI (no install needed)
npx kordoc document.hwpx
```

---

## Core API

### Auto-detect and Parse Any Document

```typescript
import { parse } from "kordoc"
import { readFileSync } from "fs"

const buffer = readFileSync("document.hwpx")
const result = await parse(buffer.buffer) // ArrayBuffer required

if (result.success) {
  console.log(result.markdown)   // string: full Markdown
  console.log(result.blocks)     // IRBlock[]: structured data
  console.log(result.metadata)   // { title, author, createdAt, pageCount, ... }
  console.log(result.outline)    // OutlineItem[]: document structure
  console.log(result.warnings)   // ParseWarning[]: skipped elements
} else {
  console.error(result.error)    // string message
  console.error(result.code)     // ErrorCode: "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...
}
```

### Format-Specific Parsers

```typescript
import { parseHwpx, parseHwp, parsePdf, detectFormat } from "kordoc"

// Detect format first
const fmt = detectFormat(buffer.buffer) // "hwpx" | "hwp" | "pdf" | "unknown"

// Parse by format
const hwpxResult = await parseHwpx(buffer.buffer)
const hwpResult  = await parseHwp(buffer.buffer)
const pdfResult  = await parsePdf(buffer.buffer)
```

### Parse Options

```typescript
import { parse, type ParseOptions } from "kordoc"

const result = await parse(buffer.buffer, {
  pages: "1-3",          // page range string
  // pages: [1, 5, 10], // or specific page numbers
  ocr: async (pageImage, pageNumber, mimeType) => {
    // Pluggable OCR for image-based PDFs
    // pageImage: ArrayBuffer of the page image
    // myOcrService is a placeholder for your own OCR client
    return await myOcrService.recognize(pageImage)
  }
})
```
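The two forms accepted by `pages` can be pictured as resolving to a single list of page numbers. The sketch below is a hypothetical helper, not kordoc's implementation. It assumes page numbers are 1-based and that a range such as `"1-3"` is inclusive, neither of which is stated above.

```typescript
// Hypothetical helper for illustration only; not part of kordoc.
type PageSelection = string | number[]

function normalizePages(pages: PageSelection): number[] {
  if (Array.isArray(pages)) {
    // Explicit list: return a sorted copy
    return pages.slice().sort((a, b) => a - b)
  }
  // Range string of the form "start-end" (assumed inclusive and 1-based)
  const match = /^\s*(\d+)\s*-\s*(\d+)\s*$/.exec(pages)
  if (!match) {
    throw new Error(`Unsupported page range: ${pages}`)
  }
  const start = Number(match[1])
  const end = Number(match[2])
  if (start < 1 || end < start) {
    throw new Error(`Invalid page range: ${pages}`)
  }
  return Array.from({ length: end - start + 1 }, (_, i) => start + i)
}

console.log(normalizePages("1-3"))      // [1, 2, 3]
console.log(normalizePages([10, 1, 5])) // [1, 5, 10]
```

Validating the selection before calling `parse` keeps a malformed range from surfacing as a parser error later.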

---

## Working with IRBlocks

```typescript
import type { IRBlock, IRBlockType, IRTable, IRCell } from "kordoc"

// Assumes `result` is a successful parse (checked via `result.success`)
// IRBlock types: "heading" | "paragraph" | "table" | "list" | "image" | "separator"
for (const block of result.blocks) {
  if (block.type === "heading") {
    console.log(`H${block.level}: ${block.text}`)
    console.log(block.bbox)       // { x, y, width, height, page }
  }

  if (block.type === "table") {
    const table = block as IRTable
    for (const row of table.rows) {
      for (const cell of row) {
        console.log(cell.text, cell.colspan, cell.rowspan)
      }
    }
  }

  if (block.type === "paragraph") {
    console.log(block.text)
    console.log(block.style)      // InlineStyle: { bold, italic, fontSize, ... }
    console.log(block.pageNumber)
  }
}
```

### Convert Blocks Back to Markdown

```typescript
import { blocksToMarkdown } from "kordoc"

const markdown = blocksToMarkdown(result.blocks)
```

---

## Document Comparison

```typescript
import { readFileSync } from "fs"
import { compare } from "kordoc"

const bufA = readFileSync("v1.hwp").buffer
const bufB = readFileSync("v2.hwpx").buffer  // cross-format supported

const diff = await compare(bufA, bufB)

console.log(diff.stats)
// { added: 3, removed: 1, modified: 5, unchanged: 42 }

for (const d of diff.diffs) {
  // d.type: "added" | "removed" | "modified" | "unchanged"
  // d.blockA, d.blockB: IRBlock
  // d.cellDiffs: CellDiff[] for table blocks
  console.log(d.type, d.blockA?.text ?? d.blockB?.text)
}
```
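When only the changes matter, the `diffs` array can be reduced to a short change log. A minimal sketch follows. `BlockDiffLike` and `summarizeChanges` are local stand-ins invented for this example: they cover only the properties named in the comments above, and the `+`/`-`/`~` markers are an arbitrary choice.

```typescript
// Local stand-ins for kordoc's diff types, limited to the fields used here.
type ChangeType = "added" | "removed" | "modified" | "unchanged"

interface BlockLike {
  text?: string
}

interface BlockDiffLike {
  type: ChangeType
  blockA?: BlockLike
  blockB?: BlockLike
}

// Hypothetical helper: one line per changed block, unchanged blocks dropped.
function summarizeChanges(diffs: BlockDiffLike[]): string[] {
  const marks: Record<ChangeType, string> = {
    added: "+",
    removed: "-",
    modified: "~",
    unchanged: " ",
  }
  return diffs
    .filter(d => d.type !== "unchanged")
    .map(d => {
      // Prefer the newer text; fall back to the old text for removed blocks
      const text = d.blockB?.text ?? d.blockA?.text ?? ""
      return `${marks[d.type]} ${text.slice(0, 60)}`
    })
}

const exampleDiffs: BlockDiffLike[] = [
  { type: "added", blockB: { text: "New clause" } },
  { type: "unchanged", blockA: { text: "Same" }, blockB: { text: "Same" } },
  { type: "removed", blockA: { text: "Old clause" } },
]
console.log(summarizeChanges(exampleDiffs)) // ["+ New clause", "- Old clause"]
```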

---

## Form Field Extraction

```typescript
import { parse, extractFormFields } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const form = extractFormFields(result.blocks)

  console.log(form.confidence)  // 0.0–1.0
  for (const field of form.fields) {
    // { label: "성명", value: "홍길동", row: 0, col: 0 }
    console.log(`${field.label}: ${field.value}`)
  }
}
```
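Downstream code often wants the extracted fields as a label-to-value lookup and wants to ignore low-confidence results. The sketch below shows one way to do that. `FormFieldLike`, `FormResultLike`, and `fieldsToMap` are hypothetical names that mirror only the properties shown above, and the `0.5` threshold is an arbitrary assumption, not a value recommended by kordoc.

```typescript
// Local stand-ins for kordoc's FormField / FormResult, limited to the fields shown above.
interface FormFieldLike {
  label: string
  value: string
  row: number
  col: number
}

interface FormResultLike {
  confidence: number
  fields: FormFieldLike[]
}

// Hypothetical helper: returns null when the extraction confidence is too low.
function fieldsToMap(form: FormResultLike, minConfidence = 0.5): Map<string, string> | null {
  if (form.confidence < minConfidence) return null
  const byLabel = new Map<string, string>()
  for (const field of form.fields) {
    const label = field.label.trim()
    // Keep the first occurrence of each label; skip empty labels
    if (label !== "" && !byLabel.has(label)) {
      byLabel.set(label, field.value.trim())
    }
  }
  return byLabel
}

const sampleForm: FormResultLike = {
  confidence: 0.9,
  fields: [{ label: "성명", value: "홍길동", row: 0, col: 0 }],
}
console.log(fieldsToMap(sampleForm)?.get("성명")) // 홍길동
```

A `Map` is used instead of a plain object so that labels such as `constructor` cannot collide with inherited properties.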

---

## Markdown → HWPX Generation

```typescript
import { markdownToHwpx } from "kordoc"
import { writeFileSync } from "fs"

const markdown = `
# 제목

본문 내용입니다.

| 구분 | 내용 |
| --- | --- |
| 항목1 | 값1 |
| 항목2 | 값2 |
`

const hwpxBuffer = await markdownToHwpx(markdown)
writeFileSync("output.hwpx", Buffer.from(hwpxBuffer))
```

---

## CLI Usage

```bash
# Basic conversion — output to stdout
npx kordoc document.hwpx

# Save to file
npx kordoc document.hwp -o output.md

# Batch convert all PDFs to a directory
npx kordoc *.pdf -d ./converted/

# JSON output with blocks + metadata
npx kordoc report.hwpx --format json

# Parse specific pages only
npx kordoc report.hwpx --pages 1-3

# Watch mode — auto-convert new files
npx kordoc watch ./incoming -d ./output

# Watch with webhook notification on conversion
npx kordoc watch ./docs --webhook https://api.example.com/hook
```

---

## MCP Server Setup

Add to your MCP config (Claude Desktop, Cursor, Windsurf):

```json
{
  "mcpServers": {
    "kordoc": {
      "command": "npx",
      "args": ["-y", "kordoc-mcp"]
    }
  }
}
```

### Available MCP Tools

| Tool | Description |
|------|-------------|
| `parse_document` | Parse HWP/HWPX/PDF → Markdown + metadata + outline + warnings |
| `detect_format` | Detect file format via magic bytes |
| `parse_metadata` | Extract only metadata (fast, no full parse) |
| `parse_pages` | Parse a specific page range |
| `parse_table` | Extract the Nth table from a document |
| `compare_documents` | Diff two documents (cross-format supported) |
| `parse_form` | Extract form fields as structured JSON |

---

## TypeScript Types Reference

```typescript
import type {
  // Results
  ParseResult, ParseSuccess, ParseFailure,
  ErrorCode,        // "ENCRYPTED" | "ZIP_BOMB" | "IMAGE_BASED_PDF" | ...

  // Blocks
  IRBlock, IRBlockType, IRTable, IRCell, CellContext,

  // Metadata & structure
  DocumentMetadata, OutlineItem,
  ParseWarning, WarningCode,
  BoundingBox,      // { x, y, width, height, page }
  InlineStyle,      // { bold, italic, fontSize, color, ... }

  // Options
  ParseOptions, FileType,
  OcrProvider,      // async (image, pageNum, mime) => string
  WatchOptions,

  // Diff
  DiffResult, BlockDiff, CellDiff, DiffChangeType,

  // Forms
  FormField, FormResult,
} from "kordoc"
```

---

## Common Patterns

### Batch Process Files with Error Handling

```typescript
import { parse, detectFormat } from "kordoc"
import { readFileSync } from "fs"
import { glob } from "glob"

const files = await glob("./docs/**/*.{hwp,hwpx,pdf}")

for (const file of files) {
  const buffer = readFileSync(file)
  const fmt = detectFormat(buffer.buffer)

  if (fmt === "unknown") {
    console.warn(`Skipping unknown format: ${file}`)
    continue
  }

  const result = await parse(buffer.buffer)

  if (!result.success) {
    if (result.code === "ENCRYPTED") {
      console.warn(`Encrypted, skipping: ${file}`)
    } else if (result.code === "IMAGE_BASED_PDF") {
      console.warn(`Image-based PDF needs OCR: ${file}`)
    } else {
      console.error(`Failed: ${file} — ${result.error}`)
    }
    continue
  }

  console.log(`Parsed ${file}: ${result.blocks.length} blocks`)
}
```

### Extract All Tables from a Document

```typescript
import { parse } from "kordoc"
import type { IRTable } from "kordoc"

const result = await parse(buffer.buffer)
if (result.success) {
  const tables = result.blocks.filter(b => b.type === "table") as IRTable[]

  tables.forEach((table, i) => {
    console.log(`\n--- Table ${i + 1} ---`)
    for (const row of table.rows) {
      const cells = row.map(cell => cell.text.trim()).join(" | ")
      console.log(`| ${cells} |`)
    }
  })
}
```
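The loop above prints rows without a header separator, so its output is not a valid Markdown table. If the goal is a table that renders, a sketch along these lines may help. `CellLike` and `rowsToMarkdown` are hypothetical names for this example. The sketch treats the first row as the header, escapes `|` inside cells, and ignores `colspan`/`rowspan`, which Markdown pipe tables cannot express.

```typescript
// Local stand-in for the part of IRCell used here.
interface CellLike {
  text: string
}

function escapeCell(text: string): string {
  // Escape pipes and flatten line breaks so a cell stays on one table line
  return text.trim().replace(/\|/g, "\\|").replace(/\r?\n/g, "<br>")
}

// Hypothetical helper for illustration; not part of kordoc.
function rowsToMarkdown(rows: CellLike[][]): string {
  if (rows.length === 0) return ""
  const width = Math.max(...rows.map(row => row.length))
  const toLine = (cells: string[]): string => {
    // Pad short rows so every line has the same number of columns
    const padded = cells.slice()
    while (padded.length < width) padded.push("")
    return `| ${padded.join(" | ")} |`
  }
  const lines = rows.map(row => toLine(row.map(cell => escapeCell(cell.text))))
  // Insert the header separator after the first row
  lines.splice(1, 0, toLine(new Array<string>(width).fill("---")))
  return lines.join("\n")
}

console.log(rowsToMarkdown([
  [{ text: "구분" }, { text: "내용" }],
  [{ text: "항목1" }, { text: "값1 | 값2" }],
]))
```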

### OCR with Tesseract.js

```typescript
import { parse } from "kordoc"
import Tesseract from "tesseract.js"

const result = await parse(buffer.buffer, {
  ocr: async (pageImage, pageNumber, mimeType) => {
    // In Node.js, Tesseract.js accepts a Buffer directly; in the browser, pass a Blob instead
    const { data } = await Tesseract.recognize(Buffer.from(pageImage), "kor+eng")
    return data.text
  }
})
```

### Watch Mode Programmatic API

```typescript
import { watch } from "kordoc"

const watcher = watch("./incoming", {
  output: "./converted",
  webhook: process.env.WEBHOOK_URL,
  onFile: async (file, result) => {
    if (result.success) {
      console.log(`Converted: ${file}`)
    }
  }
})

// Stop watching
watcher.stop()
```

---

## Troubleshooting

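**`buffer.buffer` vs `Buffer`** — kordoc requires `ArrayBuffer`, not Node.js `Buffer`. Pass `readFileSync("file").buffer`, or use `.buffer` on a `Uint8Array`. Note that `.buffer` returns the whole backing allocation: if the view has a non-zero `byteOffset` or is shorter than its backing buffer (as pooled Node.js `Buffer`s can be), copy out only the bytes from `byteOffset` to `byteOffset + byteLength` first.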
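A defensive conversion can be sketched as follows. `toArrayBuffer` is a hypothetical helper, not a kordoc export. It copies exactly the bytes a `Uint8Array` (or Node.js `Buffer`, which is a `Uint8Array` subclass) covers into a standalone `ArrayBuffer`. Whether a given `readFileSync` result actually shares a larger allocation depends on the Node.js version and file size, so the copy is a precaution rather than a confirmed requirement.

```typescript
// Hypothetical helper for illustration; not part of kordoc.
function toArrayBuffer(view: Uint8Array): ArrayBuffer {
  const copy = new Uint8Array(view.byteLength)
  copy.set(view) // copies only the bytes the view covers
  return copy.buffer
}

// Simulate a pooled Buffer: a 3-byte view into a larger allocation
const pool = new Uint8Array([0xff, 0xff, 1, 2, 3, 0xff])
const slice = pool.subarray(2, 5)

console.log(slice.buffer.byteLength)         // 6: the whole backing allocation
console.log(toArrayBuffer(slice).byteLength) // 3: only the view's bytes
```

With this helper, a call would look like `parse(toArrayBuffer(readFileSync("document.hwpx")))`, at the cost of one extra copy of the file contents.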

**PDF tables not detected** — Line-based detection requires pdfjs-dist to be installed: `npm install pdfjs-dist`. For borderless tables, kordoc uses cluster-based heuristics automatically.

**`"IMAGE_BASED_PDF"` error** — The PDF contains scanned images with no text layer. Provide an `ocr` function in parse options.

**`"ENCRYPTED"` error** — HWP DRM/password-protected files cannot be parsed without the decryption key. No workaround.

**Korean characters garbled in output** — Ensure your terminal/file uses UTF-8 encoding. kordoc outputs UTF-8 Markdown by default.

**Large files are slow** — Use the `pages` option to parse only the pages you need: `parse(buf, { pages: "1-5" })`. If you only need metadata, use the `parse_metadata` MCP tool, which skips the full parse; `result.metadata` holds the same information but is only available after a full `parse`.

**HWP table columns wrong** — Update to v1.6.1+. Earlier versions had a 2-byte offset misalignment in LIST_HEADER parsing causing column explosion.

Related Skills

privacy-parser-pii-extraction

from Aradotso/trending-skills

Extract structured PII spans from text using the OpenAI Privacy Filter 1.5B model reversed — returns what, where, and which type instead of masking.

kami-document-design

from Aradotso/trending-skills

Design system and AI skill for generating beautiful warm-parchment documents (resumes, slides, one-pagers, portfolios, letters, white papers) with a consistent editorial aesthetic.

k-skill-korean-ai-tools

from Aradotso/trending-skills

A collection of Korean service automation skills for AI agents: SRT/KTX booking, KBO, Lotto, KakaoTalk, subway, HWP, postal codes, and more

humanize-korean-ai-text

from Aradotso/trending-skills

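A Claude Code skill that polishes AI-written Korean text so it reads as if a person wrote it: detects and surgically fixes 40+ sub-patterns of translationese, stock phrases, and structural AI patterns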

zeroboot-vm-sandbox

from Aradotso/trending-skills

Sub-millisecond VM sandboxes for AI agents using copy-on-write KVM forking via Zeroboot

yourvpndead-vpn-detection

from Aradotso/trending-skills

Android app that detects VPN/proxy servers (VLESS/xray/sing-box) via local SOCKS5 vulnerability, exposing exit IPs and server configs without root

xata-postgres-platform

from Aradotso/trending-skills

Expert skill for Xata open-source cloud-native Postgres platform with copy-on-write branching, scale-to-zero, and Kubernetes deployment

x-mentor-skill-nuwa

from Aradotso/trending-skills

AI-powered X (Twitter) content strategy skill that distills methodologies from 6 top creators + open-source algorithm data into actionable writing, growth, and monetization guidance.

wx-favorites-report

from Aradotso/trending-skills

End-to-end pipeline to extract, decrypt, and visualize WeChat Mac favorites from encrypted SQLite DB into an interactive HTML report.

wterm-web-terminal

from Aradotso/trending-skills

Web terminal emulator with Zig/WASM core, DOM rendering, and React/vanilla JS bindings