article-extractor

Extract clean article content from URLs (blog posts, articles, tutorials) and save as readable text. Use when user wants to download, extract, or save an article/blog post from a URL without ads, navigation, or clutter.

151 stars

by nicepkg

Best use case

article-extractor is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using article-extractor should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/article-extractor/SKILL.md --create-dirs "https://raw.githubusercontent.com/nicepkg/ai-workflow/main/workflows/content-creator-workflow/.claude/skills/article-extractor/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/article-extractor/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How article-extractor Compares

Feature / Agent	article-extractor	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

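What does this skill do?

It extracts the main content of an article or blog post from a URL, strips ads, navigation, and other clutter, and saves the result as readable text.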

Where can I find the source code?

The source code is on GitHub in the nicepkg/ai-workflow repository, under workflows/content-creator-workflow/.claude/skills/article-extractor/.

SKILL.md Source

# Article Extractor

This skill extracts the main content from web articles and blog posts, removing navigation, ads, newsletter signups, and other clutter, and saves the result as clean, readable text.

## When to Use This Skill

Activate when the user:
- Provides an article/blog URL and wants the text content
- Asks to "download this article"
- Wants to "extract the content from [URL]"
- Asks to "save this blog post as text"
- Needs clean article text without distractions

## How It Works

### Priority Order:
1. **Check if tools are installed** (reader or trafilatura)
2. **Download and extract article** using best available tool
3. **Clean up the content** (remove extra whitespace, format properly)
4. **Save to file** with article title as filename
5. **Confirm location** and show preview
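
The first step, picking the best available tool, can be sketched in Python as follows. This is a minimal illustration rather than part of the original skill: the function name `pick_tool` and its injectable `which` parameter are assumptions, and only the preference order (reader, then trafilatura, then the fallback) comes from the list above.

```python
import shutil

def pick_tool(which=shutil.which):
    """Return the first available extractor name, in the skill's priority order.

    `which` is injectable so the lookup can be exercised without touching PATH
    (a hypothetical design choice, not from the original skill).
    """
    for tool in ("reader", "trafilatura"):
        if which(tool) is not None:
            return tool
    return "fallback"

if __name__ == "__main__":
    print(f"Using: {pick_tool()}")
```

Returning a plain string keeps the later dispatch simple: the caller can branch on `"reader"`, `"trafilatura"`, or `"fallback"` and report which tool was used.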

## Installation Check

Check for article extraction tools in this order:

### Option 1: reader (Recommended - Mozilla's Readability)

```bash
command -v reader
```

If not installed:
```bash
npm install -g @mozilla/readability-cli
# or
npm install -g reader-cli
```

### Option 2: trafilatura (Python-based, very good)

```bash
command -v trafilatura
```

If not installed:
```bash
pip3 install trafilatura
```

### Option 3: Fallback (curl + simple parsing)

If neither tool is available, fall back to curl plus basic text extraction (less reliable, but it works without installing anything extra).

## Extraction Methods

### Method 1: Using reader (Best for most articles)

```bash
# Extract article
reader "URL" > article.txt
```

**Pros:**
- Based on Mozilla's Readability algorithm
- Excellent at removing clutter
- Preserves article structure

### Method 2: Using trafilatura (Best for blogs/news)

```bash
# Extract article
trafilatura --URL "URL" --output-format txt > article.txt

# Or with more options
trafilatura --URL "URL" --output-format txt --no-comments --no-tables > article.txt
```

**Pros:**
- Very accurate extraction
- Good with various site structures
- Handles multiple languages

**Options:**
- `--no-comments`: Skip comment sections
- `--no-tables`: Skip data tables
- `--precision`: Favor precision over recall
- `--recall`: Extract more content (may include some noise)
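
When the trafilatura call is assembled from a script, the options above can be mapped onto an argument list. The sketch below is an assumption-laden illustration: the helper names `build_trafilatura_cmd` and `run_trafilatura` are hypothetical, `--precision` and `--recall` are treated here as alternatives, and only the flags already listed in this section are used.

```python
import subprocess

def build_trafilatura_cmd(url, no_comments=False, no_tables=False, mode=None):
    """Build the argument list for the trafilatura CLI call shown above.

    `mode` may be None, "precision", or "recall" (treated as alternatives here).
    """
    if mode not in (None, "precision", "recall"):
        raise ValueError(f"unknown mode: {mode!r}")
    cmd = ["trafilatura", "--URL", url, "--output-format", "txt"]
    if no_comments:
        cmd.append("--no-comments")
    if no_tables:
        cmd.append("--no-tables")
    if mode:
        cmd.append(f"--{mode}")
    return cmd

def run_trafilatura(url, **options):
    """Run trafilatura and return the extracted text (needs the CLI on PATH)."""
    result = subprocess.run(
        build_trafilatura_cmd(url, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Passing the command as a list avoids shell quoting problems with URLs that contain `&` or `?`.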

### Method 3: Fallback (curl + basic parsing)

```bash
# Download and extract basic content
curl -s "URL" | python3 -c "
from html.parser import HTMLParser
import sys

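class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside'}
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
        self.skip_depth = 0     # nesting level inside skipped elements
        self.content_depth = 0  # nesting level inside content elements

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.content_tags:
            self.content_depth = max(0, self.content_depth - 1)

    def handle_data(self, data):
        # Keep text only inside content elements and outside skipped ones
        if self.content_depth and not self.skip_depth and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)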

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > article.txt
```

**Note:** This is less reliable, but it needs nothing beyond curl and python3.
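
To see what a tag-based fallback keeps and drops, the same idea can be run as a standalone Python script on an inline sample. This is a sketch under stated assumptions: the sample HTML is made up, the names `SimpleExtractor` and `extract_text` are hypothetical, and this variant tracks nesting depth so that text inside skipped elements (such as a paragraph inside `<nav>`) is dropped.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
CONTENT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}

class SimpleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []      # finished text blocks
        self.current = None   # pieces of the block being read, or None
        self.skip_depth = 0   # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CONTENT_TAGS and self.skip_depth == 0:
            self.flush()
            self.current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in CONTENT_TAGS:
            self.flush()

    def handle_data(self, data):
        if self.current is not None and self.skip_depth == 0:
            self.current.append(data)

    def flush(self):
        """Close the current block, collapsing runs of whitespace."""
        if self.current is not None:
            text = " ".join("".join(self.current).split())
            if text:
                self.blocks.append(text)
            self.current = None

def extract_text(html):
    parser = SimpleExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.blocks)

SAMPLE = """<html><body>
<nav><p>Home | About</p></nav>
<h1>Sample Title</h1>
<p>First <a href="#">linked</a> paragraph.</p>
<script>var tracking = 1;</script>
<footer><p>Copyright</p></footer>
</body></html>"""

print(extract_text(SAMPLE))
# Prints "Sample Title", a blank line, then "First linked paragraph."
```

Collecting text per block, rather than per text node, keeps inline links inside the paragraph they belong to.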

## Getting Article Title

Extract title for filename:

### Using reader:
```bash
# reader outputs markdown with title at top
TITLE=$(reader "URL" | head -n 1 | sed 's/^# //')
```

### Using trafilatura:
```bash
# Get metadata including title
TITLE=$(trafilatura --URL "URL" --json | python3 -c "import json, sys; print(json.load(sys.stdin)['title'])")
```

### Using curl (fallback):
```bash
# Note: grep -P needs GNU grep; the default macOS (BSD) grep does not support it
TITLE=$(curl -s "URL" | grep -oP '<title>\K[^<]+' | sed 's/ - .*//' | sed 's/ | .*//')
```
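
Where GNU grep is not available, the title lookup can be done with Python's standard library instead. The following is a minimal sketch, not the skill's own method: `TitleParser`, `extract_title`, and the `"Article"` default are hypothetical names chosen for illustration, and the site-name stripping mirrors the two `sed` calls above.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)

def extract_title(html, default="Article"):
    parser = TitleParser()
    parser.feed(html)
    title = " ".join("".join(parser.parts).split())
    # Drop a trailing site name after " - " or " | ", like the sed calls above
    for sep in (" - ", " | "):
        title = title.split(sep, 1)[0]
    return title or default

print(extract_title("<html><head><title>My Post - Example Blog</title></head></html>"))
# Prints: My Post
```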

## Filename Creation

Clean title for filesystem:

```bash
# Get title
TITLE="Article Title from Website"

# Clean for filesystem: replace / : | with -, delete ? " < >, limit length
# (tr needs -d to delete characters; an empty replacement set is an error)
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-100 | sed 's/ *$//')

# Add extension
FILENAME="${FILENAME}.txt"
```
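
The same cleanup can be expressed in Python when the surrounding workflow is scripted there. This is an illustrative sketch: the function name `safe_filename` and the `"article"` default for empty titles are assumptions, while the character sets and the 100-character limit follow the shell version above.

```python
import re

def safe_filename(title, max_len=100, default="article"):
    """Turn an article title into a filesystem-friendly .txt filename."""
    name = re.sub(r"[/:|]", "-", title)   # replace path-like separators with -
    name = re.sub(r'[?"<>]', "", name)    # delete characters some filesystems reject
    name = name[:max_len].strip()         # limit length, then trim spaces
    return (name or default) + ".txt"

print(safe_filename('What is "AI"? A guide: part 1/2'))
# Prints: What is AI A guide- part 1-2.txt
```

Falling back to a default name avoids writing a file called just `.txt` when no title could be found.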

## Complete Workflow

```bash
ARTICLE_URL="https://example.com/article"

# Check for tools
if command -v reader &> /dev/null; then
    TOOL="reader"
    echo "Using reader (Mozilla Readability)"
elif command -v trafilatura &> /dev/null; then
    TOOL="trafilatura"
    echo "Using trafilatura"
else
    TOOL="fallback"
    echo "Using fallback method (may be less accurate)"
fi

# Extract article
case $TOOL in
    reader)
        # Get content
        reader "$ARTICLE_URL" > temp_article.txt

        # Get title (first line after # in markdown)
        TITLE=$(head -n 1 temp_article.txt | sed 's/^# //')
        ;;

    trafilatura)
        # Get title from metadata
        METADATA=$(trafilatura --URL "$ARTICLE_URL" --json)
        TITLE=$(echo "$METADATA" | python3 -c "import json, sys; print(json.load(sys.stdin).get('title', 'Article'))")

        # Get clean content
        trafilatura --URL "$ARTICLE_URL" --output-format txt --no-comments > temp_article.txt
        ;;

    fallback)
        # Get title
        TITLE=$(curl -s "$ARTICLE_URL" | grep -oP '<title>\K[^<]+' | head -n 1)
        TITLE=${TITLE%% - *}  # Remove site name
        TITLE=${TITLE%% | *}  # Remove site name (alternate)

        # Get content (basic extraction)
        curl -s "$ARTICLE_URL" | python3 -c "
from html.parser import HTMLParser
import sys

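class ArticleExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.content = []
        self.skip_tags = {'script', 'style', 'nav', 'header', 'footer', 'aside', 'form'}
        self.content_tags = {'p', 'article', 'main', 'h1', 'h2', 'h3'}
        self.skip_depth = 0     # nesting level inside skipped elements
        self.content_depth = 0  # nesting level inside content elements

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.skip_depth += 1
        elif tag in self.content_tags:
            self.content_depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.content_tags:
            self.content_depth = max(0, self.content_depth - 1)

    def handle_data(self, data):
        # Keep text only inside content elements and outside skipped ones
        if self.content_depth and not self.skip_depth and data.strip():
            self.content.append(data.strip())

    def get_content(self):
        return '\n\n'.join(self.content)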

parser = ArticleExtractor()
parser.feed(sys.stdin.read())
print(parser.get_content())
" > temp_article.txt
        ;;
esac

# Clean filename: replace / : | with -, delete ? " < > (tr needs -d to delete)
FILENAME=$(echo "$TITLE" | tr '/:|' '---' | tr -d '?"<>' | cut -c 1-80 | sed 's/^ *//; s/ *$//')
[ -n "$FILENAME" ] || FILENAME="article"  # Fall back if no title was found
FILENAME="${FILENAME}.txt"

# Move to final filename
mv temp_article.txt "$FILENAME"

# Show result
echo "✓ Extracted article: $TITLE"
echo "✓ Saved to: $FILENAME"
echo ""
echo "Preview (first 10 lines):"
head -n 10 "$FILENAME"
```

## Error Handling

### Common Issues

**1. Tool not installed**
- Try alternate tool (reader → trafilatura → fallback)
- Offer to install: "Install reader with: npm install -g reader-cli"

**2. Paywall or login required**
- Extraction tools may fail
- Inform user: "This article requires authentication. Cannot extract."

**3. Invalid URL**
- Check URL format
- Try with and without redirects

**4. No content extracted**
- Site may use heavy JavaScript
- Try fallback method
- Inform user if extraction fails

**5. Special characters in title**
- Clean title for filesystem
- Remove: `/`, `:`, `?`, `"`, `<`, `>`, `|`
- Replace with `-` or remove

## Output Format

### Saved File Contains:
- Article title (if available)
- Author (if available from tool)
- Main article text
- Section headings
- No navigation, ads, or clutter

### What Gets Removed:
- Navigation menus
- Ads and promotional content
- Newsletter signup forms
- Related articles sidebars
- Comment sections (optional)
- Social media buttons
- Cookie notices

## Tips for Best Results

**1. Use reader for most articles**
- Best all-around tool
- Based on Firefox Reader View
- Works on most news sites and blogs

**2. Use trafilatura for:**
- Academic articles
- News sites
- Blogs with complex layouts
- Non-English content

**3. Fallback method limitations:**
- May include some noise
- Less accurate paragraph detection
- Better than nothing for simple sites

**4. Check extraction quality:**
- Always show preview to user
- Ask if it looks correct
- Offer to try different tool if needed

## Example Usage

**Simple extraction:**
```bash
# User: "Extract https://example.com/article"
reader "https://example.com/article" > temp.txt
TITLE=$(head -n 1 temp.txt | sed 's/^# //')
FILENAME="$(echo "$TITLE" | tr '/' '-').txt"
mv temp.txt "$FILENAME"
echo "✓ Saved to: $FILENAME"
```

**With error handling:**
```bash
if ! reader "$URL" > temp.txt 2>/dev/null; then
    if command -v trafilatura &> /dev/null; then
        trafilatura --URL "$URL" --output-format txt > temp.txt
    else
        echo "Error: Could not extract article. Install reader or trafilatura."
        exit 1
    fi
fi
```
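
The try-one-tool-then-the-next pattern generalizes to any number of extractors. The sketch below is a hypothetical Python rendering of that idea: `extract_with_fallback` is not part of the skill, and the extractor callables passed to it are stand-ins (in practice they would wrap the reader, trafilatura, or curl commands shown earlier).

```python
def extract_with_fallback(url, extractors):
    """Try each (name, func) pair in order.

    Returns (name, text) from the first extractor that yields non-empty text
    and raises RuntimeError if all of them fail.
    """
    errors = []
    for name, func in extractors:
        try:
            text = func(url)
        except Exception as exc:  # tool missing, network error, non-zero exit
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty output")
    raise RuntimeError("Could not extract article. " + "; ".join(errors))

# Usage with stand-in extractors (hypothetical, for illustration only)
def _broken(url):
    raise FileNotFoundError("reader not installed")

name, text = extract_with_fallback(
    "https://example.com/article",
    [("reader", _broken), ("trafilatura", lambda url: "Body text")],
)
print(f"Using {name}: {text}")
# Prints: Using trafilatura: Body text
```

Treating empty output as a failure covers the case where a tool exits successfully but finds no article text, for example on JavaScript-heavy pages.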

## Best Practices

- ✅ Always show preview after extraction (first 10 lines)
- ✅ Verify extraction succeeded before saving
- ✅ Clean filename for filesystem compatibility
- ✅ Try fallback method if primary fails
- ✅ Inform user which tool was used
- ✅ Keep filename length reasonable (< 100 chars)

## After Extraction

Display to user:
1. "✓ Extracted: [Article Title]"
2. "✓ Saved to: [filename]"
3. Show preview (first 10-15 lines)
4. File size and location
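
The four display items above could be assembled by a small helper like the one below. It is a sketch with assumed names (`extraction_summary`, `preview_lines`) and an assumed message layout; only the items themselves come from the list.

```python
import os

def extraction_summary(title, path, preview_lines=10):
    """Build the post-extraction message: title, location, size, and a preview."""
    with open(path, encoding="utf-8") as fh:
        # zip stops after preview_lines lines without reading the whole file
        preview = [line.rstrip("\n") for _, line in zip(range(preview_lines), fh)]
    lines = [
        f"✓ Extracted: {title}",
        f"✓ Saved to: {os.path.abspath(path)}",
        f"  Size: {os.path.getsize(path)} bytes",
        "",
        f"Preview (first {len(preview)} lines):",
        *preview,
    ]
    return "\n".join(lines)
```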

Ask if needed:
- "Would you like me to also create a Ship-Learn-Next plan from this?" (if using ship-learn-next skill)
- "Should I extract another article?"

Related Skills

color-palette-extractor

151

from nicepkg/ai-workflow

Extract color palettes from images, websites, or designs. Identifies dominant colors, generates complementary schemes, and exports in multiple formats (HEX, RGB, HSL, Tailwind, CSS variables). Use when users need color schemes from images, brand colors, or design system palettes.

competitive-ads-extractor

151

from nicepkg/ai-workflow

Extracts and analyzes competitors' ads from ad libraries (Facebook, LinkedIn, etc.) to understand what messaging, problems, and creative approaches are working. Helps inspire and improve your own ad campaigns.

youtube-to-markdown

151

from nicepkg/ai-workflow

Use when the user asks to extract, get, or fetch YouTube video transcripts, subtitles, or captions. Writes video details and the transcript into a structured markdown file.

youtube-seo-optimizer

151

from nicepkg/ai-workflow

Optimize YouTube videos for search and discovery. Generates SEO-optimized titles, descriptions, tags, hashtags, and chapters. Includes keyword research and competitor analysis. Use when publishing videos, improving discoverability, or optimizing existing content.

webfluence

151

from nicepkg/ai-workflow

Content web architecture framework. Use when diagnosing offer doc usage, content-to-conversion pathways, or why someone isn't getting sales despite traffic.

video-to-gif

151

from nicepkg/ai-workflow

Convert video clips to optimized GIFs with speed control, cropping, text overlays, and file size optimization. Create perfect GIFs for social media, documentation, and presentations.

video-title-optimizer

151

from nicepkg/ai-workflow

Optimize video titles for maximum click-through rate (CTR) and YouTube/TikTok SEO. Generates multiple title variations balancing curiosity, keywords, and platform best practices. Use when naming videos, improving CTR, or A/B testing titles.

video-script-writer

151

from nicepkg/ai-workflow

Write engaging video scripts for YouTube, TikTok, and other platforms. Creates complete scripts with hooks, main content, and CTAs. Supports various formats including tutorials, vlogs, reviews, explainers, and storytelling. Use when creating video scripts, writing YouTube content, or planning video structure.

video-script-collaborial

151

from nicepkg/ai-workflow

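Converts video scripts into conversational language better suited to actual recording, removing written-style phrasing and adding a natural, approachable feel. Use this skill when the user mentions "video script", "recording", "conversational", "more natural", "like talking", or "too formal".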

video-hook-generator

151

from nicepkg/ai-workflow

Generate attention-grabbing hooks for the first 3 seconds of videos. The hook determines if viewers stay or scroll. Creates multiple hook variations for A/B testing. Use when crafting video openings, improving retention, or creating scroll-stopping content for YouTube, TikTok, or Reels.

youtube-downloader

151

from nicepkg/ai-workflow

Download YouTube videos with customizable quality and format options. Use this skill when the user asks to download, save, or grab YouTube videos. Supports various quality settings (best, 1080p, 720p, 480p, 360p), multiple formats (mp4, webm, mkv), and audio-only downloads as MP3.

video-comparer

151

from nicepkg/ai-workflow

This skill should be used when comparing two videos to analyze compression results or quality differences. Generates interactive HTML reports with quality metrics (PSNR, SSIM) and frame-by-frame visual comparisons. Triggers when users mention "compare videos", "video quality", "compression analysis", "before/after compression", or request quality assessment of compressed videos.