Facebook Page & Group Scraper

> Part of **[ScrapeClaw](https://www.scrapeclaw.cc/)** — a suite of production-ready, agentic social media scrapers for Instagram, YouTube, X/Twitter, and Facebook built with Python & Playwright, no API keys required.

3,891 stars

Best use case

Facebook Page & Group Scraper is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using Facebook Page & Group Scraper should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/facebook-scraper/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/arulmozhiv/facebook-scraper/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/facebook-scraper/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How Facebook Page & Group Scraper Compares

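| Feature / Agent | Facebook Page & Group Scraper | Standard Approach |
|-----------------|-------------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |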

Frequently Asked Questions

What does this skill do?

It discovers Facebook pages and public groups by location and category, scrapes them in a browser, and exports the results as JSON or CSV with downloaded thumbnails.

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# Facebook Page & Group Scraper

A browser-based Facebook page and group discovery and scraping tool.

```yaml
---
name: facebook-scraper
description: Discover and scrape Facebook pages and public groups from your browser.
emoji: 📘
version: 1.0.0
author: influenza
tags:
  - facebook
  - scraping
  - social-media
  - page-discovery
  - group-discovery
  - business-pages
metadata:
  clawdbot:
    requires:
      bins:
        - python3
        - chromium

    config:
      stateDirs:
        - data/output
        - data/queue
        - thumbnails
      outputFormats:
        - json
        - csv
---
```

## Overview

This skill provides a two-phase Facebook scraping system:

1. **Page/Group Discovery**  
2. **Browser Scraping** 

## Features

- 🔍 Discover Facebook pages and groups by location and category
- 🌐 Full browser simulation for accurate scraping
- 🛡️ Browser fingerprinting, human behavior simulation, and stealth scripts
- 📊 Page/group info, stats, images, and engagement data
- 💾 JSON/CSV export with downloaded thumbnails
- 🔄 Resume interrupted scraping sessions
- ⚡ Auto-skip private groups, low-like pages, and empty profiles
- 📂 Supports pages, groups, and public profiles via the `--type` flag

#### Getting Google API Credentials (Optional)

1. Go to [Google Cloud Console](https://console.cloud.google.com/)
2. Create a new project or select existing
3. Enable "Custom Search API"
4. Create API credentials → API Key
5. Go to [Programmable Search Engine](https://programmablesearchengine.google.com/)
6. Create a search engine with `facebook.com` as the site to search
7. Copy the Search Engine ID

## Usage

### Agent Tool Interface

For OpenClaw agent integration, the skill provides JSON output:

```bash
# Discover Facebook pages (returns JSON)
discover --location "Miami" --category "restaurant" --type page --output json

# Discover Facebook groups (returns JSON)
discover --location "New York" --category "fitness" --type group --output json

# Scrape single page (returns JSON)
scrape --page-name examplebusiness --output json

# Scrape single group (returns JSON)
scrape --page-name examplegroup --type group --output json
```

## Output Data

### Page/Group Data Structure

```json
{
  "page_name": "example_business",
  "display_name": "Example Business",
  "entity_type": "page",
  "category": "Restaurant",
  "subcategory": "Italian Restaurant",
  "about": "Family-owned Italian restaurant since 1985",
  "followers": 45000,
  "page_likes": 42000,
  "location": "Miami, FL",
  "address": "123 Main St, Miami, FL 33101",
  "phone": "+1-555-0123",
  "email": "info@example.com",
  "website": "https://example.com",
  "hours": "Mon-Sat 11AM-10PM",
  "is_verified": false,
  "page_tier": "mid",
  "profile_pic_local": "thumbnails/example_business/profile_abc123.jpg",
  "cover_photo_local": "thumbnails/example_business/cover_def456.jpg",
  "recent_posts": [
    {"post_url": "https://facebook.com/example_business/posts/123", "reactions": 320, "comments": 45, "shares": 12}
  ],
  "scrape_timestamp": "2026-02-20T14:30:00"
}
```
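
The `recent_posts` array can be reduced to a simple engagement summary. The sketch below is one way to do that, assuming every post carries the `reactions`, `comments`, and `shares` keys shown in the sample; `summarize_engagement` is a hypothetical helper, not part of the skill.

```python
from typing import Any


def summarize_engagement(record: dict[str, Any]) -> dict[str, float]:
    """Total and per-post average interactions across `recent_posts`."""
    posts = record.get("recent_posts") or []
    total = sum(
        post.get("reactions", 0) + post.get("comments", 0) + post.get("shares", 0)
        for post in posts
    )
    # Avoid dividing by zero when a page has no recent posts.
    average = total / len(posts) if posts else 0.0
    return {"posts": len(posts), "total": total, "average": average}


example = {
    "page_name": "example_business",
    "recent_posts": [
        {
            "post_url": "https://facebook.com/example_business/posts/123",
            "reactions": 320,
            "comments": 45,
            "shares": 12,
        }
    ],
}
print(summarize_engagement(example))
# → {'posts': 1, 'total': 377, 'average': 377.0}
```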

### Group Data Structure

```json
{
  "page_name": "example_group",
  "display_name": "Miami Fitness Community",
  "entity_type": "group",
  "about": "A community for fitness enthusiasts in Miami",
  "members": 15000,
  "privacy": "Public",
  "posts_per_day": 25,
  "location": "Miami",
  "page_tier": "mid",
  "profile_pic_local": "thumbnails/example_group/profile_abc123.jpg",
  "cover_photo_local": "thumbnails/example_group/cover_def456.jpg",
  "scrape_timestamp": "2026-02-20T14:30:00"
}
```

### Page Tiers

| Tier  | Likes/Members Range |
|-------|---------------------|
| nano  | < 1,000             |
| micro | 1,000 - 10,000      |
| mid   | 10,000 - 100,000    |
| macro | 100,000 - 1M        |
| mega  | > 1,000,000         |
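
The tier table maps directly onto a small lookup. The following is a minimal sketch, not the skill's own code: the table does not say which tier owns the boundary values, so this version assumes each lower bound is inclusive and that exactly 1,000,000 still counts as `macro`.

```python
def page_tier(count: int) -> str:
    """Map a like or member count to a tier name from the table above.

    Boundary handling is an assumption: lower bounds are inclusive,
    and only counts strictly above 1,000,000 are treated as "mega".
    """
    if count < 1_000:
        return "nano"
    if count < 10_000:
        return "micro"
    if count < 100_000:
        return "mid"
    if count <= 1_000_000:
        return "macro"
    return "mega"


print(page_tier(42_000))  # → mid
```

The sample page (42,000 likes) and sample group (15,000 members) both land in `mid`, matching the `page_tier` values in the data structures.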

### File Outputs

- **Queue files**: `data/queue/{location}_{category}_{type}_{timestamp}.json`
- **Scraped data**: `data/output/{page_name}.json`
- **Thumbnails**: `thumbnails/{page_name}/profile_*.jpg`, `thumbnails/{page_name}/cover_*.jpg`
- **Export files**: `data/export_{timestamp}.json`, `data/export_{timestamp}.csv`
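
If you want to post-process the per-page files yourself, something like the sketch below may be enough. It assumes each file under `data/output/` holds one JSON object shaped like the samples above; `load_records`, `write_csv`, and the column selection are illustrative and not part of the skill.

```python
import csv
import json
from pathlib import Path

# Columns present in both the page and group samples (an illustrative choice).
COLUMNS = ["page_name", "display_name", "entity_type", "page_tier", "scrape_timestamp"]


def load_records(output_dir: str = "data/output") -> list[dict]:
    """Read every *.json file in the output directory into a list of dicts."""
    records = []
    for path in sorted(Path(output_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            records.append(json.load(fh))
    return records


def write_csv(records: list[dict], csv_path: str, columns: list[str] = COLUMNS) -> None:
    """Write the selected columns to CSV; missing fields become empty cells."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(records)
```

Calling `write_csv(load_records(), "data/my_export.csv")` would flatten everything scraped so far into a single file.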

## Configuration

Edit `config/scraper_config.json`:

```json
{
  "google_search": {
    "enabled": true,
    "api_key": "",
    "search_engine_id": "",
    "queries_per_location": 3
  },
  "scraper": {
    "headless": false,
    "min_likes": 1000,
    "download_thumbnails": true,
    "max_thumbnails": 6
  },
  "cities": ["New York", "Los Angeles", "Miami", "Chicago"],
  "categories": ["restaurant", "retail", "fitness", "real-estate", "healthcare", "beauty"]
}
```
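
A quick sanity check on the config can catch empty credentials before a run. This is a hedged sketch: `config_problems` is a hypothetical helper, and the rules it applies (credentials required when `google_search.enabled` is true, non-empty `cities` and `categories`) are assumptions inferred from the sample, not documented validation behavior.

```python
def config_problems(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed scraper config."""
    problems = []

    search = config.get("google_search", {})
    if search.get("enabled"):
        # Discovery presumably cannot work without both values (assumption).
        if not search.get("api_key"):
            problems.append("google_search.api_key is empty")
        if not search.get("search_engine_id"):
            problems.append("google_search.search_engine_id is empty")

    if config.get("scraper", {}).get("min_likes", 0) < 0:
        problems.append("scraper.min_likes must not be negative")

    for key in ("cities", "categories"):
        if not config.get(key):
            problems.append(f"{key} is empty")

    return problems


sample = {
    "google_search": {"enabled": True, "api_key": "", "search_engine_id": ""},
    "scraper": {"min_likes": 1000},
    "cities": ["Miami"],
    "categories": ["restaurant"],
}
print(config_problems(sample))
# → ['google_search.api_key is empty', 'google_search.search_engine_id is empty']
```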

## Filters Applied

The scraper automatically filters out:

- ❌ Private groups
- ❌ Pages with fewer than 1,000 likes (configurable via `min_likes`)
- ❌ Deactivated or removed pages
- ❌ Non-existent pages/groups
- ❌ Already scraped entries (deduplication)
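
The record-level filters above can be pictured roughly as follows. This is a sketch under assumptions, not the scraper's actual logic: `skip_reason` is a hypothetical name, the like threshold is assumed to apply to pages only, and detecting deactivated or non-existent pages depends on live page content, so it is left out here.

```python
from typing import Any, Collection, Optional


def skip_reason(
    record: dict[str, Any],
    min_likes: int = 1000,
    seen: Collection[str] = (),
) -> Optional[str]:
    """Return why a record would be skipped, or None if it should be kept."""
    name = record.get("page_name")
    if not name:
        return "missing page_name"
    if name in seen:
        return "already scraped"
    if record.get("entity_type") == "group":
        # Only groups marked Public are kept.
        if record.get("privacy") != "Public":
            return "private group"
        return None
    if record.get("page_likes", 0) < min_likes:
        return "below min_likes"
    return None


print(skip_reason({"page_name": "tiny_shop", "entity_type": "page", "page_likes": 120}))
# → below min_likes
```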

## Troubleshooting

### Login Issues

- Ensure credentials are correct
- Handle verification codes when prompted
- Wait if rate limited (the script will auto-retry)

### No Pages Discovered

- Check Google API key and quota
- Verify Search Engine ID is configured for facebook.com
- Try different location/category combinations

### Rate Limiting

- Reduce scraping speed (increase delays)
- Use multiple Facebook accounts
- Run during off-peak hours
- **Use a residential proxy** (see below)

---

## 🌐 Residential Proxy Support

### Why Use a Residential Proxy?

Running a scraper at scale **without** a residential proxy will get your IP blocked fast. Here's why proxies are essential for long-running scrapes:

| Advantage | Description |
|-----------|-------------|
| **Avoid IP Bans** | Residential IPs look like real household users, not data-center bots. Facebook is far less likely to flag them. |
| **Automatic IP Rotation** | Each request (or session) gets a fresh IP, so rate-limits never stack up on one address. |
| **Geo-Targeting** | Route traffic through a specific country/city so scraped content matches the target audience's locale. |
| **Sticky Sessions** | Keep the same IP for a configurable window (e.g. 10 min) — critical for maintaining a Facebook login session. |
| **Higher Success Rate** | Rotating residential IPs deliver 95%+ success rates compared to ~30% with data-center proxies on Facebook. |
| **Long-Running Scrapes** | Scrape thousands of pages/groups over hours or days without interruption. |
| **Concurrent Scraping** | Run multiple browser instances across different IPs simultaneously. |

### Recommended Proxy Providers

We have affiliate partnerships with top residential proxy providers. Using these links supports continued development of this skill:

| Provider | Best For | Sign Up |
|----------|----------|---------|
| **Bright Data** | World's largest residential network, 72M+ IPs, enterprise-grade | 👉 [**Sign Up for Bright Data**](https://get.brightdata.com/o1kpd2da8iv4) |
| **IProyal** | Premium residential pool, pay-as-you-go, 195+ countries | 👉 [**Sign Up for IProyal**](https://iproyal.com/?r=ScrapeClaw) |
| **Storm Proxies** | Fast & reliable residential IPs, developer-friendly API | 👉 [**Sign Up for Storm Proxies**](https://stormproxies.com/clients/aff/go/scrapeclaw) |
| **NetNut** | ISP-grade residential network, 52M+ IPs, direct connectivity | 👉 [**Sign Up for NetNut**](https://netnut.io?ref=mwrlzwv) |


### Setup Steps

#### 1. Get Your Proxy Credentials

Sign up with any provider above, then grab:
- **Username** (from your provider dashboard)
- **Password** (from your provider dashboard)
- **Host** and **Port** are pre-configured per provider (or use custom)

#### 2. Configure Entirely via Environment Variables

```bash
export PROXY_ENABLED=true
export PROXY_PROVIDER=netnut       # brightdata | iproyal | stormproxies | netnut | custom
export PROXY_USERNAME=your_user
export PROXY_PASSWORD=your_pass
export PROXY_COUNTRY=us            # optional: two-letter country code
export PROXY_STICKY=true           # optional: keep same IP per session
```
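
For reference, these variables could be read into a plain dictionary along the following lines. This is only an illustration of the idea: it is not the skill's `ProxyManager.from_env()`, and both the helper name `read_proxy_env` and the set of spellings accepted as "true" are assumptions.

```python
import os
from typing import Mapping, Optional

# Spellings treated as true (an assumption; the skill may accept others).
TRUE_VALUES = {"1", "true", "yes", "on"}


def read_proxy_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the PROXY_* variables into a dict with parsed types."""
    env = os.environ if env is None else env

    def flag(name: str) -> bool:
        return env.get(name, "").strip().lower() in TRUE_VALUES

    port = env.get("PROXY_PORT")
    return {
        "enabled": flag("PROXY_ENABLED"),
        "provider": env.get("PROXY_PROVIDER") or None,
        "username": env.get("PROXY_USERNAME") or None,
        "password": env.get("PROXY_PASSWORD") or None,
        "country": env.get("PROXY_COUNTRY") or None,
        "sticky": flag("PROXY_STICKY"),
        "host": env.get("PROXY_HOST") or None,
        "port": int(port) if port else None,
    }
```

Passing a mapping instead of relying on `os.environ` keeps the function easy to test.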

#### 3. Provider-Specific Host/Port Defaults

These are auto-configured when you set the `provider` name:

| Provider | Host | Port |
|----------|------|------|
| Bright Data | `brd.superproxy.io` | `22225` |
| IProyal | `proxy.iproyal.com` | `12321` |
| Storm Proxies | `rotating.stormproxies.com` | `9999` |
| NetNut | `gw-resi.netnut.io` | `5959` |

Override with `"host"` and `"port"` in config or `PROXY_HOST` / `PROXY_PORT` env vars if your plan uses a different gateway.
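
The default-plus-override behavior described here can be sketched as a small lookup. The host and port values come from the table above; `resolve_gateway` itself is a hypothetical helper and may not match how the skill resolves the gateway internally.

```python
from typing import Optional

# Defaults copied from the provider table.
PROVIDER_DEFAULTS = {
    "brightdata": ("brd.superproxy.io", 22225),
    "iproyal": ("proxy.iproyal.com", 12321),
    "stormproxies": ("rotating.stormproxies.com", 9999),
    "netnut": ("gw-resi.netnut.io", 5959),
}


def resolve_gateway(
    provider: str,
    host: Optional[str] = None,
    port: Optional[int] = None,
) -> tuple[str, int]:
    """Pick host/port: explicit overrides win, provider defaults fill the rest."""
    default_host, default_port = PROVIDER_DEFAULTS.get(provider, (None, None))
    final_host = host or default_host
    final_port = port if port is not None else default_port
    if final_host is None or final_port is None:
        raise ValueError(f"provider {provider!r} needs an explicit host and port")
    return final_host, int(final_port)


print(resolve_gateway("netnut"))              # → ('gw-resi.netnut.io', 5959)
print(resolve_gateway("netnut", port=9595))   # → ('gw-resi.netnut.io', 9595)
```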

#### 4. Custom Proxy Provider

For any other proxy service, set provider to `custom` and supply host/port manually:

```json
{
  "proxy": {
    "enabled": true,
    "provider": "custom",
    "host": "your.proxy.host",
    "port": 8080,
    "username": "user",
    "password": "pass"
  }
}
```
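
A `proxy` block like this maps naturally onto the proxy settings dictionary that Playwright accepts (`server`, `username`, `password`). The sketch below shows one possible translation; `playwright_proxy_from_config` is a hypothetical helper, and the `http://` scheme is an assumption that your provider may require you to change.

```python
from typing import Optional


def playwright_proxy_from_config(config: dict) -> Optional[dict]:
    """Build a Playwright-style proxy dict, or None when the proxy is disabled."""
    proxy = config.get("proxy", {})
    if not proxy.get("enabled"):
        return None
    settings = {"server": f"http://{proxy['host']}:{proxy['port']}"}
    # Credentials are optional for proxies that authenticate by IP allowlist.
    if proxy.get("username"):
        settings["username"] = proxy["username"]
        settings["password"] = proxy.get("password", "")
    return settings


config = {
    "proxy": {
        "enabled": True,
        "provider": "custom",
        "host": "your.proxy.host",
        "port": 8080,
        "username": "user",
        "password": "pass",
    }
}
print(playwright_proxy_from_config(config))
# → {'server': 'http://your.proxy.host:8080', 'username': 'user', 'password': 'pass'}
```

With Playwright installed, the returned dictionary can be passed as the `proxy` argument to `playwright.chromium.launch(...)`.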

### Running the Scraper with Proxy

Once configured, the scraper picks up the proxy automatically — no extra flags needed:

```bash
# Discover and scrape as usual — proxy is applied automatically
python main.py discover --location "Miami" --category "restaurant" --type page
python main.py scrape --page-name examplebusiness

# The log will confirm proxy is active:
# INFO - Proxy enabled: <ProxyManager provider=netnut enabled host=gw-resi.netnut.io:5959>
# INFO - Browser using proxy: netnut → gw-resi.netnut.io:5959
```

### Using the Proxy Manager Programmatically

```python
from proxy_manager import ProxyManager

# From config (auto-reads config/scraper_config.json)
pm = ProxyManager.from_config()

# From environment variables
pm = ProxyManager.from_env()

# Manual construction
pm = ProxyManager(
    provider="netnut",
    username="your_user",
    password="your_pass",
    country="us",
    sticky=True
)

# For Playwright browser context
proxy = pm.get_playwright_proxy()
# → {"server": "http://gw-resi.netnut.io:5959", "username": "user-country-us-session-abc123", "password": "pass"}

# For requests / aiohttp
proxies = pm.get_requests_proxy()
# → {"http": "http://user:pass@host:port", "https": "http://user:pass@host:port"}

# Force new IP (rotates session ID)
pm.rotate_session()

# Debug info
print(pm.info())
```

### Best Practices for Long-Running Scrapes

1. **Always use sticky sessions** — Facebook requires consistent IPs during a login session. Set `"sticky": true`.
2. **Target the right country** — Set `"country": "us"` (or your target region) so Facebook serves content in the expected locale.
3. **Combine with existing anti-detection** — This scraper already has fingerprinting, stealth scripts, and human behavior simulation. The proxy is the final layer.
4. **Rotate sessions between accounts** — Call `pm.rotate_session()` when switching Facebook accounts to get a fresh IP.
5. **Use delays** — Even with proxies, respect `delay_between_profiles` in config (default 5-10s) to avoid aggressive patterns.
6. **Monitor your proxy dashboard** — All providers (Bright Data, IProyal, Storm Proxies, NetNut) have dashboards showing bandwidth usage and success rates.
