data-scraper-agent
构建一个全自动化的AI驱动数据收集代理,适用于任何公共来源——招聘网站、价格信息、新闻、GitHub、体育赛事等任何内容。按计划进行抓取,使用免费LLM(Gemini Flash)丰富数据,将结果存储在Notion/Sheets/Supabase中,并从用户反馈中学习。完全免费在GitHub Actions上运行。适用于用户希望自动监控、收集或跟踪任何公共数据的场景。
About this skill
This skill enables an AI agent to deploy and manage a fully automated, production-ready data collection system. Designed to scrape data from any public web source—including job boards, e-commerce sites, news portals, GitHub repositories, and sports event results—it operates on a scheduled basis. The collected data is then enhanced and categorized using a free large language model (Gemini Flash), before being stored in user-preferred databases like Notion, Google Sheets, or Supabase. The entire operation runs without hosting costs, leveraging GitHub Actions, and intelligently refines its performance based on user feedback. It implements a robust three-layer architecture: COLLECT (Scraper), ENRICH (AI/LLM for scoring, summarizing, classifying), and STORE (Database).
Best use case
Automating the monitoring, collection, and tracking of public data. Examples include tracking job postings, price changes, news updates, software repository activities, sports scores, event listings, and general public information. It's ideal for users who need continuous, updated data feeds without manual intervention or recurring hosting expenses.
构建一个全自动化的AI驱动数据收集代理,适用于任何公共来源——招聘网站、价格信息、新闻、GitHub、体育赛事等任何内容。按计划进行抓取,使用免费LLM(Gemini Flash)丰富数据,将结果存储在Notion/Sheets/Supabase中,并从用户反馈中学习。完全免费在GitHub Actions上运行。适用于用户希望自动监控、收集或跟踪任何公共数据的场景。
A continuously updated, enriched dataset from specified public sources, automatically stored in a chosen database (Notion, Sheets, or Supabase). The user will receive timely, structured information, and the agent will adapt and improve its data collection strategy over time based on interaction.
Practical example
Example input
Build a bot that monitors new job postings for 'AI Engineer' roles on LinkedIn and stores them in a Google Sheet. I need to track the price of NVIDIA RTX 4090 graphics cards on major retailers every day and notify me of significant drops. Store the data in Notion. Set up a system to collect daily news headlines related to 'renewable energy' from BBC News and summarize them using an AI, saving the output to Supabase. Monitor the 'issues' section of the 'everything-claude-code' GitHub repository for new issues and classify them by severity.
Example output
Acknowledged. I'm setting up a scheduled data scraper to monitor 'AI Engineer' job postings on LinkedIn. The data will be enriched using Gemini Flash and stored in your specified Google Sheet. I will provide an update once the initial setup is complete and data starts flowing. Price tracking for NVIDIA RTX 4090 cards across major retailers is now active. I will monitor daily for price changes, enrich the data, and store it in your Notion database. You will be notified of any significant price drops as they occur. The renewable energy news scraper is now deployed. Daily headlines from BBC News will be collected, summarized by AI, and stored in Supabase. I'll inform you when the first batch of enriched data is available.
When to use this skill
- When a user explicitly asks to "build a bot to check...", "monitor X for me", or "collect data from...".
- When a user wants to track specific items like job opportunities, product prices, news articles, GitHub repositories, sports scores, events, or listings.
- When a user seeks to automate data collection without incurring hosting costs.
- When a user desires an agent that becomes "smarter" over time based on their feedback and decisions.
When not to use this skill
- When the desired data is behind private logins, requires authentication beyond basic public access, or is explicitly protected by Terms of Service against scraping.
- When data needs to be collected from sources requiring complex CAPTCHA solving or advanced anti-bot measures that are not easily circumvented.
- When the primary goal is real-time, ultra-low-latency data acquisition for extremely high-frequency trading or similar applications, where GitHub Actions might introduce slight delays.
- When the scraping activity violates legal or ethical guidelines (e.g., scraping personal identifiable information without consent).
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/data-scraper-agent/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How data-scraper-agent Compares
| Feature / Agent | data-scraper-agent | Standard Approach |
|---|---|---|
| Platform Support | Claude | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | medium | N/A |
Frequently Asked Questions
What does this skill do?
构建一个全自动化的AI驱动数据收集代理,适用于任何公共来源——招聘网站、价格信息、新闻、GitHub、体育赛事等任何内容。按计划进行抓取,使用免费LLM(Gemini Flash)丰富数据,将结果存储在Notion/Sheets/Supabase中,并从用户反馈中学习。完全免费在GitHub Actions上运行。适用于用户希望自动监控、收集或跟踪任何公共数据的场景。
Which AI agents support this skill?
This skill is designed for Claude.
How difficult is it to install?
The installation complexity is rated as medium. You can find the installation instructions above.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
AI Agents for Startups
Explore AI agent skills for startup validation, product research, growth experiments, documentation, and fast execution with small teams.
SKILL.md Source
# 数据抓取代理
构建一个生产就绪、AI驱动的数据收集代理,适用于任何公共数据源。
按计划运行,使用免费LLM丰富结果,存储到数据库,并随时间推移不断改进。
**技术栈:Python · Gemini Flash (免费) · GitHub Actions (免费) · Notion / Sheets / Supabase**
## 何时激活
* 用户想要抓取或监控任何公共网站或API
* 用户说"构建一个检查...的机器人"、"为我监控X"、"从...收集数据"
* 用户想要跟踪工作、价格、新闻、仓库、体育比分、事件、列表
* 用户询问如何自动化数据收集而无需支付托管费用
* 用户想要一个能根据他们的决策随时间推移变得更智能的代理
## 核心概念
### 三层架构
每个数据抓取代理都有三层:
```
COLLECT → ENRICH → STORE
│ │ │
Scraper AI (LLM) Database
runs on scores/ Notion /
schedule summarises Sheets /
& classifies Supabase
```
### 免费技术栈
| 层级 | 工具 | 原因 |
|---|---|---|
| **抓取** | `requests` + `BeautifulSoup` | 无成本,覆盖80%的公共网站 |
| **JS渲染的网站** | `playwright` (免费) | 当HTML抓取失败时使用 |
| **AI丰富** | 通过REST API的Gemini Flash | 500次请求/天,100万令牌/天 — 免费 |
| **存储** | Notion API | 免费层级,用于审查的优秀UI |
| **调度** | GitHub Actions cron | 对公共仓库免费 |
| **学习** | 仓库中的JSON反馈文件 | 零基础设施,在git中持久化 |
### AI模型后备链
构建代理以在配额耗尽时自动在Gemini模型间回退:
```
gemini-2.0-flash-lite (30 RPM) →
gemini-2.0-flash (15 RPM) →
gemini-2.5-flash (10 RPM) →
gemini-flash-lite-latest (fallback)
```
### 批量API调用以提高效率
切勿为每个项目单独调用LLM。始终批量处理:
```python
# BAD: 33 API calls for 33 items
for item in items:
result = call_ai(item) # 33 calls → hits rate limit
# GOOD: 7 API calls for 33 items (batch size 5)
for batch in chunks(items, size=5):
results = call_ai(batch) # 7 calls → stays within free tier
```
***
## 工作流程
### 步骤 1: 理解目标
询问用户:
1. **收集什么:** "数据源是什么?URL / API / RSS / 公共端点?"
2. **提取什么:** "哪些字段重要?标题、价格、URL、日期、分数?"
3. **如何存储:** "结果应该存储在哪里?Notion、Google Sheets、Supabase,还是本地文件?"
4. **如何丰富:** "您希望AI对每个项目进行评分、总结、分类或匹配吗?"
5. **频率:** "应该多久运行一次?每小时、每天、每周?"
常见的提示示例:
* 招聘网站 → 根据简历评分相关性
* 产品价格 → 降价时发出警报
* GitHub仓库 → 总结新版本
* 新闻源 → 按主题+情感分类
* 体育结果 → 提取统计数据到跟踪器
* 活动日历 → 按兴趣筛选
***
### 步骤 2: 设计代理架构
为用户生成以下目录结构:
```
my-agent/
├── config.yaml # 用户自定义此文件(关键词、过滤器、偏好设置)
├── profile/
│ └── context.md # AI 使用的用户上下文(简历、兴趣、标准)
├── scraper/
│ ├── __init__.py
│ ├── main.py # 协调器:抓取 → 丰富 → 存储
│ ├── filters.py # 基于规则的预过滤器(快速,在 AI 处理之前)
│ └── sources/
│ ├── __init__.py
│ └── source_name.py # 每个数据源一个文件
├── ai/
│ ├── __init__.py
│ ├── client.py # Gemini REST 客户端,带模型回退
│ ├── pipeline.py # 批量 AI 分析
│ ├── jd_fetcher.py # 从 URL 获取完整内容(可选)
│ └── memory.py # 从用户反馈中学习
├── storage/
│ ├── __init__.py
│ └── notion_sync.py # 或 sheets_sync.py / supabase_sync.py
├── data/
│ └── feedback.json # 用户决策历史(自动更新)
├── .env.example
├── setup.py # 一次性数据库/模式创建
├── enrich_existing.py # 对旧行进行 AI 分数回填
├── requirements.txt
└── .github/
└── workflows/
└── scraper.yml # GitHub Actions 计划任务
```
***
### 步骤 3: 构建抓取器源
适用于任何数据源的模板:
```python
# scraper/sources/my_source.py
"""
[Source Name] — scrapes [what] from [where].
Method: [REST API / HTML scraping / RSS feed]
"""
import requests
from bs4 import BeautifulSoup
from datetime import datetime, timezone
from scraper.filters import is_relevant
HEADERS = {
"User-Agent": "Mozilla/5.0 (compatible; research-bot/1.0)",
}
def fetch() -> list[dict]:
"""
Returns a list of items with consistent schema.
Each item must have at minimum: name, url, date_found.
"""
results = []
# ---- REST API source ----
resp = requests.get("https://api.example.com/items", headers=HEADERS, timeout=15)
if resp.status_code == 200:
for item in resp.json().get("results", []):
if not is_relevant(item.get("title", "")):
continue
results.append(_normalise(item))
return results
def _normalise(raw: dict) -> dict:
"""Convert raw API/HTML data to the standard schema."""
return {
"name": raw.get("title", ""),
"url": raw.get("link", ""),
"source": "MySource",
"date_found": datetime.now(timezone.utc).date().isoformat(),
# add domain-specific fields here
}
```
**HTML抓取模式:**
```python
soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select("[class*='listing']"):
title = card.select_one("h2, h3").get_text(strip=True)
link = card.select_one("a")["href"]
if not link.startswith("http"):
link = f"https://example.com{link}"
```
**RSS源模式:**
```python
import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
title = item.findtext("title", "")
link = item.findtext("link", "")
```
***
### 步骤 4: 构建Gemini AI客户端
````python
# ai/client.py
import os, json, time, requests
_last_call = 0.0
MODEL_FALLBACK = [
"gemini-2.0-flash-lite",
"gemini-2.0-flash",
"gemini-2.5-flash",
"gemini-flash-lite-latest",
]
def generate(prompt: str, model: str = "", rate_limit: float = 7.0) -> dict:
"""Call Gemini with auto-fallback on 429. Returns parsed JSON or {}."""
global _last_call
api_key = os.environ.get("GEMINI_API_KEY", "")
if not api_key:
return {}
elapsed = time.time() - _last_call
if elapsed < rate_limit:
time.sleep(rate_limit - elapsed)
models = [model] + [m for m in MODEL_FALLBACK if m != model] if model else MODEL_FALLBACK
_last_call = time.time()
for m in models:
url = f"https://generativelanguage.googleapis.com/v1beta/models/{m}:generateContent?key={api_key}"
payload = {
"contents": [{"parts": [{"text": prompt}]}],
"generationConfig": {
"responseMimeType": "application/json",
"temperature": 0.3,
"maxOutputTokens": 2048,
},
}
try:
resp = requests.post(url, json=payload, timeout=30)
if resp.status_code == 200:
return _parse(resp)
if resp.status_code in (429, 404):
time.sleep(1)
continue
return {}
except requests.RequestException:
return {}
return {}
def _parse(resp) -> dict:
try:
text = (
resp.json()
.get("candidates", [{}])[0]
.get("content", {})
.get("parts", [{}])[0]
.get("text", "")
.strip()
)
if text.startswith("```"):
text = text.split("\n", 1)[-1].rsplit("```", 1)[0]
return json.loads(text)
except (json.JSONDecodeError, KeyError):
return {}
````
***
### 步骤 5: 构建AI管道(批量)
```python
# ai/pipeline.py
import json
import yaml
from pathlib import Path
from ai.client import generate
def analyse_batch(items: list[dict], context: str = "", preference_prompt: str = "") -> list[dict]:
"""Analyse items in batches. Returns items enriched with AI fields."""
config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
model = config.get("ai", {}).get("model", "gemini-2.5-flash")
rate_limit = config.get("ai", {}).get("rate_limit_seconds", 7.0)
min_score = config.get("ai", {}).get("min_score", 0)
batch_size = config.get("ai", {}).get("batch_size", 5)
batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
print(f" [AI] {len(items)} items → {len(batches)} API calls")
enriched = []
for i, batch in enumerate(batches):
print(f" [AI] Batch {i + 1}/{len(batches)}...")
prompt = _build_prompt(batch, context, preference_prompt, config)
result = generate(prompt, model=model, rate_limit=rate_limit)
analyses = result.get("analyses", [])
for j, item in enumerate(batch):
ai = analyses[j] if j < len(analyses) else {}
if ai:
score = max(0, min(100, int(ai.get("score", 0))))
if min_score and score < min_score:
continue
enriched.append({**item, "ai_score": score, "ai_summary": ai.get("summary", ""), "ai_notes": ai.get("notes", "")})
else:
enriched.append(item)
return enriched
def _build_prompt(batch, context, preference_prompt, config):
priorities = config.get("priorities", [])
items_text = "\n\n".join(
f"Item {i+1}: {json.dumps({k: v for k, v in item.items() if not k.startswith('_')})}"
for i, item in enumerate(batch)
)
return f"""Analyse these {len(batch)} items and return a JSON object.
# Items
{items_text}
# User Context
{context[:800] if context else "Not provided"}
# User Priorities
{chr(10).join(f"- {p}" for p in priorities)}
{preference_prompt}
# Instructions
Return: {{"analyses": [{{"score": <0-100>, "summary": "<2 sentences>", "notes": "<why this matches or doesn't>"}} for each item in order]}}
Be concise. Score 90+=excellent match, 70-89=good, 50-69=ok, <50=weak."""
```
***
### 步骤 6: 构建反馈学习系统
```python
# ai/memory.py
"""Learn from user decisions to improve future scoring."""
import json
from pathlib import Path
FEEDBACK_PATH = Path(__file__).parent.parent / "data" / "feedback.json"
def load_feedback() -> dict:
if FEEDBACK_PATH.exists():
try:
return json.loads(FEEDBACK_PATH.read_text())
except (json.JSONDecodeError, OSError):
pass
return {"positive": [], "negative": []}
def save_feedback(fb: dict):
FEEDBACK_PATH.parent.mkdir(parents=True, exist_ok=True)
FEEDBACK_PATH.write_text(json.dumps(fb, indent=2))
def build_preference_prompt(feedback: dict, max_examples: int = 15) -> str:
"""Convert feedback history into a prompt bias section."""
lines = []
if feedback.get("positive"):
lines.append("# Items the user LIKED (positive signal):")
for e in feedback["positive"][-max_examples:]:
lines.append(f"- {e}")
if feedback.get("negative"):
lines.append("\n# Items the user SKIPPED/REJECTED (negative signal):")
for e in feedback["negative"][-max_examples:]:
lines.append(f"- {e}")
if lines:
lines.append("\nUse these patterns to bias scoring on new items.")
return "\n".join(lines)
```
**与存储层集成:** 每次运行后,从数据库中查询具有正面/负面状态的项,并使用提取的模式调用 `save_feedback()`。
***
### 步骤 7: 构建存储(Notion示例)
```python
# storage/notion_sync.py
import os
from notion_client import Client
from notion_client.errors import APIResponseError
_client = None
def get_client():
global _client
if _client is None:
_client = Client(auth=os.environ["NOTION_TOKEN"])
return _client
def get_existing_urls(db_id: str) -> set[str]:
"""Fetch all URLs already stored — used for deduplication."""
client, seen, cursor = get_client(), set(), None
while True:
resp = client.databases.query(database_id=db_id, page_size=100, **{"start_cursor": cursor} if cursor else {})
for page in resp["results"]:
url = page["properties"].get("URL", {}).get("url", "")
if url: seen.add(url)
if not resp["has_more"]: break
cursor = resp["next_cursor"]
return seen
def push_item(db_id: str, item: dict) -> bool:
"""Push one item to Notion. Returns True on success."""
props = {
"Name": {"title": [{"text": {"content": item.get("name", "")[:100]}}]},
"URL": {"url": item.get("url")},
"Source": {"select": {"name": item.get("source", "Unknown")}},
"Date Found": {"date": {"start": item.get("date_found")}},
"Status": {"select": {"name": "New"}},
}
# AI fields
if item.get("ai_score") is not None:
props["AI Score"] = {"number": item["ai_score"]}
if item.get("ai_summary"):
props["Summary"] = {"rich_text": [{"text": {"content": item["ai_summary"][:2000]}}]}
if item.get("ai_notes"):
props["Notes"] = {"rich_text": [{"text": {"content": item["ai_notes"][:2000]}}]}
try:
get_client().pages.create(parent={"database_id": db_id}, properties=props)
return True
except APIResponseError as e:
print(f"[notion] Push failed: {e}")
return False
def sync(db_id: str, items: list[dict]) -> tuple[int, int]:
existing = get_existing_urls(db_id)
added = skipped = 0
for item in items:
if item.get("url") in existing:
skipped += 1; continue
if push_item(db_id, item):
added += 1; existing.add(item["url"])
else:
skipped += 1
return added, skipped
```
***
### 步骤 8: 在 main.py 中编排
```python
# scraper/main.py
import os, sys, yaml
from pathlib import Path
from dotenv import load_dotenv
load_dotenv()
from scraper.sources import my_source # add your sources
# NOTE: This example uses Notion. If storage.provider is "sheets" or "supabase",
# replace this import with storage.sheets_sync or storage.supabase_sync and update
# the env var and sync() call accordingly.
from storage.notion_sync import sync
SOURCES = [
("My Source", my_source.fetch),
]
def ai_enabled():
return bool(os.environ.get("GEMINI_API_KEY"))
def main():
config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
provider = config.get("storage", {}).get("provider", "notion")
# Resolve the storage target identifier from env based on provider
if provider == "notion":
db_id = os.environ.get("NOTION_DATABASE_ID")
if not db_id:
print("ERROR: NOTION_DATABASE_ID not set"); sys.exit(1)
else:
# Extend here for sheets (SHEET_ID) or supabase (SUPABASE_TABLE) etc.
print(f"ERROR: provider '{provider}' not yet wired in main.py"); sys.exit(1)
config = yaml.safe_load((Path(__file__).parent.parent / "config.yaml").read_text())
all_items = []
for name, fetch_fn in SOURCES:
try:
items = fetch_fn()
print(f"[{name}] {len(items)} items")
all_items.extend(items)
except Exception as e:
print(f"[{name}] FAILED: {e}")
# Deduplicate by URL
seen, deduped = set(), []
for item in all_items:
if (url := item.get("url", "")) and url not in seen:
seen.add(url); deduped.append(item)
print(f"Unique items: {len(deduped)}")
if ai_enabled() and deduped:
from ai.memory import load_feedback, build_preference_prompt
from ai.pipeline import analyse_batch
# load_feedback() reads data/feedback.json written by your feedback sync script.
# To keep it current, implement a separate feedback_sync.py that queries your
# storage provider for items with positive/negative statuses and calls save_feedback().
feedback = load_feedback()
preference = build_preference_prompt(feedback)
context_path = Path(__file__).parent.parent / "profile" / "context.md"
context = context_path.read_text() if context_path.exists() else ""
deduped = analyse_batch(deduped, context=context, preference_prompt=preference)
else:
print("[AI] Skipped — GEMINI_API_KEY not set")
added, skipped = sync(db_id, deduped)
print(f"Done — {added} new, {skipped} existing")
if __name__ == "__main__":
main()
```
***
### 步骤 9: GitHub Actions工作流
```yaml
# .github/workflows/scraper.yml
name: Data Scraper Agent
on:
schedule:
- cron: "0 */3 * * *" # every 3 hours — adjust to your needs
workflow_dispatch: # allow manual trigger
permissions:
contents: write # required for the feedback-history commit step
jobs:
scrape:
runs-on: ubuntu-latest
timeout-minutes: 20
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
cache: "pip"
- run: pip install -r requirements.txt
# Uncomment if Playwright is enabled in requirements.txt
# - name: Install Playwright browsers
# run: python -m playwright install chromium --with-deps
- name: Run agent
env:
NOTION_TOKEN: ${{ secrets.NOTION_TOKEN }}
NOTION_DATABASE_ID: ${{ secrets.NOTION_DATABASE_ID }}
GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
run: python -m scraper.main
- name: Commit feedback history
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add data/feedback.json || true
git diff --cached --quiet || git commit -m "chore: update feedback history"
git push
```
***
### 步骤 10: config.yaml 模板
```yaml
# Customise this file — no code changes needed
# What to collect (pre-filter before AI)
filters:
required_keywords: [] # item must contain at least one
blocked_keywords: [] # item must not contain any
# Your priorities — AI uses these for scoring
priorities:
- "example priority 1"
- "example priority 2"
# Storage
storage:
provider: "notion" # notion | sheets | supabase | sqlite
# Feedback learning
feedback:
positive_statuses: ["Saved", "Applied", "Interested"]
negative_statuses: ["Skip", "Rejected", "Not relevant"]
# AI settings
ai:
enabled: true
model: "gemini-2.5-flash"
min_score: 0 # filter out items below this score
rate_limit_seconds: 7 # seconds between API calls
batch_size: 5 # items per API call
```
***
## 常见抓取模式
### 模式 1: REST API(最简单)
```python
resp = requests.get(url, params={"q": query}, headers=HEADERS, timeout=15)
items = resp.json().get("results", [])
```
### 模式 2: HTML抓取
```python
soup = BeautifulSoup(resp.text, "lxml")
for card in soup.select(".listing-card"):
title = card.select_one("h2").get_text(strip=True)
href = card.select_one("a")["href"]
```
### 模式 3: RSS源
```python
import xml.etree.ElementTree as ET
root = ET.fromstring(resp.text)
for item in root.findall(".//item"):
title = item.findtext("title", "")
link = item.findtext("link", "")
pub_date = item.findtext("pubDate", "")
```
### 模式 4: 分页API
```python
page = 1
while True:
resp = requests.get(url, params={"page": page, "limit": 50}, timeout=15)
data = resp.json()
items = data.get("results", [])
if not items:
break
for item in items:
results.append(_normalise(item))
if not data.get("has_more"):
break
page += 1
```
### 模式 5: JS渲染页面(Playwright)
```python
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
browser = p.chromium.launch()
page = browser.new_page()
page.goto(url)
page.wait_for_selector(".listing")
html = page.content()
browser.close()
soup = BeautifulSoup(html, "lxml")
```
***
## 需要避免的反模式
| 反模式 | 问题 | 修复方法 |
|---|---|---|
| 每个项目调用一次LLM | 立即达到速率限制 | 每次调用批量处理5个项目 |
| 代码中硬编码关键字 | 不可重用 | 将所有配置移动到 `config.yaml` |
| 没有速率限制的抓取 | IP被禁止 | 在请求之间添加 `time.sleep(1)` |
| 在代码中存储密钥 | 安全风险 | 始终使用 `.env` + GitHub Secrets |
| 没有去重 | 重复行堆积 | 在推送前始终检查URL |
| 忽略 `robots.txt` | 法律/道德风险 | 遵守爬虫规则;尽可能使用公共API |
| 使用 `requests` 处理JS渲染的网站 | 空响应 | 使用Playwright或查找底层API |
| `maxOutputTokens` 太低 | JSON截断,解析错误 | 对批量响应使用2048+ |
***
## 免费层级限制参考
| 服务 | 免费限制 | 典型用法 |
|---|---|---|
| Gemini Flash Lite | 30 RPM, 1500 RPD | 以3小时间隔约56次请求/天 |
| Gemini 2.0 Flash | 15 RPM, 1500 RPD | 良好的后备选项 |
| Gemini 2.5 Flash | 10 RPM, 500 RPD | 谨慎使用 |
| GitHub Actions | 无限(公共仓库) | 约20分钟/天 |
| Notion API | 无限 | 约200次写入/天 |
| Supabase | 500MB DB, 2GB传输 | 适用于大多数代理 |
| Google Sheets API | 300次请求/分钟 | 适用于小型代理 |
***
## 需求模板
```
requests==2.31.0
beautifulsoup4==4.12.3
lxml==5.1.0
python-dotenv==1.0.1
pyyaml==6.0.2
notion-client==2.2.1 # 如需使用 Notion
# playwright==1.40.0 # 针对 JS 渲染的站点,请取消注释
```
***
## 质量检查清单
在将代理标记为完成之前:
* \[ ] `config.yaml` 控制所有面向用户的设置 — 没有硬编码的值
* \[ ] `profile/context.md` 保存用于AI匹配的用户特定上下文
* \[ ] 在每次存储推送前通过URL进行去重
* \[ ] Gemini客户端具有模型后备链(4个模型)
* \[ ] 批量大小 ≤ 每个API调用5个项目
* \[ ] `maxOutputTokens` ≥ 2048
* \[ ] `.env` 在 `.gitignore` 中
* \[ ] 提供了用于入门的 `.env.example`
* \[ ] `setup.py` 在首次运行时创建数据库模式
* \[ ] `enrich_existing.py` 回填旧行的AI分数
* \[ ] GitHub Actions工作流在每次运行后提交 `feedback.json`
* \[ ] README涵盖:在<5分钟内设置,所需的密钥,自定义
***
## 真实世界示例
```
"为我构建一个监控 Hacker News 上 AI 初创公司融资新闻的智能体"
"从 3 家电商网站抓取产品价格并在降价时发出提醒"
"追踪标记有 'llm' 或 'agents' 的新 GitHub 仓库——并为每个仓库生成摘要"
"将 LinkedIn 和 Cutshort 上的首席运营官职位列表收集到 Notion 中"
"监控一个提到我公司的 subreddit 帖子——并进行情感分类"
"每日从 arXiv 抓取我关注主题的新学术论文"
"追踪体育赛事结果并在 Google Sheets 中维护动态更新的表格"
"构建一个房地产房源监控器——在新房源价格低于 1 千万卢比时发出提醒"
```
***
## 参考实现
一个使用此确切架构构建的完整工作代理将抓取4+个数据源,
批量处理Gemini调用,从存储在Notion中的"已应用"/"已拒绝"决策中学习,并且
在GitHub Actions上100%免费运行。按照上述步骤1-9构建您自己的代理。Related Skills
junta-leiloeiros
Coleta e consulta dados de leiloeiros oficiais de todas as 27 Juntas Comerciais do Brasil. Scraper multi-UF, banco SQLite, API FastAPI e exportacao CSV/JSON.
database-migrations
Database migration best practices for schema changes, data migrations, rollbacks, and zero-downtime deployments across PostgreSQL, MySQL, and common ORMs (Prisma, Drizzle, Django, TypeORM, golang-migrate). Use when planning or implementing database schema changes.
workspace-surface-audit
Audit the active repo, MCP servers, plugins, connectors, env surfaces, and harness setup, then recommend the highest-value ECC-native skills, hooks, agents, and operator workflows. Use when the user wants help setting up Claude Code or understanding what capabilities are actually available in their environment.
ui-demo
Record polished UI demo videos using Playwright. Use when the user asks to create a demo, walkthrough, screen recording, or tutorial video of a web application. Produces WebM videos with visible cursor, natural pacing, and professional feel.
token-budget-advisor
Offers the user an informed choice about how much response depth to consume before answering. Use this skill when the user explicitly wants to control response length, depth, or token budget. TRIGGER when: "token budget", "token count", "token usage", "token limit", "response length", "answer depth", "short version", "brief answer", "detailed answer", "exhaustive answer", "respuesta corta vs larga", "cuántos tokens", "ahorrar tokens", "responde al 50%", "dame la versión corta", "quiero controlar cuánto usas", or clear variants where the user is explicitly asking to control answer size or depth. DO NOT TRIGGER when: user has already specified a level in the current session (maintain it), the request is clearly a one-word answer, or "token" refers to auth/session/payment tokens rather than response size.
skill-comply
Visualize whether skills, rules, and agent definitions are actually followed — auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines
santa-method
Multi-agent adversarial verification with convergence loop. Two independent review agents must both pass before output ships.
safety-guard
Use this skill to prevent destructive operations when working on production systems or running agents autonomously.
repo-scan
Cross-stack source code asset audit — classifies every file, detects embedded third-party libraries, and delivers actionable four-level verdicts per module with interactive HTML reports.
project-flow-ops
Operate execution flow across GitHub and Linear by triaging issues and pull requests, linking active work, and keeping GitHub public-facing while Linear remains the internal execution layer. Use when the user wants backlog control, PR triage, or GitHub-to-Linear coordination.
product-lens
Use this skill to validate the "why" before building, run product diagnostics, and pressure-test product direction before the request becomes an implementation contract.
openclaw-persona-forge
为 OpenClaw AI Agent 锻造完整的龙虾灵魂方案。根据用户偏好或随机抽卡, 输出身份定位、灵魂描述(SOUL.md)、角色化底线规则、名字和头像生图提示词。 如当前环境提供已审核的生图 skill,可自动生成统一风格头像图片。 当用户需要创建、设计或定制 OpenClaw 龙虾灵魂时使用。 不适用于:微调已有 SOUL.md、非 OpenClaw 平台的角色设计、纯工具型无性格 Agent。 触发词:龙虾灵魂、虾魂、OpenClaw 灵魂、养虾灵魂、龙虾角色、龙虾定位、 龙虾剧本杀角色、龙虾游戏角色、龙虾 NPC、龙虾性格、龙虾背景故事、 lobster soul、lobster character、抽卡、随机龙虾、龙虾 SOUL、gacha。