firecrawl-policy-guardrails
Implement Firecrawl scraping policy enforcement: domain blocklists, credit budgets, content filtering, and robots.txt compliance guardrails. Use when setting up scraping policies, enforcing crawl limits, or preventing accidental scraping of prohibited domains. Trigger with phrases like "firecrawl policy", "firecrawl guardrails", "firecrawl domain blocklist", "firecrawl scraping rules", "firecrawl compliance".
Best use case
firecrawl-policy-guardrails is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using firecrawl-policy-guardrails can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/firecrawl-policy-guardrails/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
SKILL.md Source
# Firecrawl Policy Guardrails
## Overview
Automated guardrails for Firecrawl scraping pipelines. Web scraping carries legal (robots.txt, ToS), ethical (rate limiting, attribution), and cost (credit burn) risks. This skill implements domain blocklists, credit budgets, content quality gates, and per-domain rate limits as enforceable policies.
## Instructions
### Step 1: Domain Policy Enforcement
```typescript
import FirecrawlApp from "@mendable/firecrawl-js";

const firecrawl = new FirecrawlApp({
  apiKey: process.env.FIRECRAWL_API_KEY!,
});

// Thrown whenever a request breaks a scraping policy
class PolicyViolation extends Error {
  constructor(message: string) {
    super(message);
    this.name = "PolicyViolation";
  }
}

class ScrapePolicy {
  // Domains that explicitly prohibit scraping in their ToS
  static BLOCKED_DOMAINS = [
    "facebook.com", "instagram.com", // Meta ToS
    "linkedin.com",                  // LinkedIn ToS
    "twitter.com", "x.com",          // X/Twitter ToS
  ];

  // Domains with sensitive/regulated content
  static SENSITIVE_DOMAINS = [
    "*.gov", "*.mil", // Government
    "*.edu",          // Educational (FERPA)
  ];

  // Throws PolicyViolation for blocked domains; only warns for sensitive ones
  static validateUrl(url: string): void {
    const hostname = new URL(url).hostname;

    for (const blocked of this.BLOCKED_DOMAINS) {
      // Match the domain itself or any of its subdomains
      if (hostname === blocked || hostname.endsWith(`.${blocked}`)) {
        throw new PolicyViolation(`Domain "${hostname}" is blocked: ToS prohibits scraping`);
      }
    }

    for (const pattern of this.SENSITIVE_DOMAINS) {
      // "*.gov" matches any hostname ending in ".gov"
      if (hostname.endsWith(pattern.slice(1))) {
        console.warn(`CAUTION: "${hostname}" matches sensitive domain pattern "${pattern}"`);
      }
    }
  }
}
```
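To see why the blocklist check compares on a domain-label boundary rather than with a plain substring test, the matching logic can be isolated as in the minimal sketch below. `matchesDomain` and `matchesWildcard` are hypothetical helper names introduced here for illustration; they are not part of the Firecrawl SDK.

```typescript
// Hypothetical helper: true when hostname is the blocked domain itself
// or any subdomain of it.
function matchesDomain(hostname: string, blocked: string): boolean {
  const host = hostname.toLowerCase();
  return host === blocked || host.endsWith(`.${blocked}`);
}

// Hypothetical helper: supports patterns of the form "*.tld" only.
function matchesWildcard(hostname: string, pattern: string): boolean {
  const host = hostname.toLowerCase();
  if (!pattern.startsWith("*.")) return host === pattern;
  return host.endsWith(pattern.slice(1)); // "*.gov" -> ".gov"
}

const checks: boolean[] = [
  matchesDomain("www.facebook.com", "facebook.com"), // subdomain of a blocked domain
  matchesDomain("notfacebook.com", "facebook.com"),  // lookalike, different domain
  matchesDomain(new URL("https://X.com/home").hostname, "x.com"), // URL parsing lowercases the host
  matchesWildcard("data.census.gov", "*.gov"),
  matchesWildcard("gov.example.com", "*.gov"),       // "gov" is not the TLD here
];

console.log(checks.join(", ")); // true, false, true, true, false
```

A substring test such as `hostname.includes("facebook.com")` would also block `notfacebook.com`, which is why the suffix is anchored on a leading dot.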
### Step 2: Credit Budget Enforcement
```typescript
class CrawlBudget {
  // Pages scraped per UTC calendar day, keyed by "YYYY-MM-DD"
  private usage = new Map<string, number>();
  private dailyLimit: number;

  constructor(dailyLimit = 5000) {
    this.dailyLimit = dailyLimit;
  }

  // Throws if the estimated pages would push today's usage over the limit
  authorize(estimatedPages: number): void {
    const today = new Date().toISOString().split("T")[0];
    const used = this.usage.get(today) || 0;
    if (used + estimatedPages > this.dailyLimit) {
      throw new PolicyViolation(
        `Daily credit limit would be exceeded: ${used} used + ${estimatedPages} requested > ${this.dailyLimit} limit`
      );
    }
  }

  // Call after each scrape or crawl with the number of pages actually returned
  record(pagesScraped: number) {
    const today = new Date().toISOString().split("T")[0];
    this.usage.set(today, (this.usage.get(today) || 0) + pagesScraped);
  }
}

const budget = new CrawlBudget(5000);
```
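Because the budget keys usage by `toISOString()` date, the counter resets at 00:00 UTC rather than at local midnight. The sketch below makes that rollover visible by injecting a clock. `DailyBudget`, `tryAuthorize`, and `remaining` are hypothetical names for this illustration, and returning a boolean instead of throwing is a design assumption, not the skill's behavior.

```typescript
// Hypothetical, self-contained variant of the daily budget with an injectable clock.
class DailyBudget {
  private usage = new Map<string, number>();
  private dailyLimit: number;
  private now: () => Date;

  constructor(dailyLimit: number, now: () => Date = () => new Date()) {
    this.dailyLimit = dailyLimit;
    this.now = now;
  }

  // UTC calendar day, e.g. "2025-01-31"
  private today(): string {
    return this.now().toISOString().slice(0, 10);
  }

  remaining(): number {
    return this.dailyLimit - (this.usage.get(this.today()) ?? 0);
  }

  // Returns false instead of throwing so the caller decides how to react.
  tryAuthorize(estimatedPages: number): boolean {
    return estimatedPages <= this.remaining();
  }

  record(pagesScraped: number): void {
    const day = this.today();
    // First record of a new day: drop older counters so the map stays small.
    if (!this.usage.has(day)) this.usage.clear();
    this.usage.set(day, (this.usage.get(day) ?? 0) + pagesScraped);
  }
}

let clock = new Date("2025-01-31T23:59:00Z");
const demoBudget = new DailyBudget(100, () => clock);

demoBudget.record(90);
const beforeMidnight = demoBudget.tryAuthorize(20); // 90 + 20 exceeds 100

clock = new Date("2025-02-01T00:01:00Z");
const afterMidnight = demoBudget.tryAuthorize(20);  // new UTC day, nothing used yet

console.log(beforeMidnight, afterMidnight); // false true
```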
### Step 3: Content Quality Gate
```typescript
function validateScrapedContent(result: any): {
  accepted: boolean;
  reason?: string;
} {
  const md = result.markdown || "";

  // Reject thin content
  if (md.length < 50) {
    return { accepted: false, reason: "Content too short (<50 chars)" };
  }

  // Reject error pages
  if (/403 forbidden|access denied|captcha/i.test(md)) {
    return { accepted: false, reason: "Error page detected" };
  }

  // Reject login walls
  if (/sign in to continue|create an account|login required/i.test(md)) {
    return { accepted: false, reason: "Login wall detected" };
  }

  // Reject cookie consent pages (only content is cookie notice)
  if (md.length < 500 && /cookie|consent|gdpr/i.test(md)) {
    return { accepted: false, reason: "Cookie consent page only" };
  }

  return { accepted: true };
}
```
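The error-page and login-wall patterns above match anywhere in the text, so a long article that merely mentions "captcha" or "create an account" would be rejected. One possible refinement, sketched below, is a rule table in which every pattern only applies to pages under a length threshold. The names `RULES` and `gate` are hypothetical, and the 2000-character threshold is an assumption (block pages tend to be short), not a value from the skill.

```typescript
type GateResult = { accepted: boolean; reason?: string };

// Hypothetical rule table: a rule only fires on pages shorter than maxLength.
const RULES: { reason: string; pattern: RegExp; maxLength: number }[] = [
  { reason: "Error page detected", pattern: /403 forbidden|access denied|captcha/i, maxLength: 2000 },
  { reason: "Login wall detected", pattern: /sign in to continue|create an account|login required/i, maxLength: 2000 },
  { reason: "Cookie consent page only", pattern: /cookie|consent|gdpr/i, maxLength: 500 },
];

function gate(markdown: string): GateResult {
  if (markdown.length < 50) {
    return { accepted: false, reason: "Content too short (<50 chars)" };
  }
  for (const rule of RULES) {
    if (markdown.length < rule.maxLength && rule.pattern.test(markdown)) {
      return { accepted: false, reason: rule.reason };
    }
  }
  return { accepted: true };
}

// A short block page versus a long article that only mentions CAPTCHAs
const blockPage = "403 Forbidden. You do not have permission to access this resource on this server.";
const longArticle = "How CAPTCHA systems work. " + "Lorem ipsum dolor sit amet. ".repeat(100);

console.log(gate(blockPage).reason);      // Error page detected
console.log(gate(longArticle).accepted);  // true
```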
### Step 4: Crawl Limit Enforcement
```typescript
const MAX_CRAWL_LIMIT = 500;
const MAX_DEPTH = 5;

async function policedCrawl(url: string, requestedLimit: number) {
  // Validate URL
  ScrapePolicy.validateUrl(url);

  // Enforce hard limits
  const limit = Math.min(requestedLimit, MAX_CRAWL_LIMIT);
  if (requestedLimit > MAX_CRAWL_LIMIT) {
    console.warn(`Crawl limit capped: ${requestedLimit} -> ${MAX_CRAWL_LIMIT}`);
  }

  // Check budget
  budget.authorize(limit);

  // Execute with enforced limits
  const result = await firecrawl.crawlUrl(url, {
    limit,
    maxDepth: MAX_DEPTH,
    scrapeOptions: { formats: ["markdown"], onlyMainContent: true },
  });

  // crawlUrl can resolve to an error response instead of throwing
  // (assumes the SDK response carries a `success` flag and an `error` message)
  if (!result.success) {
    throw new Error(`Crawl failed: ${result.error}`);
  }

  // Record actual usage
  const pagesScraped = result.data?.length || 0;
  budget.record(pagesScraped);

  // Filter by content quality
  const validPages = (result.data || []).filter(page => {
    const { accepted, reason } = validateScrapedContent(page);
    if (!accepted) console.log(`Rejected: ${page.metadata?.sourceURL} - ${reason}`);
    return accepted;
  });

  console.log(`Crawl: ${pagesScraped} scraped, ${validPages.length} accepted, ${pagesScraped - validPages.length} rejected`);
  return validPages;
}
```
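`policedCrawl` rejects the whole crawl when the capped limit does not fit the remaining budget, and it passes `requestedLimit` through without checking that it is a usable number (a `NaN` limit survives `Math.min`). A minimal sketch of an alternative, assuming a smaller crawl is preferable to none: clamp the limit to every constraint at once. `resolveCrawlLimit` is a hypothetical helper, not part of the skill or the SDK.

```typescript
// Hypothetical helper: the largest page limit that satisfies the request,
// the hard cap, and the remaining daily budget.
function resolveCrawlLimit(requested: number, hardCap: number, remainingBudget: number): number {
  if (!Number.isInteger(requested) || requested < 1) {
    throw new RangeError(`requested limit must be a positive integer, got ${requested}`);
  }
  // A result of 0 means the budget is exhausted and the crawl should be skipped.
  return Math.max(0, Math.min(requested, hardCap, remainingBudget));
}

console.log(resolveCrawlLimit(2000, 500, 5000)); // 500, capped by the hard limit
console.log(resolveCrawlLimit(200, 500, 120));   // 120, capped by the remaining budget
```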
### Step 5: Per-Domain Rate Limiting
```typescript
const DOMAIN_RATE_LIMITS: Record<string, number> = {
  "docs.example.com": 2, // 2 requests/second
  "blog.example.com": 1, // 1 request/second
  default: 5,            // 5 requests/second
};

// Timestamp (ms) of the most recent request per domain
const lastRequest = new Map<string, number>();

async function rateLimitedScrape(url: string) {
  const domain = new URL(url).hostname;
  const rate = DOMAIN_RATE_LIMITS[domain] || DOMAIN_RATE_LIMITS.default;
  const minInterval = 1000 / rate;

  // Wait out the remainder of the interval since the last request
  const last = lastRequest.get(domain) || 0;
  const elapsed = Date.now() - last;
  if (elapsed < minInterval) {
    await new Promise(r => setTimeout(r, minInterval - elapsed));
  }

  lastRequest.set(domain, Date.now());
  return firecrawl.scrapeUrl(url, { formats: ["markdown"] });
}
```
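The limiter above reads the last request time before it waits, which is fine for a sequential loop. Callers that arrive concurrently for the same domain, however, would all compute the same delay and fire together. If concurrent use is expected, one option is to reserve the next slot before awaiting. The sketch below shows only the scheduling arithmetic, with the current time passed in so it can be checked without timers. `SlotScheduler` and `reserve` are hypothetical names.

```typescript
// Hypothetical scheduler: hands out the earliest permitted start time per domain.
class SlotScheduler {
  private nextFree = new Map<string, number>();
  private minIntervalMs: number;

  constructor(requestsPerSecond: number) {
    this.minIntervalMs = 1000 / requestsPerSecond;
  }

  // Returns how long the caller should wait (ms). The slot is reserved immediately,
  // so callers arriving at the same instant receive distinct slots.
  reserve(domain: string, nowMs: number): number {
    const slot = Math.max(nowMs, this.nextFree.get(domain) ?? 0);
    this.nextFree.set(domain, slot + this.minIntervalMs);
    return slot - nowMs;
  }
}

const scheduler = new SlotScheduler(2); // 2 requests/second -> 500 ms apart
const t0 = 1000000;                     // arbitrary fixed "now" for the demo

const waits: number[] = [
  scheduler.reserve("docs.example.com", t0),
  scheduler.reserve("docs.example.com", t0),
  scheduler.reserve("docs.example.com", t0),
  scheduler.reserve("blog.example.com", t0), // other domains are independent
];

console.log(waits.join(", ")); // 0, 500, 1000, 0
```

A caller would pass `Date.now()` as `nowMs`, sleep for the returned number of milliseconds, and then issue the scrape.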
## Policy Summary
| Policy | Enforcement | Consequence |
|--------|-------------|-------------|
| Domain blocklist | Pre-request check | Request rejected with PolicyViolation |
| Credit budget | Pre-request check | Request rejected if over daily limit |
| Crawl limit | Hard cap at 500 | Limit capped and a warning logged |
| Content quality | Post-scrape filter | Invalid pages excluded from results |
| Per-domain rate | Pre-request delay | Automatic throttling |
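The summary has no row for robots.txt, although the overview lists it among the legal risks. If a pre-request robots.txt check is wanted, one possible shape is sketched below. It is a simplified reading of the Robots Exclusion Protocol: prefix rules only, no `*` or `$` wildcards, longest matching rule wins, and `Allow` wins ties. `isAllowedByRobots` is a hypothetical helper, and fetching the robots.txt file is left to the caller.

```typescript
type RobotsRule = { allow: boolean; prefix: string };
type RobotsGroup = { agents: string[]; rules: RobotsRule[] };

// Hypothetical helper: simplified robots.txt evaluation (prefix rules only).
function isAllowedByRobots(robotsTxt: string, userAgent: string, path: string): boolean {
  const ua = userAgent.toLowerCase();
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      // Consecutive User-agent lines share one group
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === "allow" || field === "disallow") && current) {
      // An empty Disallow value means "nothing is disallowed"
      if (value !== "") current.rules.push({ allow: field === "allow", prefix: value });
      lastWasAgent = false;
    }
  }

  // Prefer groups naming this agent; otherwise fall back to "*"
  const specific = groups.filter(g => g.agents.some(a => a !== "*" && a !== "" && ua.includes(a)));
  const chosen = specific.length > 0 ? specific : groups.filter(g => g.agents.some(a => a === "*"));

  let best: RobotsRule | null = null;
  for (const group of chosen) {
    for (const rule of group.rules) {
      if (!path.startsWith(rule.prefix)) continue;
      const longer = best === null || rule.prefix.length > best.prefix.length;
      const tieAllow = best !== null && rule.prefix.length === best.prefix.length && rule.allow;
      if (longer || tieAllow) best = rule;
    }
  }
  return best === null ? true : best.allow;
}

const robots = [
  "User-agent: *",
  "Disallow: /private/",
  "Allow: /private/press/",
  "",
  "User-agent: examplebot",
  "Disallow: /",
].join("\n");

console.log(isAllowedByRobots(robots, "my-crawler", "/docs/intro"));         // true
console.log(isAllowedByRobots(robots, "my-crawler", "/private/data"));       // false
console.log(isAllowedByRobots(robots, "my-crawler", "/private/press/2024")); // true
console.log(isAllowedByRobots(robots, "ExampleBot/1.0", "/docs/intro"));     // false
```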
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| PolicyViolation: domain is blocked | URL is on the domain blocklist | Remove from scrape targets |
| PolicyViolation: daily credit limit would be exceeded | Heavy scraping day | Increase daily limit or wait until the next UTC day |
| Many rejected pages | Error/login pages | Check target site, adjust URL patterns |
| Slow scraping | Per-domain rate limit | Expected behavior, protects target site |
## Examples
### Policy-Checked Pipeline
```typescript
async function scrapePipeline(urls: string[]) {
  const results = [];
  for (const url of urls) {
    try {
      // Pre-request checks: blocklist, then budget
      ScrapePolicy.validateUrl(url);
      budget.authorize(1);

      const result = await rateLimitedScrape(url);

      // Post-scrape check: keep only pages that pass the quality gate
      const { accepted } = validateScrapedContent(result);
      if (accepted) results.push(result);
      budget.record(1);
    } catch (e) {
      if (e instanceof PolicyViolation) {
        console.warn(`Policy: ${e.message}`);
      } else {
        console.error(`Error: ${(e as Error).message}`);
      }
    }
  }
  return results;
}
```
## Resources
- [Firecrawl Docs](https://docs.firecrawl.dev)
- [robots.txt Spec](https://www.robotstxt.org/robotstxt.html)
- [Web Scraping Legal Guide](https://www.eff.org/issues/web-scraping)
## Next Steps
For architecture patterns, see `firecrawl-architecture-variants`.