spacy-nlp

Natural language processing with spaCy. Use when: (1) named entity recognition, (2) POS tagging and dependency parsing, (3) text tokenization and linguistic analysis, (4) rule-based pattern matching, (5) custom NER pipelines. NOT for: LLM inference or text generation (use transformers), sentiment analysis at scale (use dedicated models), or machine translation.

564 stars

by beita6969

Best use case

spacy-nlp is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using spacy-nlp can expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/spacy-nlp/SKILL.md --create-dirs "https://raw.githubusercontent.com/beita6969/ScienceClaw/main/skills/spacy-nlp/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/spacy-nlp/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How spacy-nlp Compares

Feature / Agent	spacy-nlp	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

It gives an AI agent a reusable spaCy workflow for tokenization, named entity recognition, POS tagging, dependency parsing, and rule-based pattern matching.

Where can I find the source code?

The source code is on GitHub in the beita6969/ScienceClaw repository, at skills/spacy-nlp/SKILL.md.

SKILL.md Source

# spaCy NLP

Natural language processing using spaCy for tokenization, named entity
recognition, dependency parsing, and linguistic analysis.

## When to Use

- Named entity recognition (NER) on text
- Part-of-speech tagging and morphological analysis
- Dependency parsing and syntactic analysis
- Rule-based pattern matching in text
- Custom NER pipeline creation
- Tokenization and sentence segmentation
- Lemmatization and linguistic feature extraction

## When NOT to Use

- LLM inference or text generation (use transformers/huggingface)
- Sentiment analysis at scale (use fine-tuned classifiers)
- Machine translation (use dedicated MT models)
- Topic modeling (use gensim or sklearn)
- Simple regex-only text search (use re module)

## Setup and Model Download

```bash
# Install spaCy itself (once per environment)
python3 -m pip install -U spacy

# Download models (run once before using)
python3 -m spacy download en_core_web_sm    # small, fast, ~12MB
python3 -m spacy download en_core_web_md    # medium with word vectors, ~40MB
python3 -m spacy download en_core_web_trf   # transformer-based, most accurate
```
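
Because each downloaded model is installed as a regular Python package, a script can check for it before calling `spacy.load`. The following is a minimal sketch under that assumption; `model_installed` is a hypothetical helper name, not part of spaCy, and it only checks that a package with that name is importable.

```python
import importlib.util


def model_installed(name: str) -> bool:
    """Return True if a package with this name is importable in the current environment."""
    return importlib.util.find_spec(name) is not None


# Report which of the models listed above are available
for model in ("en_core_web_sm", "en_core_web_md", "en_core_web_trf"):
    if model_installed(model):
        status = "installed"
    else:
        status = f"missing, run: python3 -m spacy download {model}"
    print(f"{model:18} {status}")
```

Running this check at startup gives a clearer message than the `OSError` that `spacy.load` raises when a model is missing.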

## Basic Pipeline

```python
import spacy

# Load a model
nlp = spacy.load("en_core_web_sm")

# Process text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")

# Token-level attributes
for token in doc:
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} {token.lemma_}")
# Apple        PROPN  nsubj      Apple
# is           AUX    aux        be
# looking      VERB   ROOT       look
# ...
```

## Named Entity Recognition

```python
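import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976.")

# Extract entities with their character offsets
for ent in doc.ents:
    print(f"{ent.text:25} {ent.label_:10} {ent.start_char}-{ent.end_char}")
# Illustrative output; exact spans and labels depend on the model version
# (the two place names may be returned as separate GPE entities):
# Apple Inc.                ORG        0-10
# Steve Jobs                PERSON     26-36
# Cupertino, California     GPE        40-61
# 1976                      DATE       65-69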

# Common entity labels: PERSON, ORG, GPE, DATE, MONEY, PRODUCT, EVENT, LOC

# Get explanation of labels
print(spacy.explain("GPE"))   # "Countries, cities, states"
```

## Dependency Parsing

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")

# Dependency tree
for token in doc:
    print(f"{token.text:10} --{token.dep_:10}--> {token.head.text}")

# Noun chunks (base noun phrases)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:25} root={chunk.root.text}, head={chunk.root.head.text}")
# The quick brown fox       root=fox, head=jumps
# the lazy dog              root=dog, head=over

# Find subject and object of a verb
for token in doc:
    if token.dep_ == "nsubj":
        print(f"Subject: {token.text} of verb: {token.head.text}")
    if token.dep_ == "dobj":
        print(f"Object: {token.text} of verb: {token.head.text}")
```

## Pattern Matching

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# Token-based pattern matching
matcher = Matcher(nlp.vocab)

# Pattern: one or more adjectives followed by one or more nouns.
# greedy="LONGEST" keeps only the longest match among overlapping candidates.
pattern = [{"POS": "ADJ", "OP": "+"}, {"POS": "NOUN", "OP": "+"}]
matcher.add("ADJ_NOUN", [pattern], greedy="LONGEST")

doc = nlp("The bright blue sky and cold winter morning greeted us.")
matches = matcher(doc)
for match_id, start, end in sorted(matches, key=lambda m: m[1]):
    span = doc[start:end]
    print(f"Match: {span.text}")
# Typical output (depends on the tagger's POS predictions):
# Match: bright blue sky
# Match: cold winter morning

# Phrase matching (exact phrase lookup, very fast)
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["machine learning", "deep learning", "natural language processing"]
patterns = [nlp.make_doc(term) for term in terms]
phrase_matcher.add("TECH_TERMS", patterns)

doc = nlp("This paper covers machine learning and natural language processing.")
matches = phrase_matcher(doc)
for match_id, start, end in matches:
    print(f"Found: {doc[start:end].text}")
```

## Custom Entity Rules

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Add entity ruler before the NER component
ruler = nlp.add_pipe("entity_ruler", before="ner")

patterns = [
    {"label": "DRUG", "pattern": "aspirin"},
    {"label": "DRUG", "pattern": [{"LOWER": "vitamin"}, {"LOWER": "d"}]},
    {"label": "DISEASE", "pattern": "diabetes"},
    {"label": "DISEASE", "pattern": [{"LOWER": "heart"}, {"LOWER": "disease"}]},
]
ruler.add_patterns(patterns)

doc = nlp("The patient takes aspirin daily for heart disease prevention.")
for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_}")
# aspirin              DRUG
# heart disease        DISEASE
# (the statistical NER may add entities of its own, e.g. a DATE for "daily")
```

## Sentence Segmentation and Text Processing

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith went to Washington. He arrived on Monday. It was cold.")

# Sentence boundaries
for sent in doc.sents:
    print(f"[{sent.start}:{sent.end}] {sent.text}")

# Lemmatization
tokens_lemmatized = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]

# Filter by POS
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
verbs = [token.text for token in doc if token.pos_ == "VERB"]

# Similarity (requires md or lg model with vectors)
nlp_md = spacy.load("en_core_web_md")
doc1 = nlp_md("I like cats")
doc2 = nlp_md("I love dogs")
print(f"Similarity: {doc1.similarity(doc2):.3f}")
```

## Best Practices

1. Use `en_core_web_sm` for speed; use `en_core_web_trf` when accuracy matters most.
2. Process text in batches with `nlp.pipe(texts)` for better throughput.
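3. Disable unused pipeline components: `nlp.select_pipes(enable=["ner"])`, or pass `disable=[...]` to `spacy.load`. With `en_core_web_trf`, keep `transformer` enabled, since the other components depend on it.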
4. Use `PhraseMatcher` for exact term lookups; it is much faster than token `Matcher`.
5. Add `EntityRuler` before `ner` to give rule-based patterns priority.
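6. Use `doc.to_json()` to serialize processed documents for storage; for many documents, `DocBin` (from `spacy.tokens`) gives a more compact binary format.
7. For large texts, increase `nlp.max_length` (default: 1,000,000 characters) or split into paragraphs first.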
8. Always download the model before first use: `python3 -m spacy download <model>`.
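
Splitting into paragraphs first (practice 7) can be sketched in plain Python. This is one possible approach, not a spaCy API: `split_into_chunks` and its `max_chars` parameter are hypothetical names, and paragraphs are assumed to be separated by blank lines.

```python
def split_into_chunks(text: str, max_chars: int = 100_000) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut mid-sentence.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # +2 accounts for the "\n\n" separator re-inserted when joining
        added = len(para) + (2 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [para], len(para)
        else:
            current.append(para)
            current_len += added
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."
chunks = split_into_chunks(sample, max_chars=40)
print(chunks)
```

The resulting chunks can be passed to `nlp.pipe(chunks)`. Token and character offsets are then relative to each chunk, not to the original text.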
