spacy-nlp
Natural language processing with spaCy. Use when: (1) named entity recognition, (2) POS tagging and dependency parsing, (3) text tokenization and linguistic analysis, (4) rule-based pattern matching, (5) custom NER pipelines. NOT for: LLM inference or text generation (use transformers), sentiment analysis at scale (use dedicated models), or machine translation.
Best use case
spacy-nlp is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using spacy-nlp can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/spacy-nlp/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How spacy-nlp Compares
| Feature / Agent | spacy-nlp | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
SKILL.md Source
# spaCy NLP
Natural language processing using spaCy for tokenization, named entity
recognition, dependency parsing, and linguistic analysis.
## When to Use
- Named entity recognition (NER) on text
- Part-of-speech tagging and morphological analysis
- Dependency parsing and syntactic analysis
- Rule-based pattern matching in text
- Custom NER pipeline creation
- Tokenization and sentence segmentation
- Lemmatization and linguistic feature extraction
## When NOT to Use
- LLM inference or text generation (use transformers/huggingface)
- Sentiment analysis at scale (use fine-tuned classifiers)
- Machine translation (use dedicated MT models)
- Topic modeling (use gensim or sklearn)
- Simple regex-only text search (use re module)
## Setup and Model Download
```bash
# Download models (run once before using)
python3 -m spacy download en_core_web_sm # small, fast, ~12MB
python3 -m spacy download en_core_web_md # medium with word vectors, ~40MB
python3 -m spacy download en_core_web_trf # transformer-based, most accurate
```
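Code can also check whether a model package is installed before loading it. The following is a minimal sketch, assuming a fallback to a blank English pipeline is acceptable for the task; `load_or_blank` is a hypothetical helper name, not part of spaCy.

```python
import spacy
from spacy.util import is_package


def load_or_blank(name: str = "en_core_web_sm"):
    """Load an installed model package, or fall back to a blank pipeline.

    Hypothetical helper: the blank fallback only tokenizes (no tagger,
    parser, or NER), so callers should inspect nlp.pipe_names.
    """
    if is_package(name):
        return spacy.load(name)
    return spacy.blank("en")


nlp = load_or_blank("en_core_web_sm")
print(nlp.pipe_names)  # an empty list means the fallback was used
```

In practice it may be preferable to raise an error instead of falling back silently, since most examples in this document need the trained components.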
## Basic Pipeline
```python
import spacy
# Load a model
nlp = spacy.load("en_core_web_sm")
# Process text
doc = nlp("Apple is looking at buying U.K. startup for $1 billion.")
# Token-level attributes
for token in doc:
    print(f"{token.text:12} {token.pos_:6} {token.dep_:10} {token.lemma_}")
# Apple        PROPN  nsubj      Apple
# is           AUX    aux        be
# looking      VERB   ROOT       look
# ...
```
## Named Entity Recognition
```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976.")
# Extract entities
for ent in doc.ents:
    print(f"{ent.text:25} {ent.label_:10} {ent.start_char}-{ent.end_char}")
# Illustrative output; exact spans and labels can vary by model version
# Apple Inc.                ORG        0-10
# Steve Jobs                PERSON     26-36
# Cupertino, California     GPE        40-61
# 1976                      DATE       65-69
# Common entity labels: PERSON, ORG, GPE, DATE, MONEY, PRODUCT, EVENT, LOC
# Get explanation of labels
print(spacy.explain("GPE")) # "Countries, cities, states"
```
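Downstream code often wants entities grouped by label. The sketch below shows one way to do that; `entities_by_label` is a hypothetical helper, and the entities are set by hand on a blank pipeline (using the character offsets from the example above) so that it runs without a downloaded model.

```python
from collections import defaultdict

import spacy


def entities_by_label(doc):
    """Group entity texts by their label, preserving document order."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)


# Blank pipeline: tokenizer only, so entities are assigned manually here.
# With a trained model, doc.ents would be filled in by the NER component.
nlp = spacy.blank("en")
doc = nlp("Apple Inc. was founded by Steve Jobs.")
doc.ents = [
    doc.char_span(0, 10, label="ORG"),
    doc.char_span(26, 36, label="PERSON"),
]

print(entities_by_label(doc))
```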
## Dependency Parsing
```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("The quick brown fox jumps over the lazy dog.")
# Dependency tree
for token in doc:
    print(f"{token.text:10} --{token.dep_:10}--> {token.head.text}")
# Noun chunks (base noun phrases)
for chunk in doc.noun_chunks:
    print(f"{chunk.text:25} root={chunk.root.text}, head={chunk.root.head.text}")
# The quick brown fox       root=fox, head=jumps
# the lazy dog              root=dog, head=over
# Find subject and object of a verb
# (this sentence has no direct object: "dog" attaches to "over" as pobj)
for token in doc:
    if token.dep_ == "nsubj":
        print(f"Subject: {token.text} of verb: {token.head.text}")
    if token.dep_ == "dobj":
        print(f"Object: {token.text} of verb: {token.head.text}")
```
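The subject and object lookups can be combined into subject-verb-object triples. This is a rough sketch only: `subject_verb_object` is a hypothetical helper, it handles only plain `nsubj`/`dobj` pairs, and the parse is supplied by hand through the `Doc` constructor so the example runs without a trained parser.

```python
import spacy
from spacy.tokens import Doc


def subject_verb_object(doc):
    """Return (subject, verb, object) text triples for nsubj/dobj pairs."""
    triples = []
    for token in doc:
        if token.dep_ != "nsubj":
            continue
        verb = token.head
        for child in verb.children:
            if child.dep_ == "dobj":
                triples.append((token.text, verb.text, child.text))
    return triples


# Hand-annotated parse (assumed, for illustration); heads are absolute indices.
vocab = spacy.blank("en").vocab
words = ["Apple", "bought", "a", "startup"]
heads = [1, 1, 3, 1]
deps = ["nsubj", "ROOT", "det", "dobj"]
doc = Doc(vocab, words=words, heads=heads, deps=deps)

print(subject_verb_object(doc))
```

With a trained pipeline, the same function can be applied directly to `nlp(text)`.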
## Pattern Matching
```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher
nlp = spacy.load("en_core_web_sm")
# Token-based pattern matching
matcher = Matcher(nlp.vocab)
# Pattern: one or more adjectives followed by one or more nouns
pattern = [{"POS": "ADJ", "OP": "+"}, {"POS": "NOUN", "OP": "+"}]
# greedy="LONGEST" keeps only the longest of any overlapping matches
matcher.add("ADJ_NOUN", [pattern], greedy="LONGEST")
doc = nlp("The bright blue sky and cold winter morning greeted us.")
matches = matcher(doc)
for match_id, start, end in sorted(matches, key=lambda m: m[1]):
    span = doc[start:end]
    print(f"Match: {span.text}")
# Expected, assuming "bright", "blue", and "cold" are tagged ADJ:
# Match: bright blue sky
# Match: cold winter morning
# Phrase matching (exact phrase lookup, very fast)
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["machine learning", "deep learning", "natural language processing"]
patterns = [nlp.make_doc(term) for term in terms]
phrase_matcher.add("TECH_TERMS", patterns)
doc = nlp("This paper covers machine learning and natural language processing.")
matches = phrase_matcher(doc)
for match_id, start, end in matches:
    print(f"Found: {doc[start:end].text}")
```
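Each match carries a `match_id` hash, not the rule name. Below is a small sketch of resolving it back through `nlp.vocab.strings`. It uses a blank pipeline and lexical attributes only, so no model download is needed; the rule names and sample text are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only; LOWER, ORTH, LIKE_NUM need no model
matcher = Matcher(nlp.vocab)
matcher.add(
    "GREETING",
    [[{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]],
)
matcher.add("PRICE", [[{"ORTH": "$"}, {"LIKE_NUM": True}]])

doc = nlp("Hello, world! Tickets cost $25 today.")

# Resolve each hash to its rule name and sort by start position
found = sorted(
    (start, nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
)
for start, rule, text in found:
    print(f"{rule:10} {text}")
```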
## Custom Entity Rules
```python
import spacy
nlp = spacy.load("en_core_web_sm")
# Add entity ruler before the NER component
# (add_pipe takes the factory name, so EntityRuler need not be imported)
ruler = nlp.add_pipe("entity_ruler", before="ner")
patterns = [
    {"label": "DRUG", "pattern": "aspirin"},
    {"label": "DRUG", "pattern": [{"LOWER": "vitamin"}, {"LOWER": "d"}]},
    {"label": "DISEASE", "pattern": "diabetes"},
    {"label": "DISEASE", "pattern": [{"LOWER": "heart"}, {"LOWER": "disease"}]},
]
ruler.add_patterns(patterns)
doc = nlp("The patient takes aspirin daily for heart disease prevention.")
for ent in doc.ents:
    print(f"{ent.text:20} {ent.label_}")
# Expected ruler matches (the statistical NER may add others):
# aspirin              DRUG
# heart disease        DISEASE
```
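Entity ruler patterns can also carry an `id`, which is exposed as `ent.ent_id_` and can map several surface forms to one canonical entry. The sketch below assumes a blank pipeline with only the ruler, so it runs without a trained model; the ids and sample sentence are illustrative.

```python
import spacy

nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        # Two surface forms share one illustrative id
        {"label": "DRUG", "pattern": [{"LOWER": "aspirin"}], "id": "aspirin"},
        {
            "label": "DRUG",
            "pattern": [{"LOWER": "acetylsalicylic"}, {"LOWER": "acid"}],
            "id": "aspirin",
        },
    ]
)

doc = nlp("Acetylsalicylic acid is sold as Aspirin.")
normalized = [(ent.text, ent.label_, ent.ent_id_) for ent in doc.ents]
print(normalized)
```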
## Sentence Segmentation and Text Processing
```python
nlp = spacy.load("en_core_web_sm")
doc = nlp("Dr. Smith went to Washington. He arrived on Monday. It was cold.")
# Sentence boundaries
for sent in doc.sents:
    print(f"[{sent.start}:{sent.end}] {sent.text}")
# Lemmatization
tokens_lemmatized = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
# Filter by POS
nouns = [token.text for token in doc if token.pos_ == "NOUN"]
verbs = [token.text for token in doc if token.pos_ == "VERB"]
# Similarity (requires md or lg model with vectors)
nlp_md = spacy.load("en_core_web_md")
doc1 = nlp_md("I like cats")
doc2 = nlp_md("I love dogs")
print(f"Similarity: {doc1.similarity(doc2):.3f}")
```
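When only sentence boundaries are needed, the rule-based `sentencizer` component is a lighter alternative to the parser. A minimal sketch follows, assuming punctuation-based splitting is good enough for the text at hand. It is less robust than the parser on abbreviations and unusual punctuation.

```python
import spacy

# Blank pipeline plus the rule-based sentencizer: no model download needed
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("He arrived on Monday. It was cold. Nobody minded.")
sentences = [sent.text for sent in doc.sents]
print(sentences)
```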
## Best Practices
1. Use `en_core_web_sm` for speed; use `en_core_web_trf` when accuracy matters most.
2. Process text in batches with `nlp.pipe(texts)` for better throughput.
3. Disable unused pipeline components: `nlp.select_pipes(enable=["ner"])`.
4. Use `PhraseMatcher` for exact term lookups; it is much faster than token `Matcher`.
5. Add `EntityRuler` before `ner` to give rule-based patterns priority.
6. Use `doc.to_json()` to serialize processed documents for storage.
7. For large texts, increase `nlp.max_length` or split into paragraphs first.
8. Always download the model before first use: `python3 -m spacy download <model>`.
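The batching and component-selection items under Best Practices (`nlp.pipe` and `nlp.select_pipes`) can be sketched together as follows. This assumes a blank pipeline with a sentencizer and an entity ruler standing in for a trained model; with `en_core_web_sm` the same pattern would apply with `enable=["ner"]`.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "ORG", "pattern": "Apple"}])

texts = [
    "Apple opened a store.",
    "Nothing to see here.",
    "Apple and more Apple.",
]

# Temporarily run only the entity ruler; other components are restored
# automatically when the with-block exits.
with nlp.select_pipes(enable=["entity_ruler"]):
    active = list(nlp.pipe_names)
    # nlp.pipe streams documents in batches instead of one call per text
    results = [[ent.text for ent in doc.ents] for doc in nlp.pipe(texts, batch_size=2)]

print(active)
print(results)
print(nlp.pipe_names)
```

Using `select_pipes` as a context manager keeps the disabled state local to one block of code, which avoids surprising later callers that expect the full pipeline.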