nlp-analysis

Natural language processing for research including text mining, sentiment analysis, topic modeling, named entity recognition, text classification, and corpus analysis. Use when user needs to analyze text data, extract information from documents, do sentiment analysis, topic modeling, or text classification for research purposes. Triggers on "text mining", "sentiment analysis", "topic modeling", "NER", "named entity", "text classification", "word embeddings", "LDA", "corpus analysis", "word frequency", "TF-IDF".

564 stars

Best use case

nlp-analysis is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using nlp-analysis should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/nlp-analysis/SKILL.md --create-dirs "https://raw.githubusercontent.com/beita6969/ScienceClaw/main/skills/nlp-analysis/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/nlp-analysis/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How nlp-analysis Compares

| Feature / Agent | nlp-analysis | Standard Approach |
| --- | --- | --- |
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |

Frequently Asked Questions

What does this skill do?

Natural language processing for research including text mining, sentiment analysis, topic modeling, named entity recognition, text classification, and corpus analysis. Use when user needs to analyze text data, extract information from documents, do sentiment analysis, topic modeling, or text classification for research purposes. Triggers on "text mining", "sentiment analysis", "topic modeling", "NER", "named entity", "text classification", "word embeddings", "LDA", "corpus analysis", "word frequency", "TF-IDF".

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# NLP Analysis

Text mining and NLP for scientific research. Venv: `source /Users/zhangmingda/clawd/.venv/bin/activate`

## Text Preprocessing

```python
import re
import nltk
from collections import Counter

# Download resources (first time)
# nltk.download('punkt')
# nltk.download('punkt_tab')  # newer NLTK releases look for this tokenizer data instead of 'punkt'
# nltk.download('stopwords')
# nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer

def preprocess(text, lang='english'):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words(lang))
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens
```
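
If the NLTK resources cannot be downloaded (for example on a machine without network access), the same lowercase, strip-punctuation, filter steps can be approximated with the standard library. This is a minimal sketch, not a replacement for `preprocess`: the names `simple_preprocess` and `TINY_STOP_WORDS` are hypothetical, the stop list is deliberately tiny, and there is no lemmatization.

```python
import re

# Hypothetical, deliberately tiny stop list; a real analysis would use a fuller one.
TINY_STOP_WORDS = {"the", "and", "are", "was", "for", "with", "that", "this"}

def simple_preprocess(text, stop_words=TINY_STOP_WORDS, min_len=3):
    """Lowercase, strip punctuation, split on whitespace, drop stop words and short tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return [t for t in text.split() if t not in stop_words and len(t) >= min_len]

tokens = simple_preprocess("The cells were stained, and the samples are stored at -80C.")
print(tokens)
```

Whitespace splitting handles contractions and hyphenated terms differently from `word_tokenize`, so token counts from the two approaches are not directly comparable.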

## Text Vectorization

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# TF-IDF (documents is a list of raw text strings)
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), stop_words='english')
X_tfidf = tfidf.fit_transform(documents)

# Get top terms per document
feature_names = tfidf.get_feature_names_out()
for i, doc in enumerate(documents[:3]):
    scores = X_tfidf[i].toarray().flatten()
    top_idx = scores.argsort()[-10:][::-1]
    print(f"Doc {i}: {[feature_names[j] for j in top_idx]}")
```
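
To make the weights above easier to interpret, here is a hand-rolled sketch of the computation that `TfidfVectorizer` performs with its default settings, as I understand them: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document row. The function name `tfidf_by_hand` and the toy corpus are illustrative only, and the sketch skips tokenization, n-grams, and `max_features`.

```python
import math
from collections import Counter

def tfidf_by_hand(tokenized_docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy corpus (illustrative)
toy_docs = [["gene", "expression", "gene"], ["protein", "expression"], ["protein", "folding"]]
rows = tfidf_by_hand(toy_docs)
print(max(rows[0], key=rows[0].get))  # highest-weighted term in the first document
```

A term scores high when it is frequent in one document and rare across the corpus, which is why "gene" outranks "expression" in the first toy document.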

## Topic Modeling

### LDA (Latent Dirichlet Allocation)
```python
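from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# LDA models word counts, so fit it on CountVectorizer output, not TF-IDF
count_vec = CountVectorizer(max_features=5000, stop_words='english')
X_count = count_vec.fit_transform(documents)
count_features = count_vec.get_feature_names_out()

n_topics = 10
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42, max_iter=20)
lda.fit(X_count)

# Print top words per topic (indices refer to the CountVectorizer vocabulary)
for i, topic in enumerate(lda.components_):
    top_words = [count_features[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top_words)}")

# Topic coherence: use gensim for proper evaluation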
```
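
For real evaluation gensim is the right tool, as noted above. To make concrete what a coherence score measures, the following is a simplified sketch of a UMass-style measure: for each pair of top words it asks how often the lower-ranked word co-occurs with the higher-ranked one across documents. The function name `umass_coherence` and the toy corpus are assumptions for illustration, and details such as smoothing and normalization may differ from library implementations.

```python
import math
from itertools import combinations

def umass_coherence(top_words, tokenized_docs):
    """Sum of log((D(w_hi, w_lo) + 1) / D(w_hi)) over ordered pairs of top words.

    top_words must be sorted from most to least probable within the topic.
    Scores closer to zero indicate words that co-occur more often.
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all the given words
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):  # i < j, so word i ranks higher
        w_hi, w_lo = top_words[i], top_words[j]
        d_hi = doc_freq(w_hi)
        if d_hi == 0:
            continue  # word absent from the reference corpus; skip the pair
        score += math.log((doc_freq(w_hi, w_lo) + 1) / d_hi)
    return score

# Toy corpus (illustrative)
toy_docs = [
    ["gene", "expression", "rna"],
    ["gene", "rna", "sequencing"],
    ["protein", "folding", "structure"],
    ["protein", "structure", "binding"],
]
coherent = umass_coherence(["gene", "rna"], toy_docs)
mixed = umass_coherence(["gene", "protein"], toy_docs)
print(coherent > mixed)
```

Running this for the top words of each topic at several values of `n_topics` gives a rough way to compare models before switching to a full gensim evaluation.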

### BERTopic (neural topic modeling)
```python
# pip install bertopic
from bertopic import BERTopic

topic_model = BERTopic(language="english", nr_topics="auto")
topics, probs = topic_model.fit_transform(documents)
topic_model.get_topic_info()
```

## Sentiment Analysis

```python
# Simple lexicon-based (VADER); texts is a list of strings
# nltk.download('vader_lexicon')  # first time
from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
for text in texts:
    scores = sia.polarity_scores(text)
    print(f"{scores['compound']:.3f} | {text[:80]}")

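# Transformer-based (often more accurate, but slower)
# With no model argument, transformers downloads a default English model;
# pass model=... to pin a specific one for reproducibility
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")
results = sentiment(texts)  # list of {'label': ..., 'score': ...} dicts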
```
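
The two approaches return different shapes: VADER gives a `compound` score in [-1, 1], while the transformers pipeline gives a label plus a confidence. A small sketch of putting both on a comparable footing follows. The helper names are hypothetical, the ±0.05 cutoff is the conventional VADER threshold rather than something tuned for your data, and `signed_score` assumes the model emits labels starting with "POS" or "NEG" (other models use different label names).

```python
def vader_label(compound, threshold=0.05):
    """Map a VADER compound score to a three-way label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

def signed_score(result):
    """Map a {'label', 'score'} dict to a value in [-1, 1].

    Assumes binary labels that start with 'POS' or 'NEG'.
    """
    sign = 1.0 if result["label"].upper().startswith("POS") else -1.0
    return sign * result["score"]

# Hypothetical inputs, shaped like the outputs of the snippets above
print(vader_label(0.62), vader_label(-0.31), vader_label(0.01))
print(signed_score({"label": "NEGATIVE", "score": 0.9}))
```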

## Named Entity Recognition

```python
# Using transformers
from transformers import pipeline

ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Albert Einstein developed the theory of relativity at Princeton University.")
for e in entities:
    print(f"{e['entity_group']}: {e['word']} (score: {e['score']:.3f})")
```
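
For corpus-level work the per-mention output usually needs to be filtered and grouped. The sketch below assumes the list-of-dicts shape printed above (`entity_group`, `word`, `score`); the function name `group_entities`, the 0.8 cutoff, and the sample scores are made up for illustration and are not model output.

```python
from collections import defaultdict

def group_entities(entities, min_score=0.8):
    """Group entity strings by type, dropping low-confidence and duplicate mentions."""
    grouped = defaultdict(list)
    for e in entities:
        if e["score"] < min_score:
            continue
        if e["word"] not in grouped[e["entity_group"]]:
            grouped[e["entity_group"]].append(e["word"])
    return dict(grouped)

# Hypothetical entities with invented scores
sample = [
    {"entity_group": "PER", "word": "Albert Einstein", "score": 0.99},
    {"entity_group": "ORG", "word": "Princeton University", "score": 0.97},
    {"entity_group": "PER", "word": "Albert Einstein", "score": 0.98},
    {"entity_group": "MISC", "word": "relativity", "score": 0.41},
]
print(group_entities(sample))
```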

## Text Classification

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# TF-IDF + Logistic Regression (strong baseline)
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ('clf', LogisticRegression(max_iter=1000))
])
scores = cross_val_score(pipe, texts, labels, cv=5, scoring='f1_macro')
print(f"F1 macro: {scores.mean():.3f} ± {scores.std():.3f}")
```
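
The `f1_macro` scorer averages per-class F1 with equal weight per class, which matters when classes are imbalanced. A hand-rolled sketch with a hypothetical function name and toy labels shows how it can diverge from accuracy:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never predicted or present
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy labels: a classifier that always predicts the majority class
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f}  macro F1={macro_f1(y_true, y_pred):.3f}")
```

Here accuracy is 0.8 while macro F1 is about 0.44, because the minority class contributes an F1 of zero.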

## Word Embeddings & Similarity

```python
from sklearn.metrics.pairwise import cosine_similarity

# Using TF-IDF vectors for document similarity
sim_matrix = cosine_similarity(X_tfidf)

# For word-level embeddings, use gensim Word2Vec
# For sentence- or document-level embeddings, use sentence-transformers
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(documents)
```
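
Whichever vectors are used (TF-IDF rows or sentence embeddings), a typical next step is ranking documents by similarity to a query document. The following dependency-free sketch shows the idea; `cosine`, `most_similar`, and the toy vectors are illustrative, and for real corpora the `cosine_similarity` matrix above is the faster route.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def most_similar(query_idx, vectors, k=2):
    """Return the k (index, similarity) pairs closest to vectors[query_idx]."""
    sims = [(j, cosine(vectors[query_idx], v))
            for j, v in enumerate(vectors) if j != query_idx]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)[:k]

# Toy document vectors (illustrative)
toy_vectors = [[1, 0, 1], [2, 0, 2], [0, 1, 0], [1, 1, 0]]
neighbours = most_similar(0, toy_vectors)
print([idx for idx, _ in neighbours])
```

Cosine similarity ignores vector length, so the second toy vector, a scaled copy of the first, ranks as its nearest neighbour.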

## Corpus Statistics

```python
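import numpy as np
from collections import Counter
from nltk.tokenize import word_tokenize

def corpus_stats(documents):
    all_tokens = [word_tokenize(doc.lower()) for doc in documents]
    # Keep alphabetic tokens only for the vocabulary statistics
    all_words = [w for tokens in all_tokens for w in tokens if w.isalpha()]
    if not all_words:
        print("No alphabetic tokens found")
        return

    print(f"Documents: {len(documents)}")
    print(f"Total tokens: {len(all_words)}")
    print(f"Vocabulary size: {len(set(all_words))}")
    print(f"Type-token ratio: {len(set(all_words))/len(all_words):.4f}")
    # Average length counts every token, including punctuation and numbers
    print(f"Avg doc length: {np.mean([len(t) for t in all_tokens]):.1f} tokens")

    freq = Counter(all_words)
    print(f"Top 20 words: {freq.most_common(20)}")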

## Chinese NLP

```python
# For Chinese text, use jieba for segmentation
# pip install jieba
import jieba

text = "自然语言处理是人工智能的重要方向"
words = list(jieba.cut(text))
print(" / ".join(words))

# For Chinese sentiment/NER, use transformers with Chinese models
# e.g., bert-base-chinese, hfl/chinese-roberta-wwm-ext
```

## Tips
- Always report preprocessing steps for reproducibility
- Use multiple topic numbers and evaluate coherence
- For small datasets, TF-IDF + classical ML often beats deep learning
- Report inter-annotator agreement for labeled datasets
- Consider domain-specific stop words and vocabularies
- For Chinese text, run word segmentation (for example with jieba) before any word-level analysis
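
For the inter-annotator agreement tip, Cohen's kappa is a common choice when two annotators label the same items. A minimal sketch follows; the function name `cohens_kappa` and the toy annotations are illustrative, and with more than two annotators a different statistic is needed.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators: (p_o - p_e) / (1 - p_e)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[l] * counts_b[l] for l in set(labels_a) | set(labels_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Toy annotations (illustrative): the annotators disagree on the last two items
ann_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
ann_b = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "neg", "pos"]
kappa = cohens_kappa(ann_a, ann_b)
print(f"kappa = {kappa:.2f}")
```

Raw agreement here is 80%, but kappa is lower because half of that agreement would be expected by chance with two balanced labels.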

Related Skills

statistical-analysis

564
from beita6969/ScienceClaw

Guided statistical analysis with test selection and reporting. Use when you need help choosing appropriate tests for your data, assumption checking, power analysis, and APA-formatted results. Best for academic research reporting, test selection guidance. For implementing specific models programmatically use statsmodels.

social-science-analysis

564
from beita6969/ScienceClaw

Social science research methods including survey design, qualitative analysis, content analysis, network analysis, psychometrics, and mixed methods. Covers sociology, psychology, political science, education, and communication studies. Use when user designs surveys, analyzes qualitative data, does content analysis, builds scales, or uses mixed methods. Triggers on "survey design", "qualitative analysis", "content analysis", "Likert scale", "thematic analysis", "grounded theory", "factor analysis", "SEM", "structural equation", "psychometrics", "interview coding".

scipy-analysis

564
from beita6969/ScienceClaw

Scientific computing and statistical analysis with SciPy, NumPy, and pandas. Use when: (1) statistical hypothesis testing, (2) optimization problems, (3) signal processing, (4) numerical integration, (5) data manipulation and analysis. NOT for: symbolic math (use sympy-math), machine learning (use sklearn directly), or visualization (use matplotlib-viz).

patent-analysis

564
from beita6969/ScienceClaw

Conducts patent landscape analysis including prior art searches, patent claim interpretation, freedom-to-operate assessment, and intellectual property strategy for scientific inventions; trigger when users discuss patents, prior art, IP protection, or technology licensing.

paper-analysis

564
from beita6969/ScienceClaw

Read, summarize, and critically analyze scientific papers. Extract key findings, methodology, limitations, and contributions. Use when user shares a paper (PDF/URL/DOI), asks to summarize a paper, critique methodology, extract data from a paper, compare papers, or do a critical review. Triggers on "summarize this paper", "analyze this study", "what does this paper say", "critique this methodology", "extract findings from".

meta-analysis

564
from beita6969/ScienceClaw

Perform quantitative meta-analysis with effect size calculation, forest plots, funnel plots, and heterogeneity assessment. Use when: user asks to combine results from multiple studies, calculate pooled effect sizes, assess publication bias, or create forest/funnel plots. NOT for: systematic review protocol (use systematic-review) or single-study statistics (use statsmodels-stats).

linguistics-analysis

564
from beita6969/ScienceClaw

Analyze language structures, typological features, and semantic change across languages.

legal-analysis

564
from beita6969/ScienceClaw

Analyze legal contracts, extract clauses, and perform legal research with structured frameworks.

geospatial-analysis

564
from beita6969/ScienceClaw

Performs geospatial data analysis including GIS operations, spatial statistics, remote sensing image processing, geocoding, and cartographic visualization; trigger when users discuss maps, coordinates, satellite imagery, spatial patterns, or geographic data.

genomics-analysis

564
from beita6969/ScienceClaw

Orchestrates a genomics analysis workflow from gene query through expression analysis to pathway enrichment. Use when investigating gene function, analyzing expression data, or performing pathway-level interpretation. NOT for pure protein structure modeling or drug-target interaction analysis.

genome-analysis

564
from beita6969/ScienceClaw

Performs genomics analyses including gene expression profiling, BLAST sequence alignment, GWAS interpretation, variant calling, and genome assembly tasks; trigger when the user mentions DNA/RNA sequences, SNPs, gene panels, or comparative genomics.

exploratory-data-analysis

564
from beita6969/ScienceClaw

Perform comprehensive exploratory data analysis on scientific data files across 200+ file formats. This skill should be used when analyzing any scientific data file to understand its structure, content, quality, and characteristics. Automatically detects file type and generates detailed markdown reports with format-specific analysis, quality metrics, and downstream analysis recommendations. Covers chemistry, bioinformatics, microscopy, spectroscopy, proteomics, metabolomics, and general scientific data formats.