assessment-design-guide
Psychometrics and educational assessment design for researchers
Best use case
assessment-design-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using assessment-design-guide can expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/assessment-design-guide/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How assessment-design-guide Compares
| Feature / Agent | assessment-design-guide | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
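It supports psychometrics and educational assessment design for researchers: designing, validating, and analyzing assessments with classical test theory, item response theory, validity evidence, and computerized adaptive testing.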
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Assessment Design Guide
A skill for designing, validating, and analyzing educational assessments using modern psychometric methods. Covers classical test theory, item response theory, test construction, validity evidence, and computerized adaptive testing.
## Classical Test Theory
### Reliability Analysis
Classical test theory (CTT) models observed scores as the sum of a true score and error:
```
X = T + E
```
Key reliability coefficients:
| Coefficient | Method | Interpretation |
|-------------|--------|----------------|
| Cronbach's alpha | Internal consistency | Homogeneity of items |
| Test-retest | Stability over time | Temporal consistency |
| Parallel forms | Equivalent test versions | Form equivalence |
| Split-half (Spearman-Brown) | Odd-even item split | Internal consistency |
| Inter-rater (Cohen's kappa) | Two raters, categorical ratings | Scoring agreement beyond chance |
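The internal-consistency coefficients in the table can be computed directly from an examinee-by-item score matrix. The sketch below is one minimal way to do so. The function names `cronbach_alpha` and `split_half_reliability` and the simulated data are illustrative assumptions, not part of the original skill.

```python
import numpy as np
import pandas as pd


def cronbach_alpha(responses: pd.DataFrame) -> float:
    """Cronbach's alpha from an examinee-by-item score matrix."""
    k = responses.shape[1]
    item_variances = responses.var(axis=0, ddof=1)
    total_variance = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)


def split_half_reliability(responses: pd.DataFrame) -> float:
    """Odd-even split-half correlation, stepped up with the Spearman-Brown formula."""
    odd_total = responses.iloc[:, 0::2].sum(axis=1)
    even_total = responses.iloc[:, 1::2].sum(axis=1)
    r_half = odd_total.corr(even_total)
    return 2 * r_half / (1 + r_half)


# Hypothetical simulated data: 200 examinees, 10 binary items driven by one ability
rng = np.random.default_rng(0)
ability = rng.normal(size=(200, 1))
item_difficulty = np.linspace(-1.5, 1.5, 10)
prob_correct = 1 / (1 + np.exp(-(ability - item_difficulty)))
demo = pd.DataFrame((rng.random((200, 10)) < prob_correct).astype(int))

print("alpha:", round(cronbach_alpha(demo), 3))
print("split-half:", round(split_half_reliability(demo), 3))
```

Both functions assume complete data with no missing responses; real datasets usually need a missing-data decision first.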
```python
import pandas as pd


def item_analysis(responses: pd.DataFrame, total_scores: pd.Series) -> pd.DataFrame:
    """
    Classical item analysis: difficulty, discrimination, point-biserial.
    responses: binary DataFrame (1=correct, 0=incorrect), items as columns.
    total_scores: total test score for each examinee.
    """
    # Upper/lower groups (top/bottom 27% of total scores) are the same for every item
    cutoff_high = total_scores.quantile(0.73)
    cutoff_low = total_scores.quantile(0.27)

    results = []
    for item in responses.columns:
        scores = responses[item]
        difficulty = scores.mean()  # p-value (proportion correct)

        # Corrected point-biserial: correlate the item with the total score
        # excluding that item, so the item does not inflate its own correlation
        corr = scores.corr(total_scores - scores)

        # Upper-lower discrimination index D
        upper = scores[total_scores >= cutoff_high].mean()
        lower = scores[total_scores <= cutoff_low].mean()
        discrimination = upper - lower

        results.append({
            "item": item,
            "difficulty": round(difficulty, 3),
            "discrimination": round(discrimination, 3),
            "point_biserial": round(corr, 3),
            "flag": "review" if difficulty < 0.2 or difficulty > 0.9
                    or discrimination < 0.2 else "ok",
        })
    return pd.DataFrame(results)
```
### Item Selection Guidelines
- **Difficulty**: Aim for item p-values (proportion correct) between 0.30 and 0.80; item variance, and with it the potential to discriminate, peaks near 0.50
- **Discrimination**: Items with D < 0.20 should be revised or removed
- **Distractors**: Each distractor should attract at least 5% of examinees
- **Point-biserial**: Should be positive and ideally above 0.25
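The distractor guideline above can be checked per item by tabulating how often each option is chosen. The following is a minimal sketch; the function name `distractor_analysis`, its parameters, and the example responses are hypothetical, and the 5% threshold is taken from the guideline.

```python
import pandas as pd


def distractor_analysis(choices: pd.Series, key: str, options: list,
                        min_share: float = 0.05) -> pd.DataFrame:
    """Share of examinees choosing each option of one multiple-choice item."""
    # reindex so that options nobody chose still appear with a share of 0
    shares = choices.value_counts(normalize=True).reindex(options, fill_value=0.0)
    table = pd.DataFrame({"option": shares.index, "share": shares.values})
    table["is_key"] = table["option"] == key
    # Only distractors are flagged; the keyed option is judged by item difficulty instead
    table["flag"] = [
        "ok" if is_key or share >= min_share else "review"
        for is_key, share in zip(table["is_key"], table["share"])
    ]
    return table


# Hypothetical responses of 100 examinees to one item keyed "B"
choices = pd.Series(["A"] * 20 + ["B"] * 60 + ["C"] * 18 + ["D"] * 2)
print(distractor_analysis(choices, key="B", options=["A", "B", "C", "D"]))
```

In this example option D attracts 2% of examinees and is flagged for review.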
## Item Response Theory
### The Three-Parameter Logistic Model
Where CTT works at the level of total scores, IRT models the probability of a correct response to each item as a function of examinee ability and item parameters, placing examinees and items on a common scale:
```python
import numpy as np


def irt_3pl(theta: float, a: float, b: float, c: float) -> float:
    """
    Three-parameter logistic IRT model.
    theta: examinee ability (typically -3 to +3)
    a: discrimination parameter (slope, typically 0.5 to 2.5)
    b: difficulty parameter (location, same scale as theta)
    c: guessing parameter (lower asymptote, typically 0.0 to 0.35)
    Returns: probability of correct response
    """
    exponent = -a * (theta - b)
    return c + (1 - c) / (1 + np.exp(exponent))


# Item characteristic curves for three items
thetas = np.linspace(-3, 3, 100)
item_easy = [irt_3pl(t, a=1.0, b=-1.0, c=0.2) for t in thetas]
item_medium = [irt_3pl(t, a=1.5, b=0.0, c=0.2) for t in thetas]
item_hard = [irt_3pl(t, a=1.2, b=1.5, c=0.2) for t in thetas]
```
### IRT Model Estimation
```python
# Using the 'mirt' package in R (called via rpy2 or standalone)
# R code for fitting a 2PL model:
r_code = """
library(mirt)
# responses: binary matrix (examinees x items)
mod <- mirt(responses, model = 1, itemtype = "2PL")
# Item parameters
coef(mod, simplify = TRUE)
# Ability estimates (Expected A Posteriori)
theta_hat <- fscores(mod, method = "EAP")
# Model fit
M2(mod) # limited-information fit statistic
itemfit(mod, fit_stats = "S_X2")
"""
```
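Before fitting a model to real data, it can help to check parameter recovery on simulated responses in the same examinees-by-items layout that the R code expects. The sketch below generates such a matrix from a 2PL model; the function name `simulate_2pl` and all generating parameters are hypothetical choices for illustration.

```python
import numpy as np


def simulate_2pl(theta, a, b, seed=0):
    """Simulate a binary response matrix (examinees x items) from a 2PL model."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)[:, None]  # one row per examinee
    a = np.asarray(a, dtype=float)                   # one entry per item
    b = np.asarray(b, dtype=float)
    prob = 1 / (1 + np.exp(-a * (theta - b)))        # shape (examinees, items)
    return (rng.random(prob.shape) < prob).astype(int)


# Hypothetical generating parameters: 1,000 examinees and 5 items
rng = np.random.default_rng(1)
true_theta = rng.normal(size=1000)
true_a = np.array([0.8, 1.0, 1.2, 1.5, 2.0])
true_b = np.array([-1.5, -0.5, 0.0, 0.5, 1.5])
responses = simulate_2pl(true_theta, true_a, true_b)

# Proportion correct per item; harder items (larger b) should be answered correctly less often
print(responses.mean(axis=0).round(2))
```

The resulting matrix could be written to CSV and read into R, where the estimated `a` and `b` values can be compared with the generating ones.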
### Model Comparison
| Model | Parameters | Use Case |
|-------|-----------|----------|
| Rasch (1PL) | b only | Equal discrimination assumed; measurement-focused |
| 2PL | a, b | Different discrimination; general purpose |
| 3PL | a, b, c | Multiple choice with guessing |
| Graded Response | a, b_k | Likert-scale or partial credit items |
| Nominal Response | a_k, c_k | Multiple choice with informative distractors |
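For polytomous items, the graded response model in the table replaces the single difficulty `b` with ordered thresholds `b_k`. The sketch below shows how category probabilities follow from differences of cumulative 2PL-style curves; the function name `grm_category_probs` and the example parameter values are assumptions made for illustration.

```python
from typing import Sequence

import numpy as np


def grm_category_probs(theta: float, a: float, thresholds: Sequence[float]) -> np.ndarray:
    """Category probabilities for one item under the graded response model."""
    b = np.asarray(thresholds, dtype=float)             # ordered thresholds b_1 < ... < b_{K-1}
    p_at_or_above = 1 / (1 + np.exp(-a * (theta - b)))  # P(X >= k) for k = 1..K-1
    cumulative = np.concatenate(([1.0], p_at_or_above, [0.0]))
    return cumulative[:-1] - cumulative[1:]             # P(X = k) for k = 0..K-1


# Hypothetical 4-category Likert item evaluated at average ability
probs = grm_category_probs(theta=0.0, a=1.5, thresholds=[-1.0, 0.0, 1.0])
print(probs.round(3), probs.sum())
```

The probabilities are non-negative only when the thresholds are in increasing order, so inputs should be checked accordingly.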
## Validity Evidence
### The Unified Validity Framework
Following the Standards for Educational and Psychological Testing (AERA/APA/NCME, 2014), validity is a unitary concept supported by five sources of evidence:
1. **Content evidence**: Expert review confirms items represent the construct domain
2. **Response process evidence**: Think-aloud protocols confirm examinees engage intended cognitive processes
3. **Internal structure evidence**: Factor analysis confirms dimensionality matches the test blueprint
4. **Relations to other variables**: Correlations with external criteria (convergent, discriminant, predictive)
5. **Consequences evidence**: Test use leads to intended benefits without unintended harm
```python
import pandas as pd
from factor_analyzer import FactorAnalyzer

# Exploratory check of dimensionality (FactorAnalyzer fits an exploratory model).
# item_responses: examinee-by-item DataFrame; item_names: list of its column labels.
fa = FactorAnalyzer(n_factors=3, rotation="promax")
fa.fit(item_responses)

# Eigenvalues for scree plot
eigenvalues, _ = fa.get_eigenvalues()
print("Eigenvalues:", eigenvalues[:10])

# Factor loadings
loadings = pd.DataFrame(
    fa.loadings_,
    columns=["Factor1", "Factor2", "Factor3"],
    index=item_names,
)
print(loadings.round(3))
```
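The number of factors passed to the analysis is itself a decision that internal structure evidence should justify. One common aid is Horn's parallel analysis, which compares observed eigenvalues with those from random data of the same shape. The sketch below is a simplified version using Pearson correlations (for binary items, tetrachoric correlations are often preferred); the function name `parallel_analysis` and the simulated two-factor data are hypothetical.

```python
import numpy as np


def parallel_analysis(data: np.ndarray, n_sims: int = 100, seed: int = 0) -> int:
    """Suggest a number of factors by comparing eigenvalues with random-data eigenvalues."""
    rng = np.random.default_rng(seed)
    n_obs, n_items = data.shape
    observed = np.sort(np.linalg.eigvalsh(np.corrcoef(data, rowvar=False)))[::-1]

    simulated = np.empty((n_sims, n_items))
    for s in range(n_sims):
        noise = rng.normal(size=(n_obs, n_items))
        simulated[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False)))[::-1]

    # 95th percentile of random eigenvalues at each position
    threshold = np.percentile(simulated, 95, axis=0)
    exceeds = observed > threshold
    # Count leading eigenvalues that exceed the threshold (index of the first failure)
    return n_items if exceeds.all() else int(np.argmin(exceeds))


# Hypothetical data: 500 examinees, 6 continuous item scores, two independent factors
rng = np.random.default_rng(1)
factors = rng.normal(size=(500, 2))
loadings = np.array([[0.8, 0.0]] * 3 + [[0.0, 0.8]] * 3)  # shape (6, 2)
demo = factors @ loadings.T + 0.6 * rng.normal(size=(500, 6))
print("suggested factors:", parallel_analysis(demo))
```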
## Computerized Adaptive Testing
### CAT Algorithm
Computerized adaptive testing selects items in real time to match examinee ability:
```
Initialize: theta_0 = 0 (prior mean)
For each item i = 1, 2, ..., until stopping rule met:
    1. Select item with maximum Fisher information at current theta
    2. Administer item, observe response
    3. Update theta estimate using maximum likelihood or Bayesian EAP
    4. Check stopping rule:
       - Fixed length (e.g., 30 items)
       - SE(theta) < threshold (e.g., 0.30)
       - Maximum time reached
Return: final theta estimate and standard error
```
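The loop above can be sketched as a small simulation. This version makes several simplifying assumptions that are not in the original: a 2PL model (no guessing parameter), EAP scoring on a fixed grid with a standard normal prior, responses simulated from a known true ability, no time limit, and a randomly generated item bank. All function names are hypothetical.

```python
import numpy as np


def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1 / (1 + np.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Item information under the 2PL model: a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)


def eap_estimate(responses, a, b, grid):
    """Posterior mean and SD of theta on a grid, with a standard normal prior."""
    p = p_correct(grid[:, None], a, b)  # shape (grid points, items)
    likelihood = np.prod(np.where(responses == 1, p, 1 - p), axis=1)
    posterior = np.exp(-0.5 * grid**2) * likelihood
    posterior /= posterior.sum()
    theta_hat = float(np.sum(grid * posterior))
    se = float(np.sqrt(np.sum((grid - theta_hat) ** 2 * posterior)))
    return theta_hat, se


def run_cat(true_theta, a, b, max_items=30, se_target=0.30, seed=0):
    """Simulate one adaptive test against a calibrated item bank."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-4, 4, 81)
    available = np.ones(len(a), dtype=bool)
    administered, responses = [], []
    theta_hat, se = 0.0, np.inf  # start at the prior mean
    # The loop condition encodes the fixed-length and SE stopping rules
    while available.any() and len(administered) < max_items and se >= se_target:
        # 1. Select the unused item with maximum information at the current estimate
        info = np.where(available, fisher_information(theta_hat, a, b), -np.inf)
        item = int(np.argmax(info))
        available[item] = False
        # 2. Administer: here the response is simulated from the true ability
        responses.append(int(rng.random() < p_correct(true_theta, a[item], b[item])))
        administered.append(item)
        # 3. Update the ability estimate and its standard error
        theta_hat, se = eap_estimate(np.array(responses), a[administered], b[administered], grid)
    return theta_hat, se, administered


# Hypothetical item bank: 200 items with random discrimination and difficulty
bank_rng = np.random.default_rng(42)
a_bank = bank_rng.uniform(0.8, 2.0, size=200)
b_bank = bank_rng.uniform(-3.0, 3.0, size=200)
theta_hat, se, used = run_cat(true_theta=1.0, a=a_bank, b=b_bank)
print(f"theta = {theta_hat:.2f}, SE = {se:.2f}, items used = {len(used)}")
```

EAP is used here rather than maximum likelihood because the likelihood has no finite maximum while all responses so far are correct or all incorrect, which is common in the first few items.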
### Item Exposure Control
To prevent overuse of high-quality items and maintain test security:
- **Sympson-Hetter method**: Set maximum exposure rates per item (e.g., 0.25)
- **a-stratified method**: Divide item bank into strata by discrimination, sample within strata
- **Shadow test approach**: Assemble full shadow tests at each step, administer the optimal item from the shadow test
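As an illustration of the first method, the sketch below shows only the administration step of a Sympson-Hetter style filter: each item carries a control parameter, and a selected item is actually administered with that probability. The function name `select_with_exposure_control` and the parameter values in `k` are hypothetical. In practice the control parameters are calibrated through repeated simulations so that observed exposure stays under the target rate, and that calibration is not shown here.

```python
import numpy as np


def select_with_exposure_control(info, available, k, rng):
    """Pick the most informative available item that passes a probabilistic exposure filter.

    info: item information at the current ability estimate
    available: boolean mask of items not yet considered for this examinee (modified in place)
    k: exposure control parameter per item, P(administer | selected)
    """
    order = np.argsort(-np.asarray(info, dtype=float))  # most informative first
    for item in order:
        if not available[item]:
            continue
        # A selected item is set aside for this examinee whether or not it is administered
        available[item] = False
        if rng.random() < k[item]:
            return int(item)
    return None  # no item passed the filter


rng = np.random.default_rng(0)
info = np.array([0.2, 0.9, 0.5, 0.7])
k = np.array([1.0, 0.25, 1.0, 1.0])  # hypothetical: the best item is heavily restricted
available = np.ones(4, dtype=bool)
chosen = select_with_exposure_control(info, available, k, rng)
print("administer item", chosen)
```

With these values the most informative item (index 1) is administered about a quarter of the time; otherwise the next best item (index 3) is used.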
## Tools and Software
- **R mirt package**: Full-featured IRT estimation, DIF analysis, CAT simulation
- **Python irt library (py-irt)**: Bayesian IRT models using PyTorch
- **jMetrik**: Open-source Java application for classical and IRT analysis
- **TAO (Testing Assisté par Ordinateur)**: Open-source assessment delivery platform
- **Concerto**: Open-source adaptive testing platform from Cambridge
## Key References
- Embretson, S.E. and Reise, S.P. (2000). *Item Response Theory for Psychologists*. Lawrence Erlbaum.
- de Ayala, R.J. (2022). *The Theory and Practice of Item Response Theory* (2nd ed.). Guilford Press.
- AERA, APA, and NCME (2014). *Standards for Educational and Psychological Testing*.