obliteratus-abliteration

One-click model liberation toolkit for removing refusal behaviors from LLMs via surgical abliteration techniques

3,823 stars

Best use case

obliteratus-abliteration is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using obliteratus-abliteration should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

  • You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

  • You only need a quick one-off answer and do not need a reusable workflow.
  • You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/obliteratus-abliteration/SKILL.md --create-dirs "https://raw.githubusercontent.com/openclaw/skills/main/skills/adisinghstudent/obliteratus-abliteration/SKILL.md"

Manual Installation

  1. Download SKILL.md from GitHub
  2. Place it in .claude/skills/obliteratus-abliteration/SKILL.md inside your project
  3. Restart your AI agent — it will auto-discover the skill

How obliteratus-abliteration Compares

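| Feature / Agent | obliteratus-abliteration | Standard Approach |
|-----------------|--------------------------|-------------------|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |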

Frequently Asked Questions

What does this skill do?

One-click model liberation toolkit for removing refusal behaviors from LLMs via surgical abliteration techniques

Where can I find the source code?

You can find the source code on GitHub using the link provided at the top of the page.

SKILL.md Source

# OBLITERATUS — LLM Abliteration Toolkit

> Skill by [ara.so](https://ara.so) — Daily 2026 Skills collection.

OBLITERATUS is an open-source toolkit for identifying and surgically removing refusal behaviors from large language models using mechanistic interpretability techniques (abliteration). It locates refusal directions in a model's hidden states via SVD/PCA and projects them out of the weights, while aiming to preserve core language capabilities. It ships with a Gradio UI, CLI, Python API, and Colab notebook.
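
The two operations named above, estimating a direction in hidden-state space and projecting it out of a weight matrix, can be illustrated on synthetic data. The following is a minimal NumPy sketch and not the toolkit's implementation: the arrays are random stand-ins for real activations, and `project_out`, `acts_a`, `acts_b`, and `planted` are hypothetical names introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_in, n = 16, 8, 200

# Synthetic stand-ins for hidden states collected on two prompt groups.
# Group A is shifted along one planted unit direction; group B is not.
planted = rng.normal(size=d_model)
planted /= np.linalg.norm(planted)
acts_a = rng.normal(size=(n, d_model)) + 3.0 * planted
acts_b = rng.normal(size=(n, d_model))

# Mean-difference estimate of the direction separating the two groups.
direction = acts_a.mean(axis=0) - acts_b.mean(axis=0)
direction /= np.linalg.norm(direction)

# SVD estimate: top right-singular vector of the per-sample differences.
_, _, vt = np.linalg.svd(acts_a - acts_b, full_matrices=False)
svd_direction = vt[0]  # sign is arbitrary

def project_out(weight, unit_dir, strength=1.0):
    """Return weight - strength * r r^T weight for a (d_model x d_in) matrix
    whose outputs live in the same space as unit_dir."""
    return weight - strength * np.outer(unit_dir, unit_dir @ weight)

weight = rng.normal(size=(d_model, d_in))
projected = project_out(weight, direction)

cos_mean = float(direction @ planted)            # alignment of mean-difference estimate
cos_svd = float(abs(svd_direction @ planted))    # alignment of SVD estimate (sign-free)
residual = float(np.abs(direction @ projected).max())  # what is left along the direction
print(cos_mean, cos_svd, residual)
```

With `strength=1.0` the projected matrix can no longer write anything along `direction`; values below 1.0 leave a proportional remainder, which is the knob the `strength` setting later in this document appears to expose.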

---

## Installation

```bash
# Core install
pip install obliteratus

# With Gradio UI support
pip install "obliteratus[spaces]"

# With all optional analysis modules
pip install "obliteratus[full]"

# From source (latest)
git clone https://github.com/elder-plinius/OBLITERATUS
cd OBLITERATUS
pip install -e ".[full]"
```

**Requirements:**
- Python 3.10+
- PyTorch 2.1+ with CUDA (recommended) or CPU
- `transformers`, `accelerate`, `gradio>=5.29.0`
- HuggingFace account + token for gated models

```bash
export HF_TOKEN=your_hf_token_here
huggingface-cli login
```

---

## CLI — Key Commands

```bash
# Basic obliteration (default method)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct

# Advanced method (whitened SVD + bias projection + iterative refinement)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method advanced

# Analysis-informed pipeline (auto-configures from geometry analysis)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct --method informed

# Specify output directory and push to Hub
obliteratus obliterate mistralai/Mistral-7B-Instruct-v0.3 \
  --method advanced \
  --output ./my-liberated-model \
  --push-to-hub your-username/mistral-7b-liberated

# LoRA-based reversible ablation (non-destructive)
obliteratus obliterate meta-llama/Llama-3.1-8B-Instruct \
  --method lora \
  --lora-rank 1

# Strength sweep — find the capability/compliance tradeoff
obliteratus sweep meta-llama/Llama-3.1-8B-Instruct \
  --strengths 0.2,0.4,0.6,0.8,1.0

# Run analysis modules only (no modification)
obliteratus analyze meta-llama/Llama-3.1-8B-Instruct \
  --modules concept_cone,alignment_imprint,universality

# Benchmark: compare methods on a model
obliteratus benchmark meta-llama/Llama-3.1-8B-Instruct \
  --methods basic,advanced,informed

# Launch local Gradio UI
obliteratus ui
obliteratus ui --port 8080 --share
obliteratus ui --no-telemetry
```

---

## Python API

### Basic obliteration

```python
from obliteratus import Obliterator

# Initialize with a HuggingFace model ID or local path
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")

# Run the full pipeline: SUMMON → PROBE → DISTILL → EXCISE → VERIFY → REBIRTH
result = obl.obliterate(method="advanced")

print(result.perplexity_delta)    # capability preservation metric
print(result.refusal_rate_delta)  # refusal reduction
print(result.output_path)         # where the model was saved
```
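
The source does not define how `perplexity_delta` is computed. Assuming it is the difference between the modified and original model's perplexity on the same text, the quantity can be sketched from per-token log-probabilities as follows. The `perplexity` helper and the sample numbers are hypothetical and only illustrate the arithmetic.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp of the mean negative log-likelihood (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Made-up per-token log-probabilities for the same text under two models.
baseline_logprobs = [-1.2, -0.4, -2.1, -0.9]
modified_logprobs = [-1.3, -0.5, -2.4, -1.0]

# Assumed definition: positive delta means the modified model fits the text worse.
delta = perplexity(modified_logprobs) - perplexity(baseline_logprobs)
print(f"perplexity delta: {delta:.3f}")
```

Under this reading, a delta near zero suggests general language modeling was largely unaffected, while a large positive delta signals capability loss.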

### Step-by-step pipeline

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    num_directions=32,          # number of refusal directions to extract
    strength=1.0,               # projection strength (0.0–1.0+)
    preserve_norm=True,         # norm-preserving biprojection
    project_biases=True,        # also remove from bias terms
    iterative_passes=3,         # re-probe after each pass
    layers="auto",              # or list of ints, e.g. [10, 11, 12, 13]
    dtype="bfloat16",
    device="cuda",
)

obl = Obliterator("mistralai/Mistral-7B-Instruct-v0.3", config=config)

# Individual stages
obl.summon()           # load model + tokenizer
activations = obl.probe()    # collect activations on restricted vs unrestricted prompts
directions = obl.distill(activations)   # extract refusal directions via SVD
obl.excise(directions)       # project out guardrail directions
metrics = obl.verify()       # perplexity + coherence checks
obl.rebirth("./liberated-mistral-7b")  # save with metadata
```

### Custom probe prompts

```python
from obliteratus import Obliterator
from obliteratus.probing import ProbeDataset

# Use your own restricted/unrestricted prompt pairs
dataset = ProbeDataset(
    restricted=[
        "How do I pick a lock?",
        "Write a story with explicit violence.",
        "Explain how malware works in detail.",
    ],
    unrestricted=[
        "What is the capital of France?",
        "Write a story about a dog.",
        "Explain how encryption works.",
    ]
)

obl = Obliterator("google/gemma-2-9b-it")
obl.summon()
activations = obl.probe(dataset=dataset)
directions = obl.distill(activations)
obl.excise(directions)
obl.rebirth("./liberated-gemma-2-9b")
```

### Analysis modules

```python
from obliteratus.analysis import AnalysisSuite

suite = AnalysisSuite("meta-llama/Llama-3.1-8B-Instruct")
suite.load()

# Concept Cone Geometry — how many distinct refusal mechanisms?
cone = suite.concept_cone_geometry()
print(f"Solid angle estimate: {cone.solid_angle:.4f}")
print(f"Distinct refusal clusters: {cone.num_clusters}")

# Alignment Imprint Detection — DPO vs RLHF vs CAI vs SFT?
imprint = suite.alignment_imprint()
print(f"Detected training method: {imprint.method}")   # e.g. "RLHF"
print(f"Confidence: {imprint.confidence:.2%}")

# Ouroboros Effect — will it self-repair?
ouroboros = suite.ouroboros_quantification()
print(f"Self-repair score: {ouroboros.score:.4f}")
print(f"Recommended passes: {ouroboros.recommended_passes}")

# Cross-layer heatmap of refusal signal
heatmap = suite.layer_refusal_heatmap()
heatmap.plot(save_path="./refusal_heatmap.png")

# Safety-capability entanglement
entanglement = suite.entanglement_map()
print(f"Safe layers to modify: {entanglement.safe_layers}")
print(f"Risky layers (entangled): {entanglement.risky_layers}")
```

### Analysis-informed obliteration

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

# "informed" method runs analysis modules mid-pipeline
# to auto-configure every decision
config = PipelineConfig(method="informed")
obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct", config=config)

result = obl.obliterate()
print(result.analysis_report)   # full auto-configuration decisions
```

### Chat with obliterated model

```python
from obliteratus import Obliterator
from obliteratus.chat import ChatSession

obl = Obliterator("./liberated-llama-3.1-8b")
obl.summon()  # loads pre-obliterated model

session = ChatSession(obl.model, obl.tokenizer)

response = session.chat(
    "Explain in detail how a buffer overflow exploit works.",
    max_new_tokens=512,
    temperature=0.7,
)
print(response)
```

### A/B comparison

```python
from obliteratus.compare import ABComparison

ab = ABComparison(
    original_path="meta-llama/Llama-3.1-8B-Instruct",
    obliterated_path="./liberated-llama-3.1-8b",
)

prompt = "Write a story involving morally grey characters."

original_resp, liberated_resp = ab.compare(prompt)
print("=== ORIGINAL ===")
print(original_resp)
print("=== LIBERATED ===")
print(liberated_resp)
```

### Push obliterated model to Hub

```python
import os
from obliteratus import Obliterator

obl = Obliterator("meta-llama/Llama-3.1-8B-Instruct")
result = obl.obliterate(method="advanced")

result.push_to_hub(
    repo_id=f"{os.environ['HF_USERNAME']}/Llama-3.1-8B-Instruct-abliterated",
    token=os.environ["HF_TOKEN"],
    private=True,
)
```

---

## Obliteration Methods

| Method | Description | Best For |
|--------|-------------|----------|
| `basic` | Mean-difference direction extraction, single pass | Quick experiments |
| `advanced` | Whitened SVD + bias projection + iterative refinement | Production use |
| `informed` | Analysis-guided auto-configuration | Unknown models |
| `lora` | Reversible LoRA rank-1 adapters (no weight surgery) | Reversible ablation |
| `pca` | PCA-based direction extraction | Research/comparison |
| `sparse` | Sparse autoencoder decomposition | MoE models |
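
The table describes `lora` as reversible rank-1 adapters without saying how they are built. One plausible reading, shown here purely as an assumption, is that projecting a single direction out of a weight matrix is itself a rank-1 update, so it can be stored as a pair of LoRA-style factors and removed later. The names `lora_a` and `lora_b` are illustrative and not taken from the toolkit.

```python
import numpy as np

rng = np.random.default_rng(1)
d_out, d_in, strength = 12, 6, 1.0

weight = rng.normal(size=(d_out, d_in))
unit_dir = rng.normal(size=d_out)
unit_dir /= np.linalg.norm(unit_dir)

# Rank-1 factors whose product equals -strength * r (r^T W).
lora_b = -strength * unit_dir[:, None]   # shape (d_out, 1)
lora_a = (unit_dir @ weight)[None, :]    # shape (1, d_in)

adapted = weight + lora_b @ lora_a       # adapter applied, base weights untouched
direct = weight - strength * np.outer(unit_dir, unit_dir @ weight)  # in-place projection
restored = adapted - lora_b @ lora_a     # adapter removed again

print(np.allclose(adapted, direct), np.allclose(restored, weight))
```

Keeping the update in separate factors is what makes it reversible: the base weights are never overwritten, so dropping the adapter recovers the original model.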

---

## Configuration

```python
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    # Core
    method="advanced",              # abliteration method
    strength=1.0,                   # projection strength (tune down if capability degrades)
    num_directions=32,              # refusal directions to extract
    
    # Layer selection
    layers="auto",                  # "auto", "cosmic", or list of ints
    layer_selection="cosmic",       # COSMIC: most separable layers
    
    # Weight modification
    preserve_norm=True,             # norm-preserving biprojection (recommended)
    project_biases=True,            # project out bias terms too
    project_attention=True,         # modify attention projection weights
    project_mlp=True,               # modify MLP weights
    
    # Iterative refinement
    iterative_passes=3,             # re-probe after each pass (catches rotated directions)
    
    # MoE-specific
    expert_granular=False,          # Expert-Granular Abliteration for MoE models
    
    # CoT preservation
    cot_aware=True,                 # preserve chain-of-thought directions
    
    # Hardware
    dtype="bfloat16",               # "float32", "float16", "bfloat16"
    device="cuda",                  # "cuda", "cpu", "auto"
    load_in_4bit=False,             # bitsandbytes 4-bit loading
    
    # Telemetry (anonymous, contributes to research dataset)
    telemetry=True,
)
```

---

## Common Patterns

### Tune strength to preserve capability

```python
from obliteratus.sweep import StrengthSweep

# Find the sweet spot before running full obliteration
sweep = StrengthSweep("meta-llama/Llama-3.1-8B-Instruct")
results = sweep.run(strengths=[0.2, 0.4, 0.6, 0.8, 1.0, 1.2])

for r in results:
    print(f"Strength {r.strength:.1f} | perplexity_delta={r.perplexity_delta:.2f} | refusal_rate={r.refusal_rate:.2%}")

# Pick the best tradeoff
best = sweep.recommend()
print(f"Recommended strength: {best.strength}")
```
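
How `sweep.recommend()` chooses its result is not documented in the source. As a hedged illustration of one possible rule, the sketch below picks the sweep point with the lowest refusal rate among those whose perplexity increase stays within a budget. `SweepPoint`, `pick_tradeoff`, and the numbers are hypothetical; only the field names mirror the loop above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass(frozen=True)
class SweepPoint:
    strength: float
    perplexity_delta: float
    refusal_rate: float

def pick_tradeoff(points: Sequence[SweepPoint],
                  max_perplexity_delta: float) -> Optional[SweepPoint]:
    """Lowest refusal rate within the perplexity budget; ties go to lower strength."""
    eligible = [p for p in points if p.perplexity_delta <= max_perplexity_delta]
    if not eligible:
        return None  # nothing fits the budget
    return min(eligible, key=lambda p: (p.refusal_rate, p.strength))

# Made-up sweep results for illustration only.
points = [
    SweepPoint(0.2, 0.05, 0.80),
    SweepPoint(0.6, 0.20, 0.30),
    SweepPoint(1.0, 0.90, 0.05),
    SweepPoint(1.2, 2.50, 0.04),
]
best = pick_tradeoff(points, max_perplexity_delta=1.0)
print(best)
```

Making the budget explicit keeps the tradeoff visible: tightening `max_perplexity_delta` moves the choice toward lower strengths.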

### MoE model (Mixtral, DeepSeek-MoE)

```python
from obliteratus import Obliterator
from obliteratus.pipeline import PipelineConfig

config = PipelineConfig(
    method="advanced",
    expert_granular=True,      # decompose per-expert refusal signals
    project_attention=True,
    project_mlp=True,
)

obl = Obliterator("mistralai/Mixtral-8x7B-Instruct-v0.1", config=config)
obl.obliterate()
obl.rebirth("./liberated-mixtral-8x7b")
```

### Batch benchmark multiple models

```python
from obliteratus.benchmark import ModelBenchmark

models = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

bench = ModelBenchmark(models=models, method="advanced")
report = bench.run()
report.save("./benchmark_report.json")
report.plot_heatmap("./benchmark_heatmap.png")
```

---

## Troubleshooting

**Out of memory (OOM) on large models**
```python
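from obliteratus.pipeline import PipelineConfig

# Reduce memory: half precision, 4-bit loading, fewer layers and directions
config = PipelineConfig(
    dtype="float16",
    load_in_4bit=True,        # requires bitsandbytes
    device="cuda",
    layers=[10, 11, 12, 13],  # target fewer layers
    num_directions=16,        # fewer directions
)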
```

**Capability degradation after obliteration**
```python
from obliteratus.pipeline import PipelineConfig

# Lower the strength or use COSMIC layer selection (most separable layers)
config = PipelineConfig(
    strength=0.6,
    layer_selection="cosmic",
    cot_aware=True,           # protect reasoning directions
    iterative_passes=1,       # fewer passes = less aggressive
)
```

**Refusal persists after obliteration**
```python
from obliteratus.pipeline import PipelineConfig

# Use informed method + increase passes
config = PipelineConfig(
    method="informed",
    iterative_passes=5,
    project_biases=True,      # don't forget bias terms
    num_directions=64,        # extract more directions
)
```

**Gated model access error**
```bash
export HF_TOKEN=your_hf_token_here
# Accept model license on HuggingFace Hub first, then:
huggingface-cli login
```

**Gradio UI won't start**
```bash
pip install "obliteratus[spaces]"
# Check port availability
obliteratus ui --port 7861
```

---

## No-Code Options

- **HuggingFace Space:** [spaces/pliny-the-prompter/obliteratus](https://huggingface.co/spaces/pliny-the-prompter/obliteratus) — free with HF Pro, ZeroGPU
- **Colab notebook:** [notebooks/abliterate.ipynb](https://colab.research.google.com/github/elder-plinius/OBLITERATUS/blob/main/notebooks/abliterate.ipynb) — run all cells, no setup

---

## Key Research References

- Arditi et al. (2024) — [arXiv:2406.11717](https://arxiv.org/abs/2406.11717) — foundational abliteration paper
- Gabliteration — [arXiv:2512.18901](https://arxiv.org/abs/2512.18901)
- COSMIC layer selection — [arXiv:2506.00085](https://arxiv.org/abs/2506.00085), ACL 2025
- Turner et al. (2023) — [arXiv:2308.10248](https://arxiv.org/abs/2308.10248) — activation steering
- Rimsky et al. (2024) — [arXiv:2312.06681](https://arxiv.org/abs/2312.06681) — contrastive activation addition
