nowait-reasoning-optimizer
Implements the NOWAIT technique for efficient reasoning in R1-style LLMs. Use when optimizing inference of reasoning models (QwQ, DeepSeek-R1, Phi4-Reasoning, Qwen3, Kimi-VL, QvQ), reducing chain-of-thought token usage by 27-51% while preserving accuracy. Triggers on "optimize reasoning", "reduce thinking tokens", "efficient inference", "suppress reflection tokens", or when working with verbose CoT outputs.
Best use case
nowait-reasoning-optimizer is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Implements the NOWAIT technique for efficient reasoning in R1-style LLMs. Use when optimizing inference of reasoning models (QwQ, DeepSeek-R1, Phi4-Reasoning, Qwen3, Kimi-VL, QvQ), reducing chain-of-thought token usage by 27-51% while preserving accuracy. Triggers on "optimize reasoning", "reduce thinking tokens", "efficient inference", "suppress reflection tokens", or when working with verbose CoT outputs.
Teams using nowait-reasoning-optimizer should expect a more consistent output, faster repeated execution, less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in
.claude/skills/nowait/SKILL.mdinside your project - Restart your AI agent — it will auto-discover the skill
How nowait-reasoning-optimizer Compares
| Feature / Agent | nowait-reasoning-optimizer | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Implements the NOWAIT technique for efficient reasoning in R1-style LLMs. Use when optimizing inference of reasoning models (QwQ, DeepSeek-R1, Phi4-Reasoning, Qwen3, Kimi-VL, QvQ), reducing chain-of-thought token usage by 27-51% while preserving accuracy. Triggers on "optimize reasoning", "reduce thinking tokens", "efficient inference", "suppress reflection tokens", or when working with verbose CoT outputs.
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
Related Guides
AI Agents for Coding
Browse AI agent skills for coding, debugging, testing, refactoring, code review, and developer workflows across Claude, Cursor, and Codex.
Best AI Skills for Claude
Explore the best AI skills for Claude and Claude Code across coding, research, workflow automation, documentation, and agent operations.
ChatGPT vs Claude for Agent Skills
Compare ChatGPT and Claude for AI agent skills across coding, writing, research, and reusable workflow execution.
SKILL.md Source
# NOWAIT Reasoning Optimizer
Implements the NOWAIT technique from the paper "Wait, We Don't Need to 'Wait'! Removing Thinking Tokens Improves Reasoning Efficiency" (Wang et al., 2025).
## Overview
NOWAIT is a training-free inference-time intervention that suppresses self-reflection tokens (e.g., "Wait", "Hmm", "Alternatively") during generation, reducing chain-of-thought (CoT) trajectory length by **27-51%** without compromising model utility.
## When to Use
- Deploying R1-style reasoning models with limited compute
- Reducing inference latency for production systems
- Optimizing token costs for reasoning tasks
- Working with verbose CoT outputs that need streamlining
## Supported Models
| Model Series | Type | Token Reduction |
|--------------|------|-----------------|
| QwQ-32B | RL-based | 16-31% |
| Phi4-Reasoning-Plus | RL-based | 23-28% |
| Qwen3-32B | RL-based | 13-16% |
| Kimi-VL-A3B | Multimodal | 40-60% |
| QvQ-72B-Preview | Multimodal | 20-30% |
**Important**: NOWAIT works best with RL-based models. Distilled models (Qwen3-4B/8B/14B) show degraded performance when reflection tokens are suppressed.
## Quick Start
### 1. Basic Implementation
```python
from scripts.nowait_processor import NOWAITLogitProcessor
# Initialize processor for your model's tokenizer
processor = NOWAITLogitProcessor(tokenizer)
# Use during generation
outputs = model.generate(
inputs,
logits_processor=[processor],
max_new_tokens=32768
)
```
### 2. Keywords Suppressed
See `references/keywords.md` for the complete list. Core keywords:
```
wait, alternatively, hmm, but, however, check,
double-check, maybe, verify, again, oh, ah
```
## How It Works
1. **Initialize Keywords**: Identify reflection keywords from empirical analysis
2. **Expand to Token Variants**: Map keywords to all token variants in vocabulary (e.g., "wait" → " wait", "Wait", " Wait", ".wait", "WAIT")
3. **Suppress During Inference**: Set logits of reflection tokens to large negative values during decoding
```
Logits (Before) Logits (After)
Wait 0.8 → Wait -inf
First 0.6 → First 0.6
Hmm 0.5 → Hmm -inf
Let 0.4 → Let 0.4
```
## Key Findings
### Why It Works
- NOWAIT doesn't eliminate self-reflection entirely—it guides models to skip **unnecessary** "waiting" reasoning
- Models still perform essential verification at key decision points
- Results in more linear, straightforward reasoning paths
### RL vs Distilled Models
| Model Type | NOWAIT Effect | Recommendation |
|------------|---------------|----------------|
| RL-based (QwQ, Phi4, Qwen3-32B) | Stable accuracy, significant token reduction | ✅ Recommended |
| Distilled (Qwen3-4B/8B/14B) | Accuracy degradation on hard tasks | ⚠️ Use with caution |
Distilled models rely heavily on CoT structure from training data—removing reflection tokens disrupts their reasoning patterns.
## Integration Examples
### HuggingFace Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from scripts.nowait_processor import NOWAITLogitProcessor
model = AutoModelForCausalLM.from_pretrained("Qwen/QwQ-32B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
processor = NOWAITLogitProcessor(tokenizer)
response = model.generate(
tokenizer(prompt, return_tensors="pt").input_ids,
logits_processor=[processor],
max_new_tokens=32768,
do_sample=True,
temperature=0.7
)
```
### vLLM
```python
from vllm import LLM, SamplingParams
from scripts.nowait_processor import get_nowait_bad_words_ids
llm = LLM(model="Qwen/QwQ-32B")
bad_words_ids = get_nowait_bad_words_ids(llm.get_tokenizer())
sampling_params = SamplingParams(
max_tokens=32768,
bad_words_ids=bad_words_ids
)
```
## Expected Results
| Task Type | Original Tokens | NOWAIT Tokens | Reduction |
|-----------|-----------------|---------------|-----------|
| Math (AIME) | 15,000 | 10,500 | 30% |
| Visual QA (MMMU) | 2,900 | 1,450 | 50% |
| Video QA (MMVU) | 1,700 | 1,250 | 27% |
## Limitations
- Less effective on very simple problems where CoT overhead is already minimal
- Distilled models may suffer accuracy loss on challenging tasks
- Some domains may require model-specific keyword tuning
## References
- Paper: arXiv:2506.08343v2
- Complete keyword list: `references/keywords.md`
- Implementation: `scripts/nowait_processor.py`Related Skills
performance-optimizer
Identifies and fixes performance bottlenecks in code, databases, and APIs. Measures before and after to prove improvements.
database-optimizer
Expert database optimizer specializing in modern performance tuning, query optimization, and scalable architectures.
SEO Optimizer
Search Engine Optimization specialist for content strategy, technical SEO, keyword research, and ranking improvements. Use when optimizing website content, improving search rankings, conducting keyword analysis, or implementing SEO best practices. Expert in on-page SEO, meta tags, schema markup, and Core Web Vitals.
async-python-patterns
Comprehensive guidance for implementing asynchronous Python applications using asyncio, concurrent programming patterns, and async/await for building high-performance, non-blocking systems.
slack-automation
Automate Slack workspace operations including messaging, search, channel management, and reaction workflows through Composio's Slack toolkit.
linear-automation
Automate Linear tasks via Rube MCP (Composio): issues, projects, cycles, teams, labels. Always search tools first for current schemas.
jira-automation
Automate Jira tasks via Rube MCP (Composio): issues, projects, sprints, boards, comments, users. Always search tools first for current schemas.
gitops-workflow
Complete guide to implementing GitOps workflows with ArgoCD and Flux for automated Kubernetes deployments.
github-automation
Automate GitHub repositories, issues, pull requests, branches, CI/CD, and permissions via Rube MCP (Composio). Manage code workflows, review PRs, search code, and handle deployments programmatically.
github-actions-templates
Production-ready GitHub Actions workflow patterns for testing, building, and deploying applications.
zustand-store-ts
Create Zustand stores following established patterns with proper TypeScript types and middleware.
zod-validation-expert
Expert in Zod — TypeScript-first schema validation. Covers parsing, custom errors, refinements, type inference, and integration with React Hook Form, Next.js, and tRPC.