reinforcement-learning-guide
Reinforcement learning fundamentals, algorithms, and research
Best use case
reinforcement-learning-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.
Teams using reinforcement-learning-guide should expect more consistent output, faster repeated execution, and less prompt rewriting.
When to use this skill
- You want a reusable workflow that can be run more than once with consistent structure.
When not to use this skill
- You only need a quick one-off answer and do not need a reusable workflow.
- You cannot install or maintain the underlying files, dependencies, or repository context.
Installation
Claude Code / Cursor / Codex
Manual Installation
- Download SKILL.md from GitHub
- Place it in .claude/skills/reinforcement-learning-guide/SKILL.md inside your project
- Restart your AI agent; it will auto-discover the skill
How reinforcement-learning-guide Compares
| Feature / Agent | reinforcement-learning-guide | Standard Approach |
|---|---|---|
| Platform Support | Not specified | Limited / Varies |
| Context Awareness | High | Baseline |
| Installation Complexity | Unknown | N/A |
Frequently Asked Questions
What does this skill do?
Reinforcement learning fundamentals, algorithms, and research
Where can I find the source code?
You can find the source code on GitHub using the link provided at the top of the page.
SKILL.md Source
# Reinforcement Learning Guide
Understand and implement reinforcement learning algorithms from tabular methods through deep RL, including policy gradients, actor-critic, and model-based approaches.
## RL Fundamentals
### The RL Framework
An agent interacts with an environment to maximize cumulative reward:
```
Agent                        Environment
  |                               |
  |--- action a_t --------------->|
  |                               |--- next state s_{t+1}
  |<-- reward r_t, state s_t -----|--- reward r_{t+1}
  |                               |
```
| Concept | Symbol | Definition |
|---------|--------|-----------|
| State | s | Observation of the environment |
| Action | a | Decision made by the agent |
| Reward | r | Scalar feedback signal |
| Policy | pi(a\|s) | Probability of choosing action a in state s |
| Value function | V(s) | Expected return from state s when following the policy |
| Q-function | Q(s, a) | Expected return after taking action a in state s |
| Discount factor | gamma | Weight of future vs. immediate rewards (0-1) |
| Return | G_t | Sum of discounted future rewards from time t |
### Key Equations
```
# Return (discounted cumulative reward)
G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
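# Bellman expectation equation for V (actions drawn from policy pi)
V(s) = E[r + gamma * V(s') | s]
# Bellman optimality equation for Q (greedy max over next actions)
Q(s, a) = E[r + gamma * max_a' Q(s', a') | s, a]
# Policy gradient theorem (expectation over trajectories from pi_theta; Q is the Q-function of pi_theta)
gradient J(theta) = E[gradient log pi_theta(a|s) * Q(s, a)]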
```
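The return definition above can be computed for a finite episode by working backwards with the recursion G_t = r_t + gamma * G_{t+1}. The following is a minimal sketch; the function name `discounted_returns` and the sample rewards are illustrative and not part of the original guide.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, G_1, ..., G_{T-1}] for one finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using the recursion G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 0.0, 2.0], gamma=0.5))  # [1.5, 1.0, 2.0]
```

The backward pass costs O(T) per episode, whereas evaluating the sum separately for every t costs O(T^2).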
## Algorithm Taxonomy
| Category | Algorithm | Key Idea | On/Off Policy |
|----------|-----------|----------|--------------|
| **Value-based** | Q-Learning | Learn Q(s,a), act greedily | Off-policy |
| | DQN | Q-Learning + neural net + replay buffer | Off-policy |
| | Double DQN | Two networks to reduce overestimation | Off-policy |
| | Dueling DQN | Separate value and advantage streams | Off-policy |
| **Policy gradient** | REINFORCE | Monte Carlo policy gradient | On-policy |
| | PPO | Clipped surrogate objective | On-policy |
| | TRPO | Trust region constraint | On-policy |
| **Actor-Critic** | A2C/A3C | Advantage actor-critic (parallel) | On-policy |
| | SAC | Maximum entropy + off-policy AC | Off-policy |
| | TD3 | Twin delayed DDPG | Off-policy |
| **Model-based** | Dreamer | World model + imagination | Off-policy data (replay buffer); actor-critic trained on imagined rollouts |
| | MBPO | Model-based policy optimization | Off-policy |
| | MuZero | Learned model + planning (MCTS) | Off-policy |
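To make the "Off-policy" label on Q-Learning concrete, here is a small tabular sketch on a hypothetical five-state chain (reward 1 for reaching the rightmost state). The environment, the function name `q_learning_chain`, and all hyperparameters are assumptions chosen for illustration. The agent behaves uniformly at random, yet the update bootstraps from the greedy action, so the learned Q-table still describes the greedy policy.

```python
import random


def q_learning_chain(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a deterministic chain; the last state is terminal."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[s][a] with a = 0 (left) or 1 (right); the terminal row stays at zero.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            a = rng.randrange(2)  # uniformly random behaviour policy
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Off-policy target: bootstrap from the greedy action in s_next.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


q = q_learning_chain()
greedy = [max((0, 1), key=lambda a: q[s][a]) for s in range(4)]
print(greedy)  # [1, 1, 1, 1]: always move right, toward the goal
```

An on-policy method such as SARSA would instead bootstrap from the action the behaviour policy actually takes next, so its estimates would reflect the random policy rather than the greedy one.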
## Implementation: DQN
```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


class QNetwork(nn.Module):
    """MLP that maps a state vector to one Q-value per discrete action."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
        )

    def forward(self, x):
        return self.net(x)


class DQNAgent:
    def __init__(self, state_dim, action_dim, lr=1e-3, gamma=0.99,
                 epsilon=1.0, epsilon_decay=0.995, epsilon_min=0.01,
                 buffer_size=10000, batch_size=64):
        self.action_dim = action_dim
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.batch_size = batch_size

        # Online network is trained; target network is a periodically synced copy
        self.q_network = QNetwork(state_dim, action_dim)
        self.target_network = QNetwork(state_dim, action_dim)
        self.target_network.load_state_dict(self.q_network.state_dict())
        self.optimizer = optim.Adam(self.q_network.parameters(), lr=lr)
        self.replay_buffer = deque(maxlen=buffer_size)

    def select_action(self, state):
        # Epsilon-greedy: explore with probability epsilon, otherwise act greedily
        if random.random() < self.epsilon:
            return random.randint(0, self.action_dim - 1)
        with torch.no_grad():
            q_values = self.q_network(torch.as_tensor(state, dtype=torch.float32))
        return q_values.argmax().item()

    def store_transition(self, state, action, reward, next_state, done):
        self.replay_buffer.append((state, action, reward, next_state, done))

    def train_step(self):
        # Wait until the buffer holds at least one full batch
        if len(self.replay_buffer) < self.batch_size:
            return 0.0

        batch = random.sample(self.replay_buffer, self.batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        states = torch.as_tensor(np.array(states), dtype=torch.float32)
        actions = torch.as_tensor(actions, dtype=torch.int64)
        rewards = torch.as_tensor(rewards, dtype=torch.float32)
        next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
        dones = torch.as_tensor(dones, dtype=torch.float32)

        # Current Q values for the actions that were taken
        # (squeeze(1) keeps the batch dimension even when batch_size == 1)
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)

        # Target Q values (Double DQN variant): the online network selects the
        # next action, the target network evaluates it
        with torch.no_grad():
            best_actions = self.q_network(next_states).argmax(1)
            next_q = self.target_network(next_states).gather(
                1, best_actions.unsqueeze(1)).squeeze(1)
            targets = rewards + self.gamma * next_q * (1 - dones)

        loss = nn.MSELoss()(q_values, targets)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

        # Epsilon decays once per gradient step, not once per episode
        self.epsilon = max(self.epsilon_min, self.epsilon * self.epsilon_decay)
        return loss.item()

    def update_target(self):
        # Hard update: copy the online weights into the target network
        self.target_network.load_state_dict(self.q_network.state_dict())
```
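The `train_step` above uses the Double DQN target rather than the vanilla DQN target. The sketch below contrasts the two on a single transition using plain Python lists; the function names and the Q-values are hypothetical and chosen only to make the difference visible.

```python
def dqn_target(reward, gamma, done, target_q_next):
    """Vanilla DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * max(target_q_next) * (1.0 - done)


def double_dqn_target(reward, gamma, done, online_q_next, target_q_next):
    """Double DQN: the online network selects, the target network evaluates."""
    best_action = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return reward + gamma * target_q_next[best_action] * (1.0 - done)


# Hypothetical Q-values for one next state with two actions.
online_q_next = [1.0, 2.0]   # online network prefers action 1
target_q_next = [3.0, 0.5]   # target network has an inflated estimate for action 0

print(dqn_target(1.0, 0.5, 0.0, target_q_next))                        # 2.5
print(double_dqn_target(1.0, 0.5, 0.0, online_q_next, target_q_next))  # 1.25
```

Because the maximum over noisy estimates is biased upward, decoupling selection from evaluation tends to give lower, less overestimated targets. When `done` is 1.0, both targets reduce to the immediate reward.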
## Implementation: PPO
```python
import torch
import torch.nn as nn
import torch.optim as optim


class PPOAgent:
    """PPO with a clipped surrogate objective for discrete action spaces."""

    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 lam=0.95, clip_ratio=0.2, epochs=10):
        self.gamma = gamma
        self.lam = lam
        self.clip_ratio = clip_ratio
        self.epochs = epochs

        # Actor outputs action probabilities; critic outputs a scalar state value
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, action_dim), nn.Softmax(dim=-1),
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(),
            nn.Linear(64, 64), nn.Tanh(),
            nn.Linear(64, 1),
        )
        self.optimizer = optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()), lr=lr
        )

    def compute_gae(self, rewards, values, dones):
        """Generalized Advantage Estimation."""
        advantages = []
        gae = 0.0
        for t in reversed(range(len(rewards))):
            # Bootstrap with 0 past the end of the rollout
            next_value = values[t + 1] if t + 1 < len(values) else 0.0
            not_done = 1.0 - float(dones[t])
            delta = float(rewards[t]) + self.gamma * next_value * not_done - values[t]
            gae = delta + self.gamma * self.lam * not_done * gae
            advantages.append(gae)
        advantages.reverse()
        return torch.tensor(advantages, dtype=torch.float32)

    def update(self, states, actions, old_log_probs, rewards, dones):
        # states: float tensor [T, state_dim]; actions: long tensor [T];
        # old_log_probs: detached float tensor [T]; rewards, dones: length-T sequences
        with torch.no_grad():
            values = self.critic(states).squeeze(-1).tolist()
        advantages = self.compute_gae(rewards, values, dones)
        returns = advantages + torch.tensor(values, dtype=torch.float32)
        # Normalize advantages to stabilize the policy update
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            probs = self.actor(states)
            dist = torch.distributions.Categorical(probs)
            new_log_probs = dist.log_prob(actions)
            entropy = dist.entropy().mean()

            # Probability ratio between the new and old policy
            ratio = (new_log_probs - old_log_probs).exp()
            clipped = torch.clamp(ratio, 1 - self.clip_ratio, 1 + self.clip_ratio)
            actor_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
            critic_loss = nn.MSELoss()(self.critic(states).squeeze(-1), returns)

            # Value loss weighted by 0.5; entropy bonus weighted by 0.01
            loss = actor_loss + 0.5 * critic_loss - 0.01 * entropy
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()
```
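The `lam` parameter in `compute_gae` interpolates between two familiar estimators. The framework-free sketch below mirrors the same recursion with plain floats to show the limiting cases. The standalone function `gae`, its `last_value` bootstrap argument, and the sample numbers are assumptions added for illustration; the class above always bootstraps with 0.

```python
def gae(rewards, values, dones, gamma=0.99, lam=0.95, last_value=0.0):
    """Generalized Advantage Estimation over one rollout, using plain floats."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else last_value
        not_done = 1.0 - dones[t]
        # One-step TD error, then the exponentially weighted sum of TD errors.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.4, 0.3]
dones = [0.0, 0.0, 1.0]

# lam=0 reduces to the one-step TD error; lam=1 reduces to (return - value).
print(gae(rewards, values, dones, gamma=0.9, lam=0.0))
print(gae(rewards, values, dones, gamma=0.9, lam=1.0))
```

Lower `lam` leans on the critic (lower variance, more bias), while higher `lam` leans on observed rewards (higher variance, less bias).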
## Research Environments
| Environment | Domain | Complexity | Key Paper |
|-------------|--------|-----------|-----------|
| Gymnasium (ex-Gym) | Classic control, Atari | Low-High | Brockman et al., 2016 |
| MuJoCo | Continuous control, robotics | Medium-High | Todorov et al., 2012 |
| DMControl | Continuous control from pixels | High | Tassa et al., 2018 |
| ProcGen | Procedurally generated games | High (generalization) | Cobbe et al., 2020 |
| Minigrid | Grid-world navigation | Low-Medium | Chevalier-Boisvert et al. |
| Isaac Gym | GPU-accelerated physics sim | High | Makoviychuk et al., 2021 |
| NetHack | Complex roguelike game | Very High | Küttler et al., 2020 |
## Top Venues
| Venue | Type | Focus |
|-------|------|-------|
| NeurIPS | Conference | Broad ML including RL |
| ICML | Conference | Broad ML including RL |
| ICLR | Conference | Representation learning, deep RL |
| AAAI | Conference | Broad AI |
| CoRL | Conference | Robot learning |
| JMLR | Journal | Broad ML (open access) |
| L4DC | Conference | Learning for dynamics and control |
## Key Research Directions (2024-2025)
1. **RLHF / RLAIF**: RL from human or AI feedback for LLM alignment
2. **Offline RL**: Learning from pre-collected datasets without environment interaction
3. **Foundation models for control**: Using pre-trained LLMs/VLMs as world models or planners
4. **Multi-agent RL**: Cooperative and competitive settings with communication
5. **Safe RL**: Constrained optimization to ensure safety during training and deployment
6. **Sample-efficient RL**: Reducing the gap between model-free and model-based sample complexity