code-llm-papers-guide

Survey and paper collection on LLMs for code generation

191 stars

Best use case

code-llm-papers-guide is best used when you need a repeatable AI agent workflow instead of a one-off prompt.

Teams using code-llm-papers-guide should expect more consistent output, faster repeated execution, and less prompt rewriting.

When to use this skill

You want a reusable workflow that can be run more than once with consistent structure.

When not to use this skill

You only need a quick one-off answer and do not need a reusable workflow.
You cannot install or maintain the underlying files, dependencies, or repository context.

Installation

Claude Code / Cursor / Codex

$ curl -o ~/.claude/skills/code-llm-papers-guide/SKILL.md --create-dirs "https://raw.githubusercontent.com/wentorai/research-plugins/main/skills/domains/cs/code-llm-papers-guide/SKILL.md"

Manual Installation

Download SKILL.md from GitHub
Place it in .claude/skills/code-llm-papers-guide/SKILL.md inside your project
Restart your AI agent — it will auto-discover the skill

How code-llm-papers-guide Compares

Feature / Agent	code-llm-papers-guide	Standard Approach
Platform Support	Not specified	Limited / Varies
Context Awareness	High	Baseline
Installation Complexity	Unknown	N/A

Frequently Asked Questions

What does this skill do?

Survey and paper collection on LLMs for code generation

Where can I find the source code?

The source is hosted on GitHub in the wentorai/research-plugins repository; the full URL to SKILL.md appears in the Installation section.

SKILL.md Source

# Code LLM Papers Guide

## Overview

This curated collection covers LLMs for code, from foundational models (Codex, CodeGen, StarCoder) through code generation, completion, repair, translation, and understanding. It accompanies a TMLR survey paper that provides a systematic categorization, and it tracks 500+ papers across pre-training, fine-tuning, evaluation, and application of code-focused language models.

## Taxonomy

```
Code LLMs
├── Pre-training
│   ├── Encoder-only (CodeBERT, GraphCodeBERT)
│   ├── Decoder-only (Codex, CodeGen, StarCoder, DeepSeek-Coder)
│   └── Encoder-Decoder (CodeT5, PLBART)
├── Fine-tuning & Alignment
│   ├── Instruction tuning (WizardCoder, Magicoder)
│   ├── RL from execution feedback (CodeRL)
│   └── Competition fine-tuning with large-scale sampling (AlphaCode)
├── Applications
│   ├── Code generation (NL → Code)
│   ├── Code completion (infilling)
│   ├── Code repair (bug fixing)
│   ├── Code translation (language conversion)
│   ├── Code summarization (Code → NL)
│   ├── Test generation
│   └── Code review
└── Evaluation
    ├── Benchmarks (HumanEval, MBPP, SWE-bench)
    ├── Metrics (pass@k, CodeBLEU)
    └── Security analysis
```
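
For scripted lookups, the tree above can be mirrored as a nested dictionary. The following is a minimal sketch, not something shipped with the collection: `TAXONOMY` and `find_path` are hypothetical names introduced here, and only two branches of the tree are reproduced to keep the example short.

```python
# Hypothetical, partial mirror of the taxonomy tree (illustration only).
TAXONOMY = {
    "Pre-training": {
        "Encoder-only": ["CodeBERT", "GraphCodeBERT"],
        "Decoder-only": ["Codex", "CodeGen", "StarCoder", "DeepSeek-Coder"],
        "Encoder-Decoder": ["CodeT5", "PLBART"],
    },
    "Evaluation": {
        "Benchmarks": ["HumanEval", "MBPP", "SWE-bench"],
        "Metrics": ["pass@k", "CodeBLEU"],
    },
}


def find_path(name, tree=TAXONOMY, prefix=()):
    """Return the branch path leading to `name` as a tuple, or None if absent."""
    for key, value in tree.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            # Recurse into sub-branches
            found = find_path(name, value, path)
            if found is not None:
                return found
        elif name in value:
            # Leaf list: the entry lives directly under this branch
            return path
    return None


print(find_path("StarCoder"))
print(find_path("CodeBLEU"))
```

A flat list of `(path, name)` pairs would work as well; the nested form is used here only because it matches the shape of the tree.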

## Key Models Timeline

| Model | Year | Organization | Parameters | Key Innovation |
|-------|------|-------------|------------|----------------|
| **CodeBERT** | 2020 | Microsoft | 125M | Bimodal NL-PL pre-training |
| **Codex** | 2021 | OpenAI | 12B | GPT-3 fine-tuned on GitHub |
| **AlphaCode** | 2022 | DeepMind | 41B | Competitive programming |
| **StarCoder** | 2023 | BigCode | 15B | Fill-in-the-middle, 1T tokens |
| **CodeLlama** | 2023 | Meta | 34B | Llama 2 + code specialization |
| **DeepSeek-Coder** | 2024 | DeepSeek | 33B | 2T token project-level training |
| **Qwen2.5-Coder** | 2024 | Alibaba | 32B | 5.5T tokens, multi-language |
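
The table credits StarCoder with fill-in-the-middle (FIM) training. As a rough illustration of what that means at inference time, the sketch below assembles an infilling prompt in prefix-suffix-middle order. The sentinel strings are the ones used by StarCoder's tokenizer; other models use different sentinels, so treat the defaults as an assumption to check against the model card. `build_fim_prompt` is a hypothetical helper, not part of any library.

```python
def build_fim_prompt(prefix, suffix,
                     pre="<fim_prefix>", suf="<fim_suffix>", mid="<fim_middle>"):
    """Assemble a prefix-suffix-middle (PSM) infilling prompt.

    The model is expected to generate the missing middle after the
    final sentinel. Sentinel defaults are assumed from StarCoder.
    """
    return f"{pre}{prefix}{suf}{suffix}{mid}"


# Code before and after the gap the model should fill
prompt = build_fim_prompt(
    prefix="def add(a, b):\n    return ",
    suffix="\n\nprint(add(1, 2))\n",
)
print(prompt)
```

The resulting string would be passed to the model as an ordinary completion prompt; whatever it generates is the candidate text for the gap.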

## Benchmark Tracking

```python
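# Track model performance on HumanEval.
# Scores are percentages drawn from different reports and evaluation setups;
# verify them against the original publications before citing.
humaneval_scores = {
    "GPT-4": {"pass_at_1": 67.0, "pass_at_10": 86.0},
    "Claude 3.5 Sonnet": {"pass_at_1": 92.0},  # as reported by Anthropic (0-shot)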
    "DeepSeek-Coder-33B": {"pass_at_1": 56.1},
    "CodeLlama-34B": {"pass_at_1": 48.8},
    "StarCoder2-15B": {"pass_at_1": 46.3},
    "GPT-3.5-Turbo": {"pass_at_1": 48.1},
}

print(f"{'Model':<25} {'pass@1':>8} {'pass@10':>8}")
print("-" * 43)
for model, scores in sorted(
    humaneval_scores.items(),
    key=lambda x: x[1].get("pass_at_1", 0),
    reverse=True,
):
    # Show "n/a" when a score was not reported for this model
    p1 = scores.get("pass_at_1", "n/a")
    p10 = scores.get("pass_at_10", "n/a")
    print(f"{model:<25} {str(p1):>8} {str(p10):>8}")
```
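
The pass@k values tracked above are usually computed with the unbiased estimator from the Codex paper: generate n samples per problem, count the c samples that pass the unit tests, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch for a single problem follows; the function name `pass_at_k` is chosen here for illustration, and a benchmark score would be the mean of this value over all problems.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples
    c: number of samples that pass the tests
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k includes a pass
        return 1.0
    # 1 minus the probability that all k drawn samples fail
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```

With k = 1 the estimate reduces to c / n, the plain fraction of passing samples.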

## Research Directions

```markdown
### Active Areas (2024-2025)
1. **Repository-level generation** — Understanding full codebases
2. **Agentic coding** — LLMs using tools (debugger, terminal)
3. **Formal verification** — Proving correctness of generated code
4. **Multi-language** — Cross-language transfer and translation
5. **Security** — Detecting and avoiding vulnerable code
6. **Long context** — Processing large codebases (100k+ tokens)
7. **Code editing** — Natural language instructions for code changes
```

## Paper Search

```python
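import arxiv  # third-party package: pip install arxiv

def find_code_llm_papers(topic="code generation", max_results=20):
    """Find recent Code LLM papers on arXiv."""
    # Quote multi-word phrases so each is matched as a phrase in the abstract
    query = f'abs:"{topic}" AND (abs:"large language model" OR abs:LLM)'

    search = arxiv.Search(
        query=query,
        max_results=max_results,
        sort_by=arxiv.SortCriterion.SubmittedDate,
    )

    # Search.results() is deprecated in recent arxiv releases; use a Client
    client = arxiv.Client()
    for result in client.results(search):
        print(f"[{result.published.strftime('%Y-%m-%d')}] "
              f"{result.title}")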

find_code_llm_papers("code generation")
find_code_llm_papers("automated program repair")
```

## Use Cases

1. **Literature survey**: Map the Code LLM research landscape
2. **Model selection**: Compare code models for specific tasks
3. **Benchmark analysis**: Track state-of-the-art on standard benchmarks
4. **Research planning**: Identify open problems and trends
5. **Course material**: Teach software engineering + AI intersection

## References

- [Awesome-Code-LLM](https://github.com/codefuse-ai/Awesome-Code-LLM)
- [TMLR Survey Paper](https://arxiv.org/abs/2311.07989)
- [HumanEval](https://github.com/openai/human-eval)
- [SWE-bench](https://www.swebench.com/)