DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

tags: #paper, #resource, #modern_ai/RL4LLM

multi-stage training + cold-start data before RL

employs GRPO (Group Relative Policy Optimization) as the RL algorithm, which drops the critic model
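
GRPO normalizes each sampled completion's reward against the other completions drawn for the same prompt, in place of a learned value baseline. A minimal sketch of that group-relative advantage, assuming a (num_prompts, group_size) reward tensor; the function name and `eps` are illustrative, not from the paper:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: each completion's reward normalized by the
    mean and std of its prompt's group, replacing a learned critic."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. one prompt, four sampled answers scored by the rule-based reward
print(grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]])))
```

These advantages then enter a PPO-style clipped objective with a KL penalty toward the reference policy.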

Key models:
  • V3-Base: the base model both variants start from
  • R1-Zero: pure RL on V3-Base, no SFT cold start
  • R1: cold-start SFT + multi-stage RL on V3-Base

DeepSeek-R1-Zero

R1-Zero naturally learns to solve increasingly complex reasoning tasks by leveraging extended test-time computation
img-20250804104434587

Emergence of sophisticated behaviors as the test-time computation increases

Reward: rule-based reward system with two components (toy sketch after this list)

  1. accuracy rewards
    • math problems with deterministic results
    • LeetCode problems with predefined test cases that a compiler can check
  2. format rewards
    • the reasoning process and the answer should be enclosed within <think> </think> and <answer> </answer> tags, respectively
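
A toy sketch of this rule-based reward, assuming an exact-match accuracy check and made-up reward magnitudes (0.5 for format, 1.0 for accuracy); the paper does not give the exact weighting:

```python
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    """Format reward: reasoning and answer wrapped in <think>/<answer> tags.
    Accuracy reward: extracted answer matches the known result (for math;
    code tasks would instead run predefined test cases)."""
    reward = 0.0
    m = re.search(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>",
                  response, flags=re.DOTALL)
    if m:
        reward += 0.5                                  # format reward
        if m.group(2).strip() == gold_answer.strip():
            reward += 1.0                              # accuracy reward
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think><answer>4</answer>", "4"))  # 1.5
```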

Drawbacks: poor readability and language mixing

DeepSeek-R1: RL with Cold Start

  1. Cold start
    • collect thousands of long CoT examples to fine-tune V3-Base as the initial RL actor
    • data collection:
      • few-shot prompting with a long CoT as an example
      • directly prompting the model to generate detailed answers with reflection and verification
      • gathering R1-Zero outputs in a readable format, post-processed by human annotators
    • benefit: readability and better performance
  2. Reasoning-oriented RL
    • train on reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions
    • language consistency reward: the fraction of target-language words in the CoT, added to curb language mixing (toy sketch after this list)
  3. Rejection Sampling and Supervised Fine-Tuning
    • once stage 2’s RL converges, use the resulting checkpoint to collect SFT data via rejection sampling (sketch after this list)
    • reasoning data: rejection-sampled from the RL checkpoint and filtered for correctness and readability (~600k samples)
    • non-reasoning data: writing, factual QA, self-cognition, translation, etc., reusing parts of the DeepSeek-V3 SFT pipeline (~200k samples)
  4. RL for all Scenarios
    • a secondary RL stage aimed at helpfulness and harmlessness (alignment), combining rule-based rewards for reasoning prompts with reward models for general prompts, while preserving reasoning ability
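
The language consistency reward (stage 2) is described as the proportion of target-language words in the CoT. A rough sketch, with an ASCII check standing in for a real language identifier; the heuristic and function name are my own assumptions:

```python
def language_consistency_reward(cot: str, target_lang: str = "en") -> float:
    """Fraction of CoT words in the target language; added to the task reward
    to discourage language mixing. The ASCII test is only a crude stand-in."""
    words = cot.split()
    if not words:
        return 0.0
    if target_lang == "en":
        in_target = sum(w.isascii() for w in words)
    else:
        in_target = sum(not w.isascii() for w in words)
    return in_target / len(words)
```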
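
A sketch of the rejection-sampling step (stage 3) used to build the reasoning SFT set; `generate` and `reward_fn` are placeholder callables, not the paper's actual tooling, and the per-prompt sample count is an assumption:

```python
def rejection_sample_sft(prompts, generate, reward_fn, samples_per_prompt=16):
    """Keep only high-reward generations from the converged stage-2 RL
    checkpoint as (prompt, response) pairs for supervised fine-tuning."""
    sft_data = []
    for prompt in prompts:
        candidates = generate(prompt, samples_per_prompt)    # sample from RL checkpoint
        best = max(candidates, key=lambda c: reward_fn(prompt, c))
        if reward_fn(prompt, best) > 0:                      # drop prompts with no acceptable sample
            sft_data.append({"prompt": prompt, "response": best})
    return sft_data
```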

Unsuccessful Attempts

  • Process Reward Models (PRM): fine-grained reasoning steps are hard to define, intermediate steps are hard to judge, and the approach is prone to reward hacking
  • Monte Carlo Tree Search (MCTS): the token-level search space is far larger than in board games, and training a fine-grained value model to guide the search proved difficult

Distillation

  • SFT smaller dense models (Qwen and Llama series) on the ~800k samples curated with DeepSeek-R1; no RL is applied to the distilled models
  • distilling from R1 outperforms running large-scale RL directly on the small models