# DDQN-paper-into-code

Implementation of the research paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt, Arthur Guez, and David Silver (DeepMind, 2015).
## Paper Summary
### The Problem: Overestimation Bias in Q-Learning
Traditional Deep Q-Networks (DQN) suffer from overestimation bias because they use the same network to both:
- Select the best action (argmax)
- Evaluate that action's value
This leads to systematic overestimation of Q-values, especially in stochastic environments, which can harm learning performance.
### The Solution: Double Q-Learning
The paper introduces Double DQN (DDQN), which decouples action selection from action evaluation:
- Online Network: Selects the best action for the next state
- Target Network: Evaluates the value of that selected action
**Key formula (DDQN):**

```
Q_target = r + γ * Q_target(s', argmax_a Q_online(s', a))
```

instead of DQN's:

```
Q_target = r + γ * max_a Q_target(s', a)
```
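The overestimation effect is easy to see numerically. The sketch below (pure Python, illustrative, not from this repo) uses 10 actions whose true values are all zero; a single noisy estimator that both selects and evaluates (DQN-style max) is biased upward, while letting one estimator select and an independent one evaluate (the double estimator) is not:

```python
import random

random.seed(0)
num_actions, num_trials = 10, 10_000
single_sum = double_sum = 0.0

for _ in range(num_trials):
    # Two independent noisy estimates of Q-values whose true value is 0
    est_a = [random.gauss(0, 1) for _ in range(num_actions)]
    est_b = [random.gauss(0, 1) for _ in range(num_actions)]

    # DQN-style: the same estimate selects AND evaluates -> biased upward
    single_sum += max(est_a)

    # Double estimator: est_a selects the action, est_b evaluates it -> unbiased
    best = max(range(num_actions), key=lambda i: est_a[i])
    double_sum += est_b[best]

print(f"single estimator: {single_sum / num_trials:+.3f}")  # well above 0
print(f"double estimator: {double_sum / num_trials:+.3f}")  # close to 0
```

With 10 standard-normal estimates, the single estimator's average max is roughly +1.5 even though every true value is zero.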
### Benefits

- **Reduces overestimation** - more accurate Q-value estimates
- **Better generalization** - improves performance in noisy/stochastic environments
- **Same computational cost** - no additional overhead compared to DQN
- **State-of-the-art results** - achieves superior performance on Atari 2600 games
## Implementation Details

This implementation applies DDQN to Atari Breakout using PyTorch and Gymnasium (the maintained successor to OpenAI Gym).
### Architecture

#### Neural Network

```
Input: 4 stacked grayscale frames (84×84×4)
Conv1: 32 filters, 8×8 kernel, stride 4 → ReLU
Conv2: 64 filters, 4×4 kernel, stride 2 → ReLU
Conv3: 64 filters, 3×3 kernel, stride 1 → ReLU
Flatten
FC1: 512 units → ReLU
FC2: num_actions (4 for Breakout)
```
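This architecture can be sketched in PyTorch as follows (the class name `DDQNNet` is illustrative and need not match the one in `main.py`; with 84×84 input, the three conv layers yield a 64×7×7 = 3,136-dimensional feature vector):

```python
import torch
import torch.nn as nn

class DDQNNet(nn.Module):
    """DQN/DDQN conv net: 4 stacked 84x84 frames -> one Q-value per action."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            nn.Flatten(),                                           # -> 3136
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x / 255.0)  # scale uint8 pixels to [0, 1]

q_values = DDQNNet()(torch.zeros(1, 4, 84, 84))
print(q_values.shape)  # torch.Size([1, 4])
```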
### Key Components

- **Preprocessing Pipeline**
  - `FireResetEnv`: auto-launches the ball in Breakout (critical!)
  - `AtariPreprocessing`: grayscale conversion + 84×84 resize
  - `FrameStackObservation`: stacks 4 consecutive frames for temporal information
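A `FireResetEnv`-style wrapper is only a few lines. The sketch below is a dependency-free, duck-typed version for illustration; a real one would subclass `gymnasium.Wrapper`, and the grayscale/resize/stacking steps come from `gymnasium.wrappers.AtariPreprocessing` and `FrameStackObservation`:

```python
class FireResetEnv:
    """After every reset, press FIRE (action 1) so Breakout launches the ball
    instead of waiting for the agent to discover the FIRE action."""

    FIRE = 1  # Breakout's FIRE action index

    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        obs, _, terminated, truncated, info = self.env.step(self.FIRE)
        if terminated or truncated:  # rare: episode ended on the FIRE press
            obs, info = self.env.reset(**kwargs)
        return obs, info

    def step(self, action):
        return self.env.step(action)
```

Without this wrapper, an untrained agent can sit through entire episodes in Breakout without ever launching the ball.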
- **Replay Buffer**
  - Capacity: 100,000 transitions
  - Uniform random sampling
  - Stores: (state, action, reward, next_state, done)
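A buffer with these properties can be sketched in a few lines (names are illustrative and need not match those in `main.py`):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        # Transpose to column tuples: states, actions, rewards, next_states, dones
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```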
- **Training Optimizations**
  - **Reward Clipping**: clips rewards to [-1, +1] for stability
  - **Gradient Clipping**: clips gradients to [-1, +1] to prevent exploding gradients
  - **Huber Loss**: smooth L1 loss for robust learning
  - **Target Network**: updated every 1,000 steps
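Reward clipping and the Huber loss are simple enough to state directly. In PyTorch one would normally use `nn.SmoothL1Loss` and clamp gradients in place; the pure-Python versions below just show the math:

```python
def clip_reward(r: float) -> float:
    """Clip a raw game reward into [-1, +1], as in the DQN/DDQN papers."""
    return max(-1.0, min(1.0, r))

def huber(td_error: float, delta: float = 1.0) -> float:
    """Huber (smooth L1) loss: quadratic near zero, linear for large errors,
    so a single large TD error cannot dominate the gradient."""
    a = abs(td_error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

print(clip_reward(4.0))  # 1.0
print(huber(0.5))        # 0.125 (quadratic region)
print(huber(3.0))        # 2.5   (linear region)
```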
### Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| Total Frames | 5,000,000 | Total training steps |
| Batch Size | 32 | Minibatch size for learning |
| Learning Rate | 0.0001 | Adam optimizer learning rate |
| Gamma (γ) | 0.99 | Discount factor |
| Epsilon Start | 1.0 | Initial exploration rate |
| Epsilon End | 0.01 | Final exploration rate |
| Epsilon Decay | 1,000,000 | Frames to decay epsilon |
| Target Update | 1,000 | Steps between target network updates |
| Replay Start | 10,000 | Frames before learning begins |
## Usage

### Prerequisites

```bash
pip install -r requirements.txt
```
**Requirements:**
- Python 3.8+
- PyTorch
- Gymnasium
- ALE (Arcade Learning Environment)
- OpenCV
- NumPy
- Matplotlib
- imageio
### Training

```bash
python main.py
```
**Training Progress:**

- Models saved every 100 episodes → `models/ddqn_breakout_{episode}.pth`
- Training graphs saved every 100 episodes → `graphs/training_step_{episode}.png`
- Gameplay recordings (GIFs) saved every 50 episodes → `recordings/episode_{episode}.gif`
**Estimated Training Time:**
- CPU: ~24-48 hours for 5M frames
- GPU (CUDA): ~4-8 hours
## Results
The implementation generates:
- **Training Graphs** (`graphs/`)
  - Episode rewards over time
  - 100-episode moving average
  - Tracks learning progress
- **Gameplay Recordings** (`recordings/`)
  - High-quality GIFs (upscaled 3×)
  - Shows the agent's gameplay every 50 episodes
  - Near-greedy policy (ε=0.01) for best performance
- **Model Checkpoints** (`models/`)
  - Saved every 100 episodes
  - Can resume training or evaluate later
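Resuming from a checkpoint is standard `torch.save`/`torch.load`. A minimal sketch, with the caveat that the checkpoint keys shown here are illustrative and may not match the exact format written by `main.py`:

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(4, 2)  # stand-in for the full DDQN network

# Save: state dict plus any bookkeeping needed to resume training
path = os.path.join(tempfile.gettempdir(), "ddqn_breakout_100.pth")
torch.save({"model": net.state_dict(), "episode": 100}, path)

# Load: weights_only=True restricts unpickling to tensors and primitives
ckpt = torch.load(path, weights_only=True)
net.load_state_dict(ckpt["model"])
print(ckpt["episode"])  # 100
```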
### Directory Structure

```
DDQN-paper-into-code/
├── main.py            # Main training script
├── requirements.txt   # Python dependencies
├── README.md          # This file
├── graphs/            # Training progress plots
├── recordings/        # Gameplay GIFs
└── models/            # Saved model checkpoints
```
## Key Implementation Highlights

### Double Q-Learning Core (lines 169-172)
```python
with torch.no_grad():
    # Online network selects the best action
    next_actions = self.online_net(next_states).argmax(dim=1, keepdim=True)
    # Target network evaluates that action
    next_q_values = self.target_net(next_states).gather(1, next_actions)
    target_q = rewards + (self.gamma * next_q_values * (~dones))
```
This is the heart of DDQN: decoupling action selection from action evaluation.
### Exploration Strategy
Uses ε-greedy with linear decay:
- Start: 100% random actions (ε=1.0)
- Decay over 1M frames
- End: 1% random actions (ε=0.01)
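The schedule above reduces to one line of arithmetic; a sketch (function name and defaults are illustrative):

```python
def epsilon(step: int, start: float = 1.0, end: float = 0.01,
            decay_frames: int = 1_000_000) -> float:
    """Linearly anneal epsilon from `start` to `end` over `decay_frames`,
    then hold it at `end` for the rest of training."""
    frac = min(step / decay_frames, 1.0)
    return start + frac * (end - start)

print(epsilon(0))          # 1.0 at the start
print(epsilon(500_000))    # ~0.505 halfway through the decay
print(epsilon(2_000_000))  # held at the floor of ~0.01
```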
## References
**Original Paper:**

- Hado van Hasselt, Arthur Guez, David Silver. *Deep Reinforcement Learning with Double Q-learning.* DeepMind, AAAI 2016.

**Related Papers:**

- *Playing Atari with Deep Reinforcement Learning* (DQN; Mnih et al., 2013)
- *Human-level control through deep reinforcement learning* (Nature DQN; Mnih et al., 2015)
## Environment

**Game:** Atari Breakout (`BreakoutNoFrameskip-v4`)

**Objective:** Use a paddle to bounce a ball and break bricks

**Action Space:** 4 discrete actions
- 0: NOOP
- 1: FIRE (launch ball)
- 2: RIGHT
- 3: LEFT
**State Space:** 4 stacked 84×84 grayscale frames
## Acknowledgments
This implementation is based on the seminal work by DeepMind researchers and follows best practices from:
- Original DDQN paper
- OpenAI Baselines
- PyTorch DQN tutorial
- Atari preprocessing techniques from DQN literature
## License
This project is for educational purposes, implementing the research paper "Deep Reinforcement Learning with Double Q-learning" for learning and demonstration.
## Author
Created as part of learning Deep Reinforcement Learning and implementing research papers into working code.
GitHub Repository: https://github.com/satyammistari/DDQN-paper-into-code