# DDQN-paper-into-code

Implementation of the research paper "Deep Reinforcement Learning with Double Q-learning" by Hado van Hasselt, Arthur Guez, and David Silver (DeepMind, 2015).
## Paper Summary
### The Problem: Overestimation Bias in Q-Learning
Traditional Deep Q-Networks (DQN) suffer from overestimation bias because they use the same network to both:
- Select the best action (argmax)
- Evaluate that action's value
This leads to systematic overestimation of Q-values, especially in stochastic environments, which can harm learning performance.
### The Solution: Double Q-Learning
The paper introduces Double DQN (DDQN), which decouples action selection from action evaluation:
- Online Network: Selects the best action for the next state
- Target Network: Evaluates the value of that selected action
**Key formula (DDQN):**

```
Q_target = r + γ * Q_target(s', argmax_a Q_online(s', a))
```

instead of DQN's:

```
Q_target = r + γ * max_a Q_target(s', a)
```
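The overestimation effect is easy to see numerically. The sketch below (pure Python, illustrative, not from this repo) uses 10 actions whose true values are all zero; a single noisy estimator that both selects and evaluates (DQN-style max) is biased upward, while letting one estimator select and an independent one evaluate (the double estimator) is not:

```python
import random

random.seed(0)
num_actions, num_trials = 10, 10_000
single_sum = double_sum = 0.0

for _ in range(num_trials):
    # Two independent noisy estimates of Q-values whose true value is 0
    est_a = [random.gauss(0, 1) for _ in range(num_actions)]
    est_b = [random.gauss(0, 1) for _ in range(num_actions)]

    # DQN-style: the same estimate selects AND evaluates -> biased upward
    single_sum += max(est_a)

    # Double estimator: est_a selects the action, est_b evaluates it -> unbiased
    best = max(range(num_actions), key=lambda i: est_a[i])
    double_sum += est_b[best]

print(f"single estimator: {single_sum / num_trials:+.3f}")  # well above 0
print(f"double estimator: {double_sum / num_trials:+.3f}")  # close to 0
```

With 10 standard-normal estimates, the single estimator's average max is roughly +1.5 even though every true value is zero.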
### Benefits

- **Reduces overestimation** - more accurate Q-value estimates
- **Better generalization** - improves performance in noisy/stochastic environments
- **Same computational cost** - no additional overhead compared to DQN
- **State-of-the-art results** - achieves superior performance on Atari 2600 games
## Implementation Details

This implementation applies DDQN to Atari Breakout using PyTorch and Gymnasium (the maintained successor to OpenAI Gym).
### Architecture

#### Neural Network

```
Input: 4 stacked grayscale frames (84×84×4)
Conv1: 32 filters, 8×8 kernel, stride 4 → ReLU
Conv2: 64 filters, 4×4 kernel, stride 2 → ReLU
Conv3: 64 filters, 3×3 kernel, stride 1 → ReLU
Flatten
FC1: 512 units → ReLU
FC2: num_actions (4 for Breakout)
```
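This architecture can be sketched in PyTorch as follows (the class name `DDQNNet` is illustrative and need not match the one in `main.py`; with 84×84 input, the three conv layers yield a 64×7×7 = 3,136-dimensional feature vector):

```python
import torch
import torch.nn as nn

class DDQNNet(nn.Module):
    """DQN/DDQN conv net: 4 stacked 84x84 frames -> one Q-value per action."""

    def __init__(self, num_actions: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> 32x20x20
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> 64x9x9
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> 64x7x7
            nn.Flatten(),                                           # -> 3136
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x / 255.0)  # scale uint8 pixels to [0, 1]

q_values = DDQNNet()(torch.zeros(1, 4, 84, 84))
print(q_values.shape)  # torch.Size([1, 4])
```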
### Key Components

- **Preprocessing Pipeline**
  - `FireResetEnv`: auto-launches the ball in Breakout (critical!)
  - `AtariPreprocessing`: grayscale conversion + 84×84 resize
  - `FrameStackObservation`: stacks 4 consecutive frames for temporal information
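A `FireResetEnv`-style wrapper is only a few lines. The sketch below is a dependency-free, duck-typed version for illustration; a real one would subclass `gymnasium.Wrapper`, and the grayscale/resize/stacking steps come from `gymnasium.wrappers.AtariPreprocessing` and `FrameStackObservation`:

```python
class FireResetEnv:
    """After every reset, press FIRE (action 1) so Breakout launches the ball
    instead of waiting for the agent to discover the FIRE action."""

    FIRE = 1  # Breakout's FIRE action index

    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        obs, _, terminated, truncated, info = self.env.step(self.FIRE)
        if terminated or truncated:  # rare: episode ended on the FIRE press
            obs, info = self.env.reset(**kwargs)
        return obs, info

    def step(self, action):
        return self.env.step(action)
```

Without this wrapper, an untrained agent can sit through entire episodes in Breakout without ever launching the ball.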
- **Replay Buffer**
  - Capacity: 100,000 transitions
  - Uniform random sampling
  - Stores: (state, action, reward, next_state, done)
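A buffer with these properties can be sketched in a few lines (names are illustrative and need not match those in `main.py`):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer with uniform random sampling."""

    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int = 32):
        batch = random.sample(self.buffer, batch_size)
        # Transpose to column tuples: states, actions, rewards, next_states, dones
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)
```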
- **Training Optimizations**
  - **Reward Clipping**: clips rewards to [-1, +1] for stability
  - **Gradient Clipping**: clips gradients to [-1, +1] to prevent exploding gradients
  - **Huber Loss**: smooth L1 loss for robust learning
  - **Target Network**: updated every 1,000 steps
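Reward clipping and the Huber loss are simple enough to state directly. In PyTorch one would normally use `nn.SmoothL1Loss` and clamp gradients in place; the pure-Python versions below just show the math:

```python
def clip_reward(r: float) -> float:
    """Clip a raw game reward into [-1, +1], as in the DQN/DDQN papers."""
    return max(-1.0, min(1.0, r))

def huber(td_error: float, delta: float = 1.0) -> float:
    """Huber (smooth L1) loss: quadratic near zero, linear for large errors,
    so a single large TD error cannot dominate the gradient."""
    a = abs(td_error)
    return 0.5 * a * a if a <= delta else delta * (a - 0.5 * delta)

print(clip_reward(4.0))  # 1.0
print(huber(0.5))        # 0.125 (quadratic region)
print(huber(3.0))        # 2.5   (linear region)
```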
### Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| Total Frames | 5,000,000 | Total training steps |
| Batch Size | 32 | Minibatch size for learning |
| Learning Rate | 0.0001 | Adam optimizer learning rate |
| Gamma (γ) | 0.99 | Discount factor |
| Epsilon Start | 1.0 | Initial exploration rate |
| Epsilon End | 0.01 | Final exploration rate |
| Epsilon Decay | 1,000,000 | Frames to decay epsilon |
| Target Update | 1,000 | Steps between target network updates |
| Replay Start | 10,000 | Frames before learning begins |
## Usage

### Prerequisites

```bash
pip install -r requirements.txt
```
**Requirements:**
- Python 3.8+
- PyTorch
- Gymnasium
- ALE (Arcade Learning Environment)
- OpenCV
- NumPy
- Matplotlib
- imageio
### Training

```bash
python main.py
```
**Training Progress:**

- Models saved every 100 episodes → `models/ddqn_breakout_{episode}.pth`
- Training graphs saved every 100 episodes → `graphs/training_step_{episode}.png`
- Gameplay recordings (GIFs) saved every 50 episodes → `recordings/episode_{episode}.gif`
**Estimated Training Time:**
- CPU: ~24-48 hours for 5M frames
- GPU (CUDA): ~4-8 hours
## Results
The implementation generates:
- **Training Graphs** (`graphs/`)
  - Episode rewards over time
  - 100-episode moving average
  - Tracks learning progress
- **Gameplay Recordings** (`recordings/`)
  - High-quality GIFs (upscaled 3×)
  - Shows the agent's gameplay every 50 episodes
  - Near-greedy policy (ε=0.01) for best performance
- **Model Checkpoints** (`models/`)
  - Saved every 100 episodes
  - Can resume training or evaluate later
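Resuming from a checkpoint is standard `torch.save`/`torch.load`. A minimal sketch, with the caveat that the checkpoint keys shown here are illustrative and may not match the exact format written by `main.py`:

```python
import os
import tempfile

import torch
import torch.nn as nn

net = nn.Linear(4, 2)  # stand-in for the full DDQN network

# Save: state dict plus any bookkeeping needed to resume training
path = os.path.join(tempfile.gettempdir(), "ddqn_breakout_100.pth")
torch.save({"model": net.state_dict(), "episode": 100}, path)

# Load: weights_only=True restricts unpickling to tensors and primitives
ckpt = torch.load(path, weights_only=True)
net.load_state_dict(ckpt["model"])
print(ckpt["episode"])  # 100
```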
### Directory Structure

```
DDQN-paper-into-code/
├── main.py            # Main training script
├── requirements.txt   # Python dependencies
├── README.md          # This file
├── graphs/            # Training progress plots
├── recordings/        # Gameplay GIFs
└── models/            # Saved model checkpoints
```
## Key Implementation Highlights

### Double Q-Learning Core (lines 169-172)
```python
with torch.no_grad():
    # Online network selects the best action
    next_actions = self.online_net(next_states).argmax(dim=1, keepdim=True)
    # Target network evaluates that action
    next_q_values = self.target_net(next_states).gather(1, next_actions)
    target_q = rewards + (self.gamma * next_q_values * (~dones))
```
This is the heart of DDQN: decoupling action selection from action evaluation.
### Exploration Strategy
Uses ε-greedy with linear decay:
- Start: 100% random actions (ε=1.0)
- Decay over 1M frames
- End: 1% random actions (ε=0.01)
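The schedule above reduces to one line of arithmetic; a sketch (function name and defaults are illustrative):

```python
def epsilon(step: int, start: float = 1.0, end: float = 0.01,
            decay_frames: int = 1_000_000) -> float:
    """Linearly anneal epsilon from `start` to `end` over `decay_frames`,
    then hold it at `end` for the rest of training."""
    frac = min(step / decay_frames, 1.0)
    return start + frac * (end - start)

print(epsilon(0))          # 1.0 at the start
print(epsilon(500_000))    # ~0.505 halfway through the decay
print(epsilon(2_000_000))  # held at the floor of ~0.01
```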
## References
**Original Paper:**

- Hado van Hasselt, Arthur Guez, David Silver. *Deep Reinforcement Learning with Double Q-learning.* DeepMind, AAAI 2016.

**Related Papers:**

- *Playing Atari with Deep Reinforcement Learning* (DQN; Mnih et al., 2013)
- *Human-level control through deep reinforcement learning* (Nature DQN; Mnih et al., 2015)
## Environment

**Game:** Atari Breakout (`BreakoutNoFrameskip-v4`)

**Objective:** Use a paddle to bounce a ball and break bricks

**Action Space:** 4 discrete actions
- 0: NOOP
- 1: FIRE (launch ball)
- 2: RIGHT
- 3: LEFT
**State Space:** 4 stacked 84×84 grayscale frames
## Acknowledgments
This implementation is based on the seminal work by DeepMind researchers and follows best practices from:
- Original DDQN paper
- OpenAI Baselines
- PyTorch DQN tutorial
- Atari preprocessing techniques from DQN literature
## License
This project is for educational purposes, implementing the research paper "Deep Reinforcement Learning with Double Q-learning" for learning and demonstration.
## Author
Created as part of learning Deep Reinforcement Learning and implementing research papers into working code.
GitHub Repository: https://github.com/satyammistari/DDQN-paper-into-code