Featured Project // 03
Project
SUPER MARIO BOT V3
Rainbow DQN learns to play NES Mario
A two-part AI system that teaches a neural network to play Super Mario Bros. on the NES. A Lua script inside the FCEUX emulator controls frame progression, reads memory, and executes inputs. A Python trainer on the GPU side runs a full Rainbow DQN and handles experience replay, reward calculation, and episode management.
The two halves communicate over WebSocket, passing JSON game state and control payloads back and forth every 4 frames. The agent learns from a distance-based reward with stuck detection and escalating penalties.
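As a rough illustration of that exchange, here is a minimal sketch of the JSON payloads on the Python side. The field names (type, frame, mario_x, mario_y, dead, buttons, hold_frames) are hypothetical and chosen for this example; the project's actual message schema is not shown here, and the WebSocket transport itself is left out.

```python
import json

FRAME_SKIP = 4  # from the description: one exchange every 4 frames


def encode_state(frame_count, mario_x, mario_y, dead):
    """Serialize a game-state message (hypothetical field names)."""
    return json.dumps({
        "type": "state",
        "frame": frame_count,
        "mario_x": mario_x,
        "mario_y": mario_y,
        "dead": dead,
    })


def encode_action(buttons):
    """Serialize a control message telling the emulator which buttons to hold."""
    return json.dumps({
        "type": "action",
        "buttons": buttons,
        "hold_frames": FRAME_SKIP,
    })


def decode(message):
    """Parse a message and reject payloads that lack a 'type' field."""
    payload = json.loads(message)
    if "type" not in payload:
        raise ValueError("message has no 'type' field")
    return payload
```

Tagging every message with a type keeps the receiving loop on either side to a single dispatch on that field.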
This is V3. Versions 1 and 2 taught me what not to do. V3 implements all six components of the Rainbow architecture and actually makes progress through World 1-1.
FCEUX Emulator (Lua)
mario_ai.lua reads NES memory for Mario's position, enemy locations, and game state, then advances frames and executes the controller inputs received from the trainer.
Python Trainer (GPU)
Receives game state, captures frames and preprocesses them to 84x84 grayscale, stacks the last 4, feeds the stack through the Rainbow DQN, and sends back the chosen action.
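The preprocessing step can be sketched in plain Python. This is an illustration only: a real trainer would almost certainly use NumPy or OpenCV for speed, and the luminance weights, nearest-neighbour resize, and the FrameStack helper are assumptions made for this example rather than the project's code.

```python
from collections import deque


def to_grayscale(rgb_frame):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luminance values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in rgb_frame]


def resize_nearest(gray, out_h=84, out_w=84):
    """Nearest-neighbour downsample to out_h x out_w."""
    in_h, in_w = len(gray), len(gray[0])
    return [[gray[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


class FrameStack:
    """Keeps the most recent `size` frames, oldest first."""

    def __init__(self, size=4):
        self.frames = deque(maxlen=size)

    def reset(self, frame):
        # At episode start, fill the whole stack with the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return list(self.frames)

    def push(self, frame):
        # Appending to a full deque drops the oldest frame automatically.
        self.frames.append(frame)
        return list(self.frames)
```

Stacking four consecutive frames is what lets the network infer motion (for example Mario's velocity) from otherwise static images.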
Reward Calculation
Distance-based rewards: +1.0 per pixel of new maximum distance, with penalties for dying and for getting stuck. A 5-second stuck timeout terminates the episode.
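A minimal sketch of a reward function along these lines. The +1.0 per pixel and the 5-second timeout come from the description above; the death penalty value, the per-second escalation rule, and the 60 fps assumption are hypothetical placeholders, not the project's tuned numbers.

```python
from dataclasses import dataclass


@dataclass
class RewardTracker:
    max_x: int = 0
    stuck_frames: int = 0
    fps: int = 60                     # assumption: roughly 60 frames per second
    stuck_timeout_s: float = 5.0      # from the description
    death_penalty: float = -15.0      # hypothetical value
    stuck_penalty_step: float = -0.1  # hypothetical value, grows each second

    def step(self, mario_x, dead, frames_elapsed=4):
        """Return (reward, done) for one agent step."""
        if dead:
            return self.death_penalty, True
        if mario_x > self.max_x:
            # Reward only progress beyond the furthest point reached so far.
            reward = 1.0 * (mario_x - self.max_x)
            self.max_x = mario_x
            self.stuck_frames = 0
            return reward, False
        # No new progress: the penalty escalates with each full second stuck.
        self.stuck_frames += frames_elapsed
        seconds_stuck = self.stuck_frames / self.fps
        reward = self.stuck_penalty_step * (1 + int(seconds_stuck))
        done = seconds_stuck >= self.stuck_timeout_s
        return reward, done
```

Rewarding only new maximum distance means the agent cannot farm reward by walking back and forth over ground it has already covered.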
Experience Replay + Learning
Experiences are stored in a 50K-slot, CPU-based prioritized replay buffer (~10.5 GB of RAM). Batches of 128 are transferred to the GPU each training step.
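The memory figure can be sanity-checked with a little arithmetic. One set of assumptions that reproduces roughly 10.5 GB is that each transition stores both the state and the next state as float32 4-frame stacks; that storage layout is an inference made for this check, not something the description confirms.

```python
FRAME_H, FRAME_W, STACK = 84, 84, 4
CAPACITY = 50_000


def replay_bytes(capacity, bytes_per_pixel, copies_per_transition=2):
    """Bytes of frame data, assuming state and next_state are both stored."""
    pixels_per_state = FRAME_H * FRAME_W * STACK
    return capacity * copies_per_transition * pixels_per_state * bytes_per_pixel


float32_gib = replay_bytes(CAPACITY, 4) / 2**30  # about 10.5 GiB
uint8_gib = replay_bytes(CAPACITY, 1) / 2**30    # about 2.6 GiB
```

Under the same assumptions, storing frames as uint8 and converting to float on the GPU would cut the buffer to about a quarter of that size.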
Dueling DQN
Separate value and advantage streams. The network learns which states are valuable independently from which actions matter.
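The two streams are typically recombined by subtracting the mean advantage, as in this scalar sketch. In a distributional setup such as C51 the same aggregation is usually applied per atom before the softmax; the scalar form is shown here for simplicity, and the function name is made up for the example.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


q_values = dueling_q(2.0, [1.0, 0.0, -1.0])  # [3.0, 2.0, 1.0]
```

Subtracting the mean makes the decomposition identifiable: without it, a constant could be shifted between the value and advantage streams without changing Q.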
Double DQN
Decoupled action selection and evaluation to reduce Q-value overestimation. Online network picks the action, target network evaluates it.
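In scalar form, the decoupling looks like the sketch below. The function and argument names are illustrative; with C51 the same idea applies to choosing which next-state distribution to project, but the scalar version shows the mechanism more directly.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """One-step Double DQN target for a single transition.

    q_online_next / q_target_next are lists of Q-values for the next state,
    one entry per action, from the online and target networks respectively.
    """
    if done:
        return reward
    # The online network selects the action...
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    # ...and the target network supplies the value for that action.
    return reward + gamma * q_target_next[best_action]


target = double_dqn_target(1.0, False, 0.5, [0.1, 0.9], [4.0, 2.0])  # 2.0
```

Note that plain DQN would have used max(q_target_next) = 4.0 here; Double DQN uses 2.0 because the online network prefers action 1.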
C51 Distributional
Models the full return distribution with 51 atoms over the support [-30, 50] instead of a single expected value, which gives a richer learning signal.
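With 51 atoms over [-30, 50], the atoms sit 1.6 apart, and a Q-value is recovered as the expectation of the predicted distribution. A minimal sketch, with constant and function names chosen for this example:

```python
V_MIN, V_MAX, N_ATOMS = -30.0, 50.0, 51

# Spacing between adjacent atoms: (50 - (-30)) / 50 = 1.6
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)
SUPPORT = [V_MIN + i * DELTA_Z for i in range(N_ATOMS)]


def expected_q(probs):
    """Expected return for one action, given a probability per atom."""
    return sum(p * z for p, z in zip(probs, SUPPORT))


# A uniform distribution over the support has its mean at the midpoint, 10.
uniform_q = expected_q([1.0 / N_ATOMS] * N_ATOMS)
```

Action selection still uses these expected values; the distribution itself is what the loss is computed on.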
N-step Returns
3-step bootstrapping for faster reward propagation. The agent sees further into the future when computing TD targets.
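The 3-step target sums three discounted rewards and then bootstraps from the value estimate three steps ahead. A scalar sketch, assuming the reward list has already been truncated at episode end when done is true:

```python
def n_step_return(rewards, gamma, bootstrap_value, done):
    """Discounted n-step return: sum of gamma^k * r_k plus a bootstrapped tail.

    rewards: the n rewards collected after the starting state (n = 3 here).
    bootstrap_value: value estimate for the state reached after n steps.
    done: True if the episode ended within these n steps (no bootstrap).
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    if not done:
        g += (gamma ** len(rewards)) * bootstrap_value
    return g


g3 = n_step_return([1.0, 1.0, 1.0], 0.5, 8.0, False)  # 1 + 0.5 + 0.25 + 1.0
```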
NoisyNet
State-dependent exploration via learned noise parameters in the linear layers. Replaces naive epsilon-greedy with structured exploration.
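A forward-pass-only sketch of a noisy linear layer using factorized Gaussian noise, the variant commonly paired with Rainbow. Whether this project uses the factorized form, and the sigma0 = 0.5 initial value, are assumptions. In a real trainer the mu and sigma parameters are learned by gradient descent; here they are simply fixed at their initial values.

```python
import math
import random


def scale_noise(x):
    """f(x) = sign(x) * sqrt(|x|), used for factorized Gaussian noise."""
    return math.copysign(math.sqrt(abs(x)), x)


class NoisyLinear:
    def __init__(self, in_features, out_features, sigma0=0.5, seed=0):
        self.rng = random.Random(seed)
        self.n_in, self.n_out = in_features, out_features
        bound = 1.0 / math.sqrt(in_features)
        sigma = sigma0 / math.sqrt(in_features)
        self.w_mu = [[self.rng.uniform(-bound, bound) for _ in range(in_features)]
                     for _ in range(out_features)]
        self.b_mu = [self.rng.uniform(-bound, bound) for _ in range(out_features)]
        self.w_sigma = [[sigma] * in_features for _ in range(out_features)]
        self.b_sigma = [sigma] * out_features
        self.reset_noise()

    def reset_noise(self):
        # One noise value per input and per output; weight noise is their product.
        self.eps_in = [scale_noise(self.rng.gauss(0, 1)) for _ in range(self.n_in)]
        self.eps_out = [scale_noise(self.rng.gauss(0, 1)) for _ in range(self.n_out)]

    def forward(self, x, noisy=True):
        out = []
        for j in range(self.n_out):
            total = self.b_mu[j]
            if noisy:
                total += self.b_sigma[j] * self.eps_out[j]
            for i in range(self.n_in):
                w = self.w_mu[j][i]
                if noisy:
                    w += self.w_sigma[j][i] * self.eps_out[j] * self.eps_in[i]
                total += w * x[i]
            out.append(total)
        return out
```

Because sigma is trainable, the network can shrink the noise where it is confident and keep it large where more exploration still pays off.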
Prioritized Replay
Experiences with higher TD error are sampled more frequently. The agent focuses training on the transitions it was most wrong about.
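A sketch of proportional prioritization with importance-sampling weights. The alpha, beta, and eps values below are common illustrative defaults, not this project's settings, and a production buffer would usually sample through a sum tree rather than recomputing the full distribution.

```python
def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |TD error| + eps."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """w_i = (N * P(i))^(-beta), normalized so the largest weight is 1."""
    n = len(probs)
    weights = [(n * p) ** (-beta) for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


probs = sampling_probabilities([0.1, 1.0, 2.0])
weights = importance_weights(probs)
```

The importance weights scale each sample's loss to correct for the bias that non-uniform sampling introduces, so the most frequently sampled transitions get the smallest weights.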