Super Mario Bot V3

PROJECT://OVERVIEW ACTIVE

A two-part AI system that teaches a neural network to play Super Mario Bros. on the NES. A Lua script running inside the FCEUX emulator controls frame progression, reads memory, and executes inputs. A Python trainer on the GPU side runs a full Rainbow DQN and handles experience replay, reward calculation, and episode management.

The two halves communicate over WebSocket, passing JSON game state and control payloads back and forth every 4 frames. The agent learns from a distance-based reward with stuck detection and escalating penalties.
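To make the message exchange concrete, here is a minimal sketch of one state/action round trip using only the standard `json` module. The field names (`type`, `frame`, `mario_x`, `dead`, `action`) are hypothetical placeholders, not the project's actual schema, and the WebSocket transport itself is left out.

```python
import json

FRAME_SKIP = 4   # from the overview: one exchange every 4 frames
N_ACTIONS = 12   # from the specs: 12 controller combinations


def encode_state(frame, mario_x, dead):
    """Serialize one game-state message (field names are hypothetical)."""
    return json.dumps(
        {"type": "state", "frame": frame, "mario_x": mario_x, "dead": dead}
    )


def decode_state(raw):
    """Parse a state message and reject anything that is not one."""
    msg = json.loads(raw)
    if msg.get("type") != "state":
        raise ValueError("expected a state message")
    return msg


def encode_action(action_index):
    """Serialize the trainer's reply: an index into the action set."""
    if not 0 <= action_index < N_ACTIONS:
        raise ValueError("action index out of range")
    return json.dumps({"type": "action", "action": action_index})


# One round trip: the emulator side sends state, the trainer answers.
raw = encode_state(frame=120, mario_x=312, dead=False)
state = decode_state(raw)
reply = encode_action(3)
```

Keeping the reply down to a single action index keeps the Lua side simple: it only needs a lookup table from index to button combination.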

This is V3. Versions 1 and 2 taught me what not to do. V3 implements all six components of the Rainbow architecture and actually makes progress through World 1-1.

SPECS://TECHNICAL LOADED

STACK Python + PyTorch + Lua

EMULATOR FCEUX 2.6.x (NES)

NETWORK Rainbow DQN (6 components)

INPUT 4x84x84 grayscale + state vector

ACTIONS 12 NES controller combinations

COMMS WebSocket (JSON, port 8765)

STATUS Active Development

LICENSE MIT

ARCHITECTURE://FLOW SYSTEM DIAGRAM MAPPED

01

FCEUX Emulator (Lua)

mario_ai.lua reads NES memory for Mario's position, enemy locations, and game state. Advances frames and executes controller inputs received from the trainer.

↓ WebSocket (JSON) ↓

02

Python Trainer (GPU)

Receives game state, captures and preprocesses 84x84 grayscale frames into 4-frame stacks, feeds them through the Rainbow DQN, and sends back the chosen action.
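The preprocessing step can be sketched in plain Python as below. This is an illustration under assumptions, not the trainer's code: it uses nearest-neighbour downscaling and BT.601 luminance weights, pads the stack by repeating the first frame, and assumes frames arrive as rows of `(r, g, b)` tuples at the NES resolution of 256x240. The helper names are made up for this example.

```python
from collections import deque

STACK_SIZE = 4   # frames per observation, per the INPUT spec
SIZE = 84        # target width and height


def to_gray(rgb_frame):
    """Convert rows of (r, g, b) tuples to luminance (BT.601 weights)."""
    return [
        [0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
        for row in rgb_frame
    ]


def resize_nearest(gray, size=SIZE):
    """Nearest-neighbour downscale to size x size."""
    h, w = len(gray), len(gray[0])
    return [
        [gray[y * h // size][x * w // size] for x in range(size)]
        for y in range(size)
    ]


class FrameStack:
    """Keeps the most recent STACK_SIZE preprocessed frames."""

    def __init__(self):
        self.frames = deque(maxlen=STACK_SIZE)

    def push(self, rgb_frame):
        frame = resize_nearest(to_gray(rgb_frame))
        if not self.frames:
            # Assumption: at episode start, fill the stack with the first frame.
            for _ in range(STACK_SIZE):
                self.frames.append(frame)
        else:
            self.frames.append(frame)  # deque drops the oldest frame itself
        return list(self.frames)


blank = [[(0, 0, 0)] * 256 for _ in range(240)]
white = [[(255, 255, 255)] * 256 for _ in range(240)]
stack = FrameStack()
obs = stack.push(blank)
obs = stack.push(white)  # oldest three frames are blank, newest is white
```

A real pipeline would do this with array operations rather than nested lists; the `deque(maxlen=4)` pattern carries over unchanged.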

↓

03

Reward Calculation

Distance-based rewards: +1.0 per pixel gained beyond the previous maximum distance, with penalties for death and for getting stuck. A 5-second (300-frame) stuck timeout terminates the episode.
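A minimal sketch of that reward logic follows. Only the +1.0/pixel progress reward, the 300-frame timeout, and the 4-frame step come from this page; the penalty magnitudes (`death_penalty`, `stuck_penalty`) and the linear escalation rule are assumptions chosen for illustration.

```python
class RewardTracker:
    """Distance-based reward with stuck detection (illustrative sketch)."""

    def __init__(self, death_penalty=-15.0, stuck_penalty=-1.0,
                 stuck_timeout_frames=300, frames_per_step=4):
        self.death_penalty = death_penalty        # hypothetical value
        self.stuck_penalty = stuck_penalty        # hypothetical value
        self.stuck_timeout_frames = stuck_timeout_frames
        self.frames_per_step = frames_per_step
        self.max_x = 0
        self.stuck_frames = 0

    def reset(self, start_x):
        """Start a new episode from Mario's initial x position."""
        self.max_x = start_x
        self.stuck_frames = 0

    def step(self, mario_x, dead):
        """Return (reward, done) for one agent step."""
        reward = 0.0
        if mario_x > self.max_x:
            # +1.0 per pixel of new maximum distance.
            reward += 1.0 * (mario_x - self.max_x)
            self.max_x = mario_x
            self.stuck_frames = 0
        else:
            # No new progress: the penalty grows the longer Mario stays stuck.
            self.stuck_frames += self.frames_per_step
            reward += (self.stuck_penalty * self.stuck_frames
                       / self.stuck_timeout_frames)
        if dead:
            return reward + self.death_penalty, True
        return reward, self.stuck_frames >= self.stuck_timeout_frames


tracker = RewardTracker()
tracker.reset(start_x=40)
progress_reward, done = tracker.step(mario_x=50, dead=False)
```

Rewarding only new maximum distance, rather than any rightward movement, means the agent cannot farm reward by walking back and forth.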

↓

04

Experience Replay + Learning

Experiences stored in a 50K-slot CPU-based prioritized replay buffer (~10.5GB RAM). Batches of 128 transferred to GPU each training step.
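The ~10.5GB figure can be reproduced with a quick back-of-the-envelope calculation. This assumes each transition stores both the state and the next state as float32 frame stacks and ignores the much smaller state vector, action, and reward fields; the page does not state the storage layout, so treat the layout as a guess that happens to match the number.

```python
CAPACITY = 50_000              # replay slots, from the text
FRAME_STACK = (4, 84, 84)      # one observation, from the INPUT spec
BYTES_PER_VALUE = 4            # assumption: float32 storage
STATES_PER_TRANSITION = 2      # assumption: state and next_state both kept

values_per_state = FRAME_STACK[0] * FRAME_STACK[1] * FRAME_STACK[2]
bytes_per_transition = (
    values_per_state * BYTES_PER_VALUE * STATES_PER_TRANSITION
)
total_bytes = CAPACITY * bytes_per_transition
total_gib = total_bytes / 2**30

# The same layout with uint8 pixels would need a quarter of the memory.
uint8_gib = total_bytes / 4 / 2**30

print(f"{total_gib:.1f} GiB as float32, {uint8_gib:.1f} GiB as uint8")
```

Under these assumptions the frames dominate the footprint, which is why the buffer lives in CPU RAM and only each batch of 128 is moved to the GPU.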

RAINBOW://COMPONENTS ALL 6 MODULES ONLINE

DL

Dueling DQN

Separate value and advantage streams. The network learns which states are valuable independently from which actions matter.
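The way the two streams are recombined can be shown in a few lines. This sketch uses scalar Q-values for readability; in a distributional network like this one, the same aggregation would typically be applied to the per-atom logits instead. The function name is illustrative.

```python
def dueling_q(value, advantages):
    """Combine a state value and per-action advantages into Q-values.

    Subtracting the mean advantage keeps the split identifiable:
    Q(s, a) = V(s) + A(s, a) - mean(A(s, .))
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + a - mean_adv for a in advantages]


# A state worth 2.0 where the first action is the best choice.
q_values = dueling_q(2.0, [1.0, 0.0, -1.0])
```

Without the mean subtraction, any constant could be shifted between the value and advantage streams without changing the output, so neither stream would have a stable meaning.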

DD

Double DQN

Decoupled action selection and evaluation to reduce Q-value overestimation. Online network picks the action, target network evaluates it.
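A scalar-Q sketch of the target computation makes the decoupling visible. In the distributional setting the online network would pick the action by expected value and the target network would supply that action's distribution, but the select-then-evaluate pattern is the same. The function name and example numbers are illustrative.

```python
def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """One-step Double DQN target.

    The online network chooses the next action; the target network
    provides the value of that chosen action.
    """
    if done:
        return reward
    best_action = max(range(len(q_online_next)),
                      key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


q_online_next = [1.0, 5.0, 2.0]   # online net prefers action 1
q_target_next = [0.5, 3.0, 9.0]   # target net overrates action 2

double_target = double_dqn_target(1.0, False, 0.99,
                                  q_online_next, q_target_next)
# Vanilla DQN would take the max over the target network's own estimates.
vanilla_target = 1.0 + 0.99 * max(q_target_next)
```

Here the vanilla target chases the target network's inflated estimate for action 2, while the Double DQN target uses the more modest value of the action the online network actually prefers.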

C51

C51 Distributional

Models the full return distribution with 51 atoms over support [-30, 50] instead of a single expected value. Richer learning signal.
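The support and the way a distribution collapses back to a Q-value can be sketched as follows. The constants come from this page; the helper names are illustrative, and the softmax is written out by hand to keep the example dependency-free.

```python
import math

V_MIN, V_MAX, N_ATOMS = -30.0, 50.0, 51

# 51 evenly spaced atoms give a spacing of (50 - (-30)) / 50 = 1.6.
DELTA_Z = (V_MAX - V_MIN) / (N_ATOMS - 1)
ATOMS = [V_MIN + i * DELTA_Z for i in range(N_ATOMS)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_q(logits):
    """Collapse one action's return distribution to its expected value."""
    probs = softmax(logits)
    return sum(p * z for p, z in zip(probs, ATOMS))


# A uniform distribution over the support has its mean at the midpoint.
uniform_q = expected_q([0.0] * N_ATOMS)
```

Because the support is asymmetric, an untrained network with roughly uniform outputs starts with an expected value near the midpoint of 10 rather than 0.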

NS

N-step Returns

3-step bootstrapping for faster reward propagation. The agent sees further into the future when computing TD targets.
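As a worked example, the 3-step return sums three discounted real rewards and then bootstraps from the value estimate three steps ahead. The sketch below uses scalar values and the page's gamma of 0.99; the function name and sample rewards are illustrative.

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99, done=False):
    """Discounted sum of n rewards plus a bootstrapped tail.

    G = r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V(s_n)
    The tail is dropped when the episode ended within the n steps.
    """
    total = 0.0
    for k, reward in enumerate(rewards):
        total += (gamma ** k) * reward
    if not done:
        total += (gamma ** len(rewards)) * bootstrap_value
    return total


# Three steps of +1.0 reward, then a value estimate of 10.0.
g3 = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0)
g3_terminal = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, done=True)
```

With n=3 the bootstrapped estimate is discounted by gamma cubed, so errors in the value estimate weigh slightly less in the target than with one-step returns.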

NN

NoisyNet

State-dependent exploration via learned noise parameters in the linear layers. In standard Rainbow this replaces naive epsilon-greedy with structured exploration; the config below still lists an epsilon schedule alongside it.
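A noisy linear layer can be sketched without any framework. This version assumes the factorized Gaussian variant from the NoisyNet paper (the page does not say which variant is used) and treats the `mu` and `sigma` arrays as already-learned parameters. All names are illustrative.

```python
import math
import random


def scaled_noise(n, rng):
    """Draw n Gaussian samples and apply f(x) = sign(x) * sqrt(|x|)."""
    return [math.copysign(math.sqrt(abs(x)), x)
            for x in (rng.gauss(0.0, 1.0) for _ in range(n))]


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng):
    """Forward pass of one noisy layer with factorized noise.

    weight[j][i] = mu_w[j][i] + sigma_w[j][i] * eps_out[j] * eps_in[i]
    bias[j]      = mu_b[j]    + sigma_b[j]    * eps_out[j]
    """
    eps_in = scaled_noise(len(x), rng)
    eps_out = scaled_noise(len(mu_w), rng)
    out = []
    for j in range(len(mu_w)):
        acc = mu_b[j] + sigma_b[j] * eps_out[j]
        for i in range(len(x)):
            weight = mu_w[j][i] + sigma_w[j][i] * eps_out[j] * eps_in[i]
            acc += weight * x[i]
        out.append(acc)
    return out


rng = random.Random(0)
x = [1.0, 2.0]
mu_w = [[1.0, 0.0], [0.0, 1.0]]
mu_b = [0.5, -0.5]
zeros_w = [[0.0, 0.0], [0.0, 0.0]]
zeros_b = [0.0, 0.0]
half_w = [[0.5, 0.5], [0.5, 0.5]]
half_b = [0.5, 0.5]

# With sigma at zero the layer is an ordinary linear layer.
plain = noisy_linear(x, mu_w, zeros_w, mu_b, zeros_b, rng)
# With non-zero sigma, each forward pass perturbs the output differently.
noisy_a = noisy_linear(x, mu_w, half_w, mu_b, half_b, rng)
noisy_b = noisy_linear(x, mu_w, half_w, mu_b, half_b, rng)
```

Since `sigma` is trained by gradient descent like any other weight, the network can shrink the noise where it has become confident and keep it where it has not.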

PR

Prioritized Replay

Experiences with higher TD error are sampled more frequently. The agent focuses training on the transitions it was most wrong about.
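The sampling rule can be sketched with the standard library alone. The exponents `alpha=0.6` and `beta=0.4` are common defaults from the prioritized replay literature, not values taken from this project, and a production buffer would normally use a sum tree instead of recomputing probabilities over a flat list.

```python
import random

BATCH_SIZE = 128  # from the config


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """P(i) is proportional to (|td_error_i| + eps) ** alpha."""
    priorities = [(abs(e) + eps) ** alpha for e in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def importance_weights(probs, beta=0.4):
    """Weights that correct the sampling bias, scaled so the max is 1."""
    n = len(probs)
    weights = [(n * p) ** -beta for p in probs]
    max_w = max(weights)
    return [w / max_w for w in weights]


def sample_indices(probs, batch_size, rng):
    """Draw buffer indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


td_errors = [0.1, 1.0, 5.0]
probs = sampling_probabilities(td_errors)
weights = importance_weights(probs)
batch = sample_indices(probs, BATCH_SIZE, random.Random(0))
```

The importance weights scale each sample's loss so that frequently drawn transitions do not dominate the gradient.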

TRAINING://LOG EPISODE OUTPUT RUNNING

> python python/main.py train

[INIT] Rainbow DQN loaded | 12 actions | 51 atoms

[OK] WebSocket server listening on :8765

[OK] FCEUX client connected

[EP 142] reward: 847.3 | max_x: 1248 | eps: 0.42 | loss: 0.0031

[EP 143] reward: 1102.6 | max_x: 1580 | eps: 0.42 | loss: 0.0028

[EP 144] reward: 623.1 | max_x: 944 | stuck_timeout

>

CONFIG://PARAMS KEY HYPERPARAMETERS SET

LEARNING RATE 0.00025 (Adam)

BATCH SIZE 128

REPLAY BUFFER 50,000 (~10.5GB RAM)

GAMMA 0.99

EPSILON DECAY 0.9995/episode (floor at 0.01)

TARGET UPDATE Soft (Polyak tau=0.005)

C51 SUPPORT [-30, 50] over 51 atoms

STUCK TIMEOUT 300 frames (5 seconds)
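Two of the schedules above are simple enough to sketch directly: the per-episode epsilon decay and the Polyak target update. The starting epsilon of 1.0 is an assumption, and the parameters are shown as flat lists of floats rather than network tensors.

```python
def decay_epsilon(eps, decay=0.9995, floor=0.01):
    """Multiply epsilon by the decay factor once per episode, down to a floor."""
    return max(floor, eps * decay)


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * o + (1.0 - tau) * t
            for t, o in zip(target_params, online_params)]


# Count how many episodes it takes to reach the floor.
eps = 1.0  # assumption: training starts fully random
episodes_to_floor = 0
while eps > 0.01:
    eps = decay_epsilon(eps)
    episodes_to_floor += 1

# One soft update moves the target 0.5% of the way toward the online net.
target = soft_update([0.0, 1.0], [1.0, 1.0])
```

Starting from 1.0, a 0.9995 decay takes on the order of 9,200 episodes to reach the 0.01 floor, so exploration falls off slowly. With tau=0.005 the target network trails the online network smoothly instead of jumping at fixed intervals.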