DQN Maze Solver
In the previous tutorial I showed how to build a maze-solving Q-learner using a table-based Q-learning method. In this post, I show how to solve the same maze using DQN (Deep Q-Learning). The code for this example program is found here.
While the table-based approach is ideal for small state spaces (869 bytes for 6 states), DQN replaces the Q-table with a neural network, enabling reinforcement learning in environments with hundreds or thousands of states where a Q-table would be prohibitively large. The entire DQN implementation here (neural network, Q-learning logic, and lookup tables) fits in under 16 KB. In standard frameworks like PyTorch or TensorFlow, a DQN of any size requires tens of megabytes of runtime before the first weight is allocated.
Reinforcement Learning
In a reinforcement learning problem, an Agent interacts with an Environment by evaluating its State, taking Actions, and receiving Rewards. The goal is to learn which Actions yield the most Reward over time.
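To make that loop concrete, here is a toy sketch in C++. Everything in it is hypothetical and exists only to show the shape of the interaction; it is not the tinymind interface.

#include <iostream>

// Toy sketch of the Agent/Environment loop (hypothetical types and names).
struct Environment
{
    int state = 0;
    bool done() const { return state >= 3; }       // episode ends at State 3
    int step(const int action)                      // apply an Action, return a Reward
    {
        state += action;
        return done() ? 1 : 0;                      // Reward only at the goal
    }
};

struct Agent
{
    int chooseAction(const int /*state*/) const { return 1; } // placeholder policy
    void learn(int, int, int) {}                              // learning would go here
};

int main()
{
    Environment env;
    Agent agent;
    while (!env.done())
    {
        const int state  = env.state;                  // evaluate the State
        const int action = agent.chooseAction(state);  // take an Action
        const int reward = env.step(action);           // receive a Reward
        agent.learn(state, action, reward);            // learn which Actions pay off
    }
    std::cout << "episode finished in state " << env.state << std::endl;
}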

The Maze Problem
As in the previous tutorial, we’re going to attempt to solve the maze problem. A mouse gets dropped randomly into 1 of 6 places on the maze and must learn to find the cheese (reward).
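For reference, the reward structure I'm assuming here is the classic six-room layout from the previous tutorial: rooms 0 through 4 are inside the maze, room 5 is outside and holds the cheese, any move that reaches room 5 pays 100, other legal moves pay 0, and -1 marks a wall. The table below is an illustrative sketch of that assumption, not code copied from the example.

#include <cstdint>

// Assumed room connectivity and rewards (classic 6-room maze; verify against the
// previous tutorial). Row = current room, column = room we try to move to.
// -1 = no door, 0 = legal move, 100 = move that reaches the cheese (room 5).
static const int8_t MAZE_REWARDS[6][6] = {
    //        to: 0    1    2    3    4    5
    /* from 0 */ { -1,  -1,  -1,  -1,   0,  -1 },
    /* from 1 */ { -1,  -1,  -1,   0,  -1, 100 },
    /* from 2 */ { -1,  -1,  -1,   0,  -1,  -1 },
    /* from 3 */ { -1,   0,   0,  -1,   0,  -1 },
    /* from 4 */ {  0,  -1,  -1,   0,  -1, 100 },
    /* from 5 */ { -1,   0,  -1,  -1,   0, 100 },
};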

DQN Architecture
For this problem we will get rid of the Q-table and replace it with a neural network. We will need 1 input neuron (State) and 6 output neurons (Actions). I've chosen to use a single hidden layer with 2 more neurons than the output layer (i.e. 8 hidden neurons).

Each output neuron represents the predicted Q-value for one action taken from the current state. The single input neuron represents the current state of our mouse. Since the network operates on fixed-point Q-format values, I scale the input into the range [0..1] by dividing the current state by the maximum possible state:
template<typename StateType,
         typename ActionType,
         typename ValueType,
         size_t NumberOfStates,
         size_t NumberOfActions,
         typename RewardPolicyType,
         template<typename> class QLearningPolicy = tinymind::DefaultLearningPolicy
        >
struct DQNMazeEnvironment : public tinymind::QLearningEnvironment<state_t, action_t, ValueType, NumberOfStates, NumberOfActions, DQNMazeEnvironmentRandomNumberGeneratorPolicy>
{
    ...
    // Scale the current state into [0..1] and write it to the single input neuron.
    static void getInputValues(const StateType state, ValueType *pInputs)
    {
        // Largest possible state, as a fixed-point Q-format value.
        static const ValueType MAX_STATE = ValueType((NumberOfStates - 1), 0);
        const ValueType input = (ValueType(state, 0) / MAX_STATE);
        *pInputs = input;
    }
    ...
};
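With six states, MAX_STATE is 5, so, for example, state 3 is presented to the single input neuron as 3/5 = 0.6, stored as a fixed-point Q-format value.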
DQN Neural Network
The neural network is trained to predict the Q-value for every action taken from the given state. Once trained, we choose the action with the highest predicted Q-value. See the Neural Networks page and the Under 4KB tutorial for more insight into how these neural networks work.
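To illustrate just the action-selection step, here is a generic sketch (the function and parameter names are hypothetical, not the tinymind API): choosing the action is simply an argmax over the Q-values read back from the six output neurons.

#include <cstddef>

// Hypothetical sketch: pick the action whose predicted Q-value is largest.
// qValues holds one predicted Q-value per action for the current state,
// i.e. the values read back from the output neurons.
template<typename ValueType, size_t NumberOfActions>
size_t chooseGreedyAction(const ValueType (&qValues)[NumberOfActions])
{
    size_t bestAction = 0;
    for (size_t action = 1; action < NumberOfActions; ++action)
    {
        if (qValues[action] > qValues[bestAction])
        {
            bestAction = action; // remember the index of the highest Q-value so far
        }
    }
    return bestAction;
}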

The DQN is defined within dqn_mazelearner.h.
Building The Example
cd examples/dqn_maze
make
Run the example:
make run
Visualizing Training And Testing
I have included a Python script to plot the training and test data.
Training data for start state == 2:

After training, the Q-learner has learned an optimal path: 2->3->1->5:

Determining The Size Of The DQN
make release
size output/dqn_mazelearner.o
   text    data     bss     dec     hex filename
  10979       8    3652   14639    392f output/dqn_mazelearner.o
The entire DQN fits in under 15 KB (14,639 bytes). Pretty small as DQNs go.
Conclusion
DQN swaps the Q-table for a neural network that learns the relationship between states, actions, and Q-values. You would reach for DQN when the state space is large enough that the memory consumed by a Q-table becomes prohibitive. In effect, we're trading memory for CPU cycles: the CPU overhead of DQN is far larger than a table lookup, but DQN lets us do Q-learning while keeping the memory footprint manageable for complex environments.
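To put rough numbers on that trade-off: a Q-table needs one entry per state/action pair, so at an illustrative 2 bytes per Q-value a 10,000-state, 6-action problem would need 10,000 × 6 × 2 = 120,000 bytes, roughly eight times the entire DQN measured above, and the table keeps growing linearly with the state count. The DQN's weight storage, by contrast, is fixed by its topology no matter how many states the environment has; what you pay instead is the CPU cost of a forward (and, during training, backward) pass on every step.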