Deep Reinforcement Learning to play Flappy Bird using A3C algorithm

Overview

This project uses the Asynchronous Advantage Actor-Critic (A3C) algorithm to play Flappy Bird, implemented with the Keras deep learning library. The details of this algorithm are described in this paper by Google DeepMind. The code for this project can be found in this GitHub repository.

Installation Dependencies

  • Python 3.5
  • Keras 2.0
  • pygame
  • scikit-image

How to Run?

Clone the repository or download it. To test the pre-trained model, run the test.py file. To retrain the model from scratch, run the train_network.py script.

The models trained at different stages will be saved in the saved_models folder.

What is Deep Reinforcement Learning?

Deep Reinforcement Learning is a technique that uses deep neural networks to solve reinforcement learning problems. Reinforcement learning is a machine learning method that lies somewhere between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example and in unsupervised learning one has no labels at all, in reinforcement learning one has sparse and time-delayed labels – the rewards. Based on these rewards, the agent has to learn to behave in the given environment.

The following post is a must-read for a good introduction to Deep Reinforcement Learning - Demystifying Deep Reinforcement Learning

Why A3C?

Till early 2016, Deep Q-learning was the go-to method for playing games through reinforcement learning. Deep Q-learning was proposed by DeepMind back in 2013. It was the first RL algorithm that could play games successfully, and it became so popular that Google acquired DeepMind itself. In February 2016, Google DeepMind proposed another set of algorithms for playing Atari games, which are better than Deep Q-learning. The most successful of these algorithms is the Asynchronous Advantage Actor-Critic (A3C) algorithm. The advantages that A3C has over Deep Q-learning are as follows-

  • Deep Q-learning has a very large training time (~1 week on a GPU), whereas basic A3C takes about 1 day to train on a CPU. (The training time for the Flappy Bird game in this project is barely 6 hours on a CPU!)
  • Deep Q-learning uses experience replay to achieve good convergence, which requires a lot of memory. A3C instead uses multiple parallel threads for this purpose, which eliminates the huge memory requirement.
  • Deep Q-learning is an off-policy learning algorithm that can update from data generated by an older policy (stored in experience replay), even though a better policy may have been discovered later. A3C, on the other hand, is an on-policy method in which updates are made only from data generated by the current policy.

However, because of its better exploration, DQN generally settles at a global minimum, whereas A3C might settle at a local minimum, leading to sub-optimal results.

A complete blog post could be written on Q-learning itself to explain technical terms like experience replay, so I won't delve into those topics here. To learn more about Deep Q-learning and using it to play Flappy Bird, see this excellent blog post by Ben Lau - Using Keras and Deep Q-Network to Play FlappyBird. A rough understanding of the Deep Q-learning approach is a must for proceeding further.

Off policy v/s On policy methods

In the sections above I have repeatedly mentioned off-policy and on-policy methods, so let me explain what they are. If the updates a method makes to the neural network model are partially or completely independent of the policy (the actions actually taken), it is an off-policy method. Otherwise, it is an on-policy method.
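
As a toy illustration (tabular updates, not code from this project), the distinction shows up clearly in the classic update rules: Q-learning bootstraps from the greedy action regardless of what the agent actually did, while SARSA bootstraps from the action the current policy actually took.

import numpy as np

# Purely illustrative tabular example: 5 states, 2 actions, one hypothetical transition.
Q = np.zeros((5, 2))
alpha, gamma = 0.1, 0.99
s, a, r, s_next, a_next = 0, 1, 0.1, 1, 0

# Off-policy (Q-learning): the target uses the best action in s_next,
# independent of the action the behaviour policy will actually take there.
Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

# On-policy (SARSA): the target uses a_next, the action the current policy
# actually chose, so the update depends on the policy being followed.
Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])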

Asynchronous algorithms for Deep Reinforcement Learning

This new approach to solving Reinforcement Learning problems involves the use of asynchronous variants of standard Reinforcement Learning algorithms. Instead of experience replay, multiple agents are executed asynchronously in parallel, on multiple instances of the environment. This parallelism also decorrelates the agents' data into a more stationary process, since at any given time-step the parallel agents will be experiencing a variety of different states. This simple idea enables a much larger spectrum of fundamental on-policy RL algorithms, such as SARSA, n-step methods, and actor-critic methods, as well as off-policy RL algorithms such as Q-learning, to be applied robustly and effectively using deep neural networks. A brief introduction to these algorithms follows-

  1. Asynchronous one-step Q-learning - This is an off-policy learning algorithm, since the updates it makes are independent of the actions actually taken: it assumes that the action with the maximum Q-value will be taken in the next step. It is identical to standard Deep Q-learning, except that it uses asynchronous parallel agents rather than experience replay to obtain statistically independent inputs for updating the neural network.

  2. Asynchronous one-step SARSA - This is the same as asynchronous one-step Q-learning, except that it uses an on-policy approach to perform updates. In general, one-step SARSA leads to better performance than one-step Q-learning, but this may vary from case to case.

  3. Asynchronous n-step Q-learning - This is the same as asynchronous one-step Q-learning, except that it uses up to n future steps to calculate the expected reward at the present step, because of which updates in the policy are propagated back to earlier states more quickly (the n-step target is written out just after this list).

  4. Asynchronous advantage actor-critic - Asynchronous advantage actor-critic (A3C) is the fastest of these algorithms to train. It has 2 versions, one that uses a feed-forward fully-connected layer at the end of the network and another that uses an LSTM (Long Short-Term Memory) layer. We will use the first variant here. A3C uses a policy-based approach along with a critic, which promotes or criticises each action taken depending on how the actual reward compares to the reward expected for that state.
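
For reference, the n-step target that asynchronous n-step Q-learning bootstraps from (this is background material from the DeepMind paper, not code from this project) is-

R_t^{(n)} = r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^{n} \max_a Q(s_{t+n}, a)

A3C uses the same kind of n-step return but bootstraps from the critic's value estimate V(s_{t+n}) instead of \max_a Q(s_{t+n}, a).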

Code Explanation

Now let us go through the code line by line. The code simply does the following:

  1. The code receives the Game Screen Input in the form of a pixel array.
  2. The code does some image pre-processing to convert the image into the required format for feeding into the neural network.
  3. The processed image is fed into a convolutional neural network, and the network then decides the best action to execute (Flap or Not Flap).

The network is trained over a large number of iterations to maximize the expected future reward.

Game Screen Input

First of all, Flappy Bird is already implemented in Python using pygame, so here is the code snippet used to access the Flappy Bird API-

import wrapped_flappy_bird as game

# Step this thread's game instance with action a_t; returns the next frame, the reward and a done flag.
x_t, r_t, terminal = game_state[thread_id].frame_step(a_t)

The idea is quite simple: the input is a_t, a one-hot action vector ([1, 0] represents no flap, [0, 1] represents flap), and the API gives you the next frame x_t, the reward r_t (+0.1 if the bird stays alive, +1.5 if it passes a pipe, -1 if it dies) and terminal, a boolean flag which indicates whether the game has finished or not. As the game runs in multiple threads, 'thread_id' denotes the thread of the running game.
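
A minimal sketch of driving this API in a single game instance is shown below; it assumes the wrapper exposes a GameState class (as in the DeepLearningFlappyBird repository listed in the Disclaimer), and the random flapping is only there for illustration.

import random
import wrapped_flappy_bird as game

game_state = game.GameState()                         # assumed constructor; one game instance
a_t = [1, 0]                                          # start with "no flap"
for _ in range(100):
	x_t, r_t, terminal = game_state.frame_step(a_t)   # raw RGB frame, reward, done flag
	a_t = [0, 1] if random.random() < 0.1 else [1, 0] # flap ~10% of the time, just to illustrate
	if terminal:                                      # the bird died, so the game restarts
		break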

Interested readers can modify the reward function in “game/wrapped_flappy_bird.py”, under the function def frame_step(self, input_actions) in the GitHub repository.

Image pre-processing

In order to make the code train faster, it is vital to do some image processing. Here are the key elements:

  1. Convert the color image into grayscale.
  2. Resize the image down to 84x84 pixels, as suggested in the paper.
  3. Stack 4 frames together before feeding them into the neural network.

Why do we need to stack 4 frames together? This is one way for the model to infer the movement of the bird, since a single static frame carries no information about its velocity.

import skimage.color
import skimage.exposure
import skimage.transform

def preprocess(image):
	image = skimage.color.rgb2gray(image)                                                # convert to grayscale
	image = skimage.transform.resize(image, (IMAGE_ROWS,IMAGE_COLS), mode = 'constant')  # resize to 84x84
	image = skimage.exposure.rescale_intensity(image, out_range=(0,255))                 # rescale pixel intensities
	image = image.reshape(1, image.shape[0], image.shape[1], 1)                          # add batch and channel dimensions
	return image

The input to the preprocess function is a single frame, which is converted to grayscale and resized to 84x84 (IMAGE_ROWS = IMAGE_COLS = 84). Later on, 4 such frames are stacked on top of each other and then given as input to the neural network, as sketched below.
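
For concreteness, here is a small sketch of that stacking step (the variable names and the dummy frame size are illustrative, not taken from the repository):

import numpy as np

IMAGE_ROWS = IMAGE_COLS = 84
# four dummy RGB frames standing in for consecutive game screens (the raw size is arbitrary here)
raw_frames = [np.random.randint(0, 256, (288, 512, 3)).astype(np.uint8) for _ in range(4)]
processed = [preprocess(x) for x in raw_frames]   # each has shape (1, 84, 84, 1)
s_t = np.concatenate(processed, axis = 3)         # shape (1, 84, 84, 4): one input to the network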

Model Definition

Now that we have processed the input, we can describe the model itself-

from keras.models import Model
from keras.layers import Input, Dense, Flatten, Convolution2D
from keras.optimizers import RMSprop

def buildmodel():
	print("Model building begins")

	S = Input(shape = (IMAGE_ROWS, IMAGE_COLS, IMAGE_CHANNELS, ), name = 'Input')
	h0 = Convolution2D(16, kernel_size = (8,8), strides = (4,4), activation = 'relu', kernel_initializer = 'random_uniform', bias_initializer = 'random_uniform')(S)
	h1 = Convolution2D(32, kernel_size = (4,4), strides = (2,2), activation = 'relu', kernel_initializer = 'random_uniform', bias_initializer = 'random_uniform')(h0)
	h2 = Flatten()(h1)
	h3 = Dense(256, activation = 'relu', kernel_initializer = 'random_uniform', bias_initializer = 'random_uniform') (h2)
	P = Dense(1, name = 'o_P', activation = 'sigmoid', kernel_initializer = 'random_uniform', bias_initializer = 'random_uniform') (h3)   # policy output: probability of flapping
	V = Dense(1, name = 'o_V', kernel_initializer = 'random_uniform', bias_initializer = 'random_uniform') (h3)                           # critic output: expected reward (value)

	model = Model(inputs = S, outputs = [P,V])
	rms = RMSprop(lr = LEARNING_RATE, rho = 0.99, epsilon = 0.1)
	# binarycrossentropy and sumofsquares are custom loss functions defined in the training script
	model.compile(loss = {'o_P': binarycrossentropy, 'o_V': sumofsquares}, loss_weights = {'o_P': 1., 'o_V' : 0.5}, optimizer = rms)
	return model

The exact architecture is as follows: the input to the neural network consists of 84 x 84 x 4 images. The first hidden layer convolves 16 filters of 8 x 8 with stride 4 and applies the ReLU activation function. The second layer convolves 32 filters of 4 x 4 with stride 2 and applies the ReLU activation function. The feature maps of the second layer are then flattened and fed into a final fully-connected hidden layer of 256 ReLU units. The output layer following this final hidden layer has two types of output-

  1. Policy Output : This output contains one neuron for each possible action. The value in each of these neurons indicates the probability of that particular action being taken, so the outputs of this type sum to 1. However, as there are only 2 possible actions in this game, we use a single output unit indicating the probability of flapping; the probability of not flapping is then one minus the probability of flapping. A sigmoid activation is used for this output so as to restrict its value to the valid range of probabilities [0, 1].

  2. Critic Output : The second type of output is the critic output. For each state given as input, the network also outputs the expected reward from that state. This expected reward acts as feedback to the network: if an action results in a reward higher than that predicted by the critic, then the weights of the neural network are tweaked so that this action is promoted the next time a similar state occurs. Similarly, actions yielding less reward than predicted by the critic become less likely to occur in the future. This is how A3C promotes higher-reward actions in the future.

Multithreading

Finally, now that our model is ready, we can define the parallel threads used for updates. In this project, 16 parallel threads are defined, each of which runs until terminal is true (the bird died) or until t_max steps have been performed, before weight updates are backpropagated. The value of t_max used is 5 steps, and hence the maximum number of inputs in each batch is 16 x 5 = 80.

class actorthread(threading.Thread):
	def __init__(self,thread_id, s_t, FIRST_FRAME, t):
		threading.Thread.__init__(self)
		self.thread_id = thread_id
		self.s_t = s_t
		self.FIRST_FRAME = FIRST_FRAME
		self.t = t

	def run(self):
		global episode_output
		global episode_r
		global episode_critic
		global episode_state

		# the lock serialises access to the shared episode_* arrays
		threadLock.acquire()
		# roll out up to t_max steps in this thread's game instance
		self.t, state_store, output_store, r_store, critic_store, self.FIRST_FRAME = runprocess(self.thread_id, self.s_t, self.FIRST_FRAME, self.t)
		self.s_t = state_store[-1]
		self.s_t = self.s_t.reshape(1, self.s_t.shape[0], self.s_t.shape[1], self.s_t.shape[2])

		# append this thread's rollout to the shared batch used for the next training update
		episode_r = np.append(episode_r, r_store)
		episode_output = np.append(episode_output, output_store)
		episode_state = np.append(episode_state, state_store, axis = 0)
		episode_critic = np.append(episode_critic, critic_store)

		threadLock.release()

A class actorthread is defined which maintains the information for each thread. Each thread has a thread_id and is initialized with an initial state s_t (s_t contains 4 stacked game frames) to be given as input to the neural network. FIRST_FRAME is a boolean variable which indicates whether we are about to start playing a new game; this is needed because every time the bird dies and the game ends, it restarts. threadLock.acquire() forces the threads to run synchronously, and threadLock.release() releases the lock once a thread no longer needs it (it has finished appending its results).
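
The code that launches the threads is not shown here, but one round of rollouts might be driven roughly like the sketch below (the bookkeeping lists states, first_frames and t_counters are hypothetical names, and threadLock and the episode_* arrays are assumed to be initialised as in the training script):

import numpy as np

states = [np.zeros((1, 84, 84, 4)) for _ in range(16)]   # hypothetical per-thread stacked-frame states
first_frames = [True] * 16
t_counters = [0] * 16

threads = []
for thread_id in range(16):
	threads.append(actorthread(thread_id, states[thread_id], first_frames[thread_id], t_counters[thread_id]))

for th in threads:
	th.start()
for th in threads:
	th.join()

# after joining, episode_state, episode_output, episode_r and episode_critic together
# hold at most 16 x 5 = 80 samples for one training update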

Each thread comes into action when it calls the runprocess function. The major components of this function are shown below-

def runprocess(thread_id, s_t, FIRST_FRAME, t):
	# (abbreviated: the initialisation of t_start, terminal and of the r_store, state_store,
	# output_store and critic_store arrays is omitted here for brevity)
	global T
	global a_t
	global model

	while t-t_start < t_max and terminal == False:
		t += 1
		T += 1
		intermediate_output = 0
		
		if FIRST_FRAME == False:
			# sample the action from the policy output (probability of flapping)
			with graph.as_default():
				out = model.predict(s_t)[0]			
			no = np.random.rand()
			a_t = [0,1] if no < out else [1,0]  #stochastic action
			#a_t = [0,1] if 0.5 <y[0] else [1,0]  #deterministic action

		x_t, r_t, terminal = game_state[thread_id].frame_step(a_t)
		x_t = preprocess(x_t)

		if FIRST_FRAME:
			# no history yet: fill the 4-frame stack with copies of the first frame
			s_t = np.concatenate((x_t, x_t, x_t, x_t), axis=3)
			FIRST_FRAME = False
		else:
			# push the newest frame into the stack, dropping the oldest one
			s_t = np.append(x_t, s_t[:, :, :, :3], axis=3)

		y = 0 if a_t[0] == 1 else 1   # label of the action taken (0 = no flap, 1 = flap)
		
		with graph.as_default():
			critic_reward = model.predict(s_t)[1]

		# store the reward, state and action label for this step
		r_store = np.append(r_store, r_t)
		state_store = np.append(state_store, s_t, axis = 0)
		output_store = np.append(output_store, y)
		
	if terminal == False:
		# rollout was cut off after t_max steps: bootstrap the last reward from the critic
		r_store[len(r_store)-1] = critic_store[len(r_store)-1]
	else:
		r_store[len(r_store)-1] = -1
		FIRST_FRAME = True
	
	# work backwards through the rollout, converting per-step rewards to discounted returns
	for i in range(2,len(r_store)+1):
		r_store[len(r_store)-i] = r_store[len(r_store)-i] + GAMMA*r_store[len(r_store)-i + 1]

	return t, state_store, output_store, r_store, critic_store, FIRST_FRAME

The runprocess function starts with the while loop in which each frame is processed. For each frame, the action chosen is flap with a probability equal to the policy output (the first output type). However, if it is the first frame of a new game, the default action of not flapping is chosen. This is because we cannot yet stack 4 consecutive frames into s_t to give as input to the network; hence all 4 frames in s_t are taken to be the same as the first frame, and the default action of no flap is chosen since the network has no knowledge of the bird's movement from those identical frames.

The thread runs for a maximum of t_max steps, and at each step the state, the action taken, the reward obtained and the expected reward predicted by the critic network are stored in the arrays state_store, output_store, r_store and critic_store respectively. Later, these arrays from each thread are concatenated and then sent to the model for training. The actual discounted reward value for each frame in the thread is calculated from the rewards obtained in future steps using the following formula-

R(s_t) = r(s_t) + \gamma R(s_{t+1})

where s_{t+1} is the state succeeding the current state s_t, r(s_t) is the immediate reward obtained in that state and \gamma is the discount factor. However, if the rollout is cut off before the bird dies, the discounted reward value for the final step of the thread is taken to be the reward predicted by the critic network for that state.
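
As a small worked example (with made-up rewards), the backward pass in the loop above turns per-step rewards into discounted returns as follows:

import numpy as np

GAMMA = 0.99
r_store = np.array([0.1, 0.1, 1.5, -1.0])   # made-up rewards for a short rollout that ends with the bird dying
for i in range(2, len(r_store) + 1):
	r_store[len(r_store) - i] += GAMMA * r_store[len(r_store) - i + 1]
print(r_store)   # -> approximately [ 0.699  0.605  0.51  -1.   ]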

Model Description

I hope that the model definition and thread configuration are clear by now. We will now discuss the loss function and the hyperparameters used by the model. As per the suggestion given in the paper, the following loss function is used for the policy output-

L_{policy} = -\sum_{i=1}^{n} \log\big(\pi(a_i \mid s_i; \theta)\big) \, A(s_i, a_i)

where n is the batch size, \theta are the network parameters, \pi(a_i \mid s_i; \theta) is the probability the policy output assigns to the action taken, and A(s_i, a_i) is the advantage function, which is the difference between the actual discounted reward obtained and the reward predicted by the critic network. As can be observed, the advantage can be both positive and negative. When the advantage is positive, it means that the policy taken was better than expected, and hence a positive advantage proportional to the goodness of the policy is multiplied into the loss. When the advantage is negative, it means that the policy was not good, and hence a negative advantage is multiplied in, so that the contribution these states make to the weight updates is reversed and the network will adopt this policy with a lower probability the next time a similar state is encountered.

The loss for the critic network is-

L_{critic} = \sum_{i=1}^{n} \big(R_i - V(s_i; \theta_v)\big)^2

where R_i is the actual discounted reward and V(s_i; \theta_v) is the reward predicted by the critic (in model.compile this loss is given a weight of 0.5 relative to the policy loss).
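
The custom losses binarycrossentropy and sumofsquares referenced in model.compile are defined in the training script. A minimal sketch of what they could look like, consistent with the formulas above, is given below; this is an assumption for illustration (with the advantage weighting assumed to be applied at training time through the targets or sample weights), not necessarily the exact code in the repository.

from keras import backend as K

def binarycrossentropy(y_true, y_pred):
	# policy loss: negative log-likelihood of the action actually taken
	# (y_true is 1 for flap and 0 for no flap; y_pred is the predicted flap probability)
	eps = 1e-7
	return -K.sum(K.log(y_true * y_pred + (1 - y_true) * (1 - y_pred) + eps), axis = -1)

def sumofsquares(y_true, y_pred):
	# critic loss: squared difference between the discounted return and the predicted value
	return K.sum(K.square(y_true - y_pred), axis = -1)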

The RMSprop optimizer is used to update the weights of the network. The learning rate used is 7e-4, which is decreased by 3.2e-8 in each epoch until it reaches 0. Other hyperparameters are mentioned below-

  • No. of threads = 16
  • Frames/thread used for each update = 5
  • Reward discount (gamma) = 0.99
  • RMSProp cache decay rate = 0.99
  • Entropy regularization, as suggested by the paper, has not been used.
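
A small sketch of the linear learning-rate decay described above (the function name is illustrative):

LEARNING_RATE = 7e-4
LR_DECAY = 3.2e-8

def decayed_lr(epoch):
	# linear decay of the learning rate until it reaches 0
	return max(0.0, LEARNING_RATE - LR_DECAY * epoch)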

The best model I obtained is quite good: the bird rarely dies and is able to cross more than 500 pipes easily. As the performance was already good, I didn't use entropy regularization as suggested by the paper.

The main aim of this blog post is to give a gentle introduction to Deep Reinforcement Learning and to demonstrate an implementation of a game AI using the A3C algorithm. I hope that it was a good learning experience. If you liked the post or have any doubts/suggestions, please mention them in the comments below.

Disclaimer

This work is based on the following repos and blogs-

  1. DeepLearningFlappyBird - yenchenlin
  2. Deep Reinforcement Learning: Pong from Pixels
  3. Using Keras and Deep Q-Network to Play FlappyBird
  4. Let’s make an A3C: Implementation
Written on June 23, 2017