Background vector created by freepik – www.freepik.com

## What is Reinforcement Learning?

Reinforcement learning is the branch of machine learning in which the system learns by trying to maximise the reward gained by taking an action or a series of actions and minimising the “punishment” given to it. In the most basic sense, we reward the system (called the agent) if it does what we want it to do and punish it if it does not. In this way, the agent correlates each series of actions it undertakes to a certain degree of reward or punishment and hence opts for the series of actions that returns the highest cumulative reward. In the training phase, the agent uses a strategy of exploration vs exploitation which is a form of a trial-and-error to try and see if greater magnitudes of rewards can be obtained versus strengthening already known paths to the highest known reward.

## What is Gridworld? How Does an Agent Find the Best Path to Win in a Gridworld?

In a Gridworld game, the space in which the agent can move is in the form of a grid. In this post, we shall use a 4×3 grid. Also, there are certain coordinates or “states” into which the agent cannot move. We shall use one such block in our game. Also, an agent cannot exit the grid by moving into states that are out of bounds and do not exist. The agent always starts at state (0, 0) and ends at either (2, 3), in which case it wins, or (1, 3), in which case it loses. The rewards for the win state and lose state respectively are +1 and -1. After the agent reaches either of these states, the game resets and the agent goes back to the start. The rewards are then back-propagated through the path of states which the agent took to reach the end. The back-propagation of rewards is calculated by the following equation:

Let’s try to understand this with an example. The grid looks like:

And the values of each of the states are initialised to 0.0, except for (1, 1) which is the block, (1, 3) which is the loss state and (2, 3) which is the win state. The loss state has a value of -1 and the win state has a value of +1. We store the states and their values as key value pairs in a dictionary, where the states are represented as a tuple of two integers and the values are floats. Also, if you use a dynamically typed language, you can also easily store the value of the block state as “BLK”. However, 0 works too since the block cannot be visited by the agent. The resulting grid of values looks like:

Note that the values of BLK, win state and loss state should not change. In this example, we shall assume that the agent takes the path (0, 0); (1, 0); (2, 0); (2, 1); (2, 2) and finally reaches win state (2, 3). Now we come to the backpropagation of rewards. We store the states visited by the agent in a list and then reverse the list. For each state in the reversed list, we calculate its value using the formula above. In this example, we shall keep the learning rate (alpha) as 0.2. The value of (2, 3) is already +1.0 so we take the next state, that is, (2, 2). The value of (2, 2) can be calculated as [0.0 + 0.2(1.0 – 0.0)], which is 0.2. We then update the value of (2, 2) and the resultant grid values becomes:

We then take the next state, (2, 1) and calculate its value as [0.0 + 0.2(0.2 – 0)]. The resultant grid becomes:

The value of the next state (2, 0) is calculated as [0.0 + 0.2(0.04 – 0)] and the resultant grid becomes:

Similarly, we calculate the values of (1, 0) and (0, 0) and the grid becomes:

After completing the back-propagation of rewards, the agent is said to have completed a run and returns to coordinate (0, 0) to move to random states (exploration) or to move to the state next to it with the highest value (exploitation). We make the agent complete a large number of runs to increase or decrease the values of states by larger amounts. At the end of 50 runs, the grid might look like:

NOTE: The values of the grid can change every time the program is freshly re-run as the exploration vs exploitation balance is decided by the generation of pseudo-random numbers.

Therefore, we see that if the agent uses a greedy algorithm to decide its actions, it would take the path (0, 0); (1, 0); (2, 0); (2, 1); (2, 2); (2, 3). This is the best path it could possibly take in order to reach the win state. Think of the agent as Guy Pearce’s character, Leonard Shelby, in the Christopher Nolan directed movie Memento (an absolute masterpiece, by the way) – the agent treats each state as a photograph and labels each “photograph” with an integer. Positive integers with greater values are interpreted by the agent to have highest returns, and negative integers the lowest. Through a series of trials-and-errors, the agent assigns higher values to the states leading to the win state and lower values to the states leading to the loss state. Hence, through reinforcement, the agent has learnt to follow the path which returns the highest reward.

## Constructing a Reinforcement Learning Model in Python

First, we create a global variable “grid” which is a dictionary that will store the state-value pairs. We initialise all the values as 0.0 and later change the win state, loss state and block state values to +1, -1 and BLK (or leave it as 0.0) respectively. A list of actions that the agent can take (up, down, left, right) is included in the variables. We then create a class for the agent in which we store all the variables and the functions the agent must do. Among these, the main functions are the possibleMove(), chooseAction() and the play() functions. The possibleMove() function takes an action as its parameter and checks whether the agent can move in the direction specified by the action and if so, returns the new state the agent will find itself in if it takes that action. The chooseAction() function is where the exploration or exploitation undertaken by the agent is decided – we generate a random number and if it is lower than a certain constant, it explores, otherwise, it exploits. The function returns the action chosen by the agent. The play() function is where the agent is trained – if the agent is located at one of the two ends, it back-propagates the reward and completes a run, otherwise, it tells the agent to move again. We also have a number of other functions that help the agent move, let it take the reward, check if the agent has reached the end, resets the game, and reverse the list of states. Finally, we have a function that prints out the grid values. The code for programming the RL model is given below:

``````import random

grid = {}
for i in range(3):
for j in range(4):
grid[(i, j)] = 0.0

class AI:
def __init__(self):
self.WinCoordinate = (2, 3)
self.LoseCoordinate = (1, 3)
self.StartCoordinate = (0, 0)
self.Block = (1, 1)
grid[(2, 3)] = 1.00
grid[(1, 3)] = -1.00
grid[(1, 1)] = "BLK"
self.position = (0, 0)
self.gameOver = False
self.reward = 0.0
self.positionsList = []
self.actionsList = ["up", "left", "down", "right"]
self.alpha = 0.2

def takeReward(self, pos):
if pos == self.WinCoordinate:
return 1.0
elif pos == self.LoseCoordinate:
return -1.0
else:
return 0.0

def reset(self):
self.gameOver = False
self.position = (0, 0)
self.positionsList = []

def isGameOver(self):
if self.position == self.WinCoordinate or self.position == self.LoseCoordinate:
self.gameOver = True

def possibleMove(self, action):
global possibleposition
if action == "up":
possibleposition = (self.position[0] - 1, self.position[1])
elif action == "down":
possibleposition = (self.position[0] + 1, self.position[1])
elif action == "right":
possibleposition = (self.position[0], self.position[1] + 1)
elif action == "left":
possibleposition = (self.position[0], self.position[1] - 1)
if 0 <= possibleposition[0] <= 2:
if 0 <= possibleposition[1] <= 3:
if possibleposition != self.Block:
self.position = possibleposition
return self.position

def chooseAction(self):
global action
maxpossiblereward = 0
if random.uniform(0.0, 1.0) <= 0.3:
action = random.choice(self.actionsList)
else:
for a in self.actionsList:
possiblereward = grid[self.possibleMove(a)]
if possiblereward >= maxpossiblereward:
action = a
maxpossiblereward = possiblereward
return action

def makeAMove(self, action):
position = self.possibleMove(action)
# self.positionsList.append(position)
return position

def Reverse(self, list):
list.reverse()
return list

def play(self, runs):
while runs > 0:
if self.gameOver:
#print("GAMEOVER POSITION {}".format(self.position))
reward = self.takeReward(self.position)
grid[self.position] = reward
self.positionsList = self.Reverse(self.positionsList)
# print("{}".format(self.positionsList))
for position in range(0, len(self.positionsList)):
reward = grid[self.positionsList[position]] + self.alpha * (reward - grid[self.positionsList[position]])
grid[self.positionsList[position]] = round(reward, 3)
#print(reward, grid[self.positionsList[position]])
self.reset()
runs = runs - 1
#print("_______________________________________")
else:
action = self.chooseAction()
#print("Current position: {} , Action: {}".format(self.position, action))
self.position = self.makeAMove(action)
self.positionsList.append(self.position)
self.isGameOver()
#print("Now, position is: {}".format(self.position))
#print("___________________________")

def printGridValues(self):
for i in range(0, 3):
print("----------------------------------")
line = "| "
for j in range(0, 4):
line = line + str(grid[(i, j)]) + " | "
print(line)
print("----------------------------------")

if __name__ == "__main__":
AI().play(50)
AI().printGridValues()``````

You can try experimenting with the size of the grid, position of end states and position and number of blocks in the grid. To make an agent undertake the best path after training, we simply need to make a greedy algorithm that makes the agent move to the immediate neighbour state which has the highest value. In this way, it always reaches the win state at the end.

The github link for the above program is: