Class Meeting 07: Markov Decision Processes (MDPs)


Today's Class Meeting



What You'll Need for Today's Class



Markov Decision Process (MDP)


A Markov decision process (MDP) is defined by the tuple \( \langle S, A, T, R, \gamma \rangle \), where:

  • \(S\) is the set of states
  • \(A\) is the set of actions
  • \(T(s_t, a_t, s_{t+1})\) is the transition function, giving the probability of reaching state \(s_{t+1}\) after taking action \(a_t\) in state \(s_t\)
  • \(R(s_t, a_t)\) is the reward function, giving the reward received for taking action \(a_t\) in state \(s_t\)
  • \(\gamma\) is the discount factor

Additionally, we can compute a policy \(\pi(s)\), which specifies the action to take in each state, and the value \(V(s)\) of each state.
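
If it helps to see these pieces concretely, here is a minimal sketch in Python (not part of the course materials) of how the tuple could be bundled together; the type aliases and field names are illustrative assumptions rather than a required interface.

```python
from typing import Callable, NamedTuple

# Illustrative type aliases (assumptions, not from the course materials).
State = tuple[int, int]   # e.g., a (row, column) grid cell
Action = str              # e.g., "UP" or "RIGHT"

class MDP(NamedTuple):
    """A minimal container for the tuple <S, A, T, R, gamma>."""
    states: list[State]                          # S: the set of states
    actions: list[Action]                        # A: the set of actions
    T: Callable[[State, Action, State], float]   # T(s, a, s'): probability of reaching s' from s via a
    R: Callable[[State, Action], float]          # R(s, a): reward for taking action a in state s
    gamma: float                                 # discount factor
```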

For today's class, the environment we're exploring is a grid world, pictured below.

[Figure: grid world]

With this grid world, we define a Markov decision process \( \langle S, A, T, R, \gamma \rangle \) in which the states are the cells of the grid, the actions move the agent between adjacent cells, and the reward function is given in each exercise below.

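Purely as an illustration (none of this comes from the course materials), the sketch below encodes one plausible reading of this grid world in Python. The grid size of 3 rows by 4 columns and the green cell at [2,3] follow from the state notation used in exercise #2 below; the red cell at [1,3], deterministic moves that stay put at a grid edge, treating the colored cells as terminal, the absence of obstacles, and the use of reward function #2 are all assumptions to adjust against the figure and whatever conventions we settle on in class.

```python
# One plausible encoding of the grid world, under the assumptions stated above.
ROWS, COLS = 3, 4
STATES = [(r, c) for r in range(ROWS) for c in range(COLS)]   # (row, column), origin at bottom left
ACTIONS = {"UP": (1, 0), "DOWN": (-1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
GREEN, RED = (2, 3), (1, 3)   # assumed terminal cells; RED's location is inferred, check the figure

def step(state, action):
    """Deterministic movement (an assumption): move one cell, or stay put at a grid edge."""
    if state in (GREEN, RED):                     # treat colored cells as terminal (assumption)
        return state
    dr, dc = ACTIONS[action]
    r, c = state[0] + dr, state[1] + dc
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else state

def T(s, a, s_next):
    """T(s, a, s'): probability 1 for the deterministic successor, 0 otherwise."""
    return 1.0 if step(s, a) == s_next else 0.0

def R(s, a):
    """Reward function #2 from the exercises below: +1 at green, -1 at red, -2 everywhere else.

    Reward function #1 of exercise #1 would instead return +2 in the final branch.
    """
    if s == GREEN:
        return 1.0
    if s == RED:
        return -1.0
    return -2.0
```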

Class Exercise #1: Determining MDP Policies with Different Reward Functions


For this exercise, you will determine the optimal policy for the grid world under two different reward functions; each reward function has its own optimal policy. (A short code sketch for checking your answers appears below.)

[Figure: grid world]

Reward function #1:

  • \(R(\textrm{green}) = +1\)
  • \(R(\textrm{red}) = -1\)
  • \(R(\textrm{everywhere else}) = +2\)

Reward function #2:

  • \(R(\textrm{green}) = +1\)
  • \(R(\textrm{red}) = -1\)
  • \(R(\textrm{everywhere else}) = -2\)

Note: You can find the solutions on this page.
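
If you would like to sanity-check your answers, note that a greedy policy can be read off from a value function by picking, in every state, the action that maximizes the one-step Bellman backup. The sketch below is illustrative rather than course-provided: `states`, `actions`, `T`, and `R` are whatever representation you are using (for example, the grid-world sketch above), `V` is a value function such as the one value iteration produces in exercise #2, and the default discount factor of 0.9 is a placeholder since no \(\gamma\) is fixed on this page.

```python
def greedy_policy(V, states, actions, T, R, gamma=0.9):
    """Extract pi(s) = argmax_a [ R(s, a) + gamma * sum_s' T(s, a, s') * V(s') ].

    gamma=0.9 is a placeholder; this page does not specify a discount factor.
    """
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: R(s, a) + gamma * sum(T(s, a, s2) * V[s2] for s2 in states),
        )
    return policy
```

Running this once per reward function makes it easy to compare the two resulting policies.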

Class Exercise #2: Value Iteration Computation


For today's class exercise, please get together in groups of 2-3 students.


Value Iteration


With our grid world and the reward function specified below (the same as reward function #2 from exercise #1), compute the value of each state in the grid world via value iteration, using the Bellman equation: \(V(s_t) = \max_{a_t} \Big( R(s_t,a_t) + \gamma \sum_{s_{t+1} \in S} T(s_t, a_t, s_{t+1}) \: V(s_{t+1}) \Big)\). Before the first iteration, assume that the value of every state is 0.
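
As a reference, here is a minimal sketch of synchronous value iteration built around that backup. It is an illustration, not the in-class procedure (in class we will apply backups along the trajectories listed further below rather than sweeping all states), and everything is passed in as a parameter because the grid representation, the discount factor, and the handling of the green and red states are conventions we will settle together.

```python
def value_iteration(states, actions, T, R, gamma, num_iters=50):
    """Synchronous value iteration with the Bellman backup
    V(s) <- max_a [ R(s, a) + gamma * sum_s' T(s, a, s') * V(s') ].
    """
    V = {s: 0.0 for s in states}          # start with V(s) = 0 for every state
    for _ in range(num_iters):
        V = {                             # rebuild V from the previous iteration's values
            s: max(
                R(s, a) + gamma * sum(T(s, a, s2) * V[s2] for s2 in states)
                for a in actions
            )
            for s in states
        }
    return V
```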

[Figure: grid world]

Reward function:

  • \(R(\textrm{green}) = +1\)
  • \(R(\textrm{red}) = -1\)
  • \(R(\textrm{everywhere else}) = -2\)

Note: To keep track of each state, we will denote each state by its row and column, where the bottom-left cell is the origin [0,0]. For example, the green state at time \(t\) is represented as \(s_t = [2,3]\).

To keep everyone on the same page, we're all going to follow the same set of trajectories:

           Trajectory 1   Trajectory 2   Trajectory 3   Trajectory 4   Trajectory 5   Trajectory 6
\(s_0\)    [0,0]          [0,0]          [0,0]          [0,0]          [0,0]          [0,0]
\(a_0\)    UP             UP             RIGHT          RIGHT          UP             RIGHT
\(s_1\)    [1,0]          [1,0]          [0,1]          [0,1]          [1,0]          [0,1]
\(a_1\)    UP             UP             RIGHT          RIGHT          UP             RIGHT
\(s_2\)    [2,0]          [2,0]          [0,2]          [0,2]          [2,0]          [0,2]
\(a_2\)    RIGHT          RIGHT          UP             RIGHT          RIGHT          UP
\(s_3\)    [2,1]          [2,1]          [1,2]          [0,3]          [2,1]          [1,2]
\(a_3\)    RIGHT          RIGHT          RIGHT          UP             RIGHT          UP
\(s_4\)    [2,2]          [2,2]          [1,3]          [1,3]          [2,2]          [2,2]
\(a_4\)    RIGHT          RIGHT                                        RIGHT          RIGHT
\(s_5\)    [2,3]          [2,3]                                        [2,3]          [2,3]

To get a better sense of how to attack this problem, we'll help you out by computing the values for the first trajectory and the first two steps of the second trajectory.

Note: You can find the solutions on this page.
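
For reference after class, here is a hedged sketch of how the by-hand procedure could be mirrored in code: walk each trajectory in order and apply the Bellman backup to every state it visits, reusing the values computed so far. The trajectory lists are transcribed from the table above; `states`, `actions`, `T`, `R`, and `gamma` are placeholders for whatever representation and discount factor we agree on in class, and how (or whether) the backup is applied at the final state of each trajectory is likewise a convention to settle.

```python
# The six trajectories from the table above, as lists of (row, column) states.
TRAJECTORIES = [
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],   # Trajectory 1
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],   # Trajectory 2
    [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3)],           # Trajectory 3
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3)],           # Trajectory 4
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],   # Trajectory 5
    [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)],   # Trajectory 6
]

def trajectory_value_updates(trajectories, states, actions, T, R, gamma):
    """Apply the Bellman backup to each visited state, in visit order across trajectories."""
    V = {s: 0.0 for s in states}          # every state starts at value 0
    for trajectory in trajectories:
        for s in trajectory:
            V[s] = max(
                R(s, a) + gamma * sum(T(s, a, s2) * V[s2] for s2 in states)
                for a in actions
            )
    return V
```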

Acknowledgments


The lecture for today's class was informed by Kaelbling et al. (1996) and Charles Isbell and Michael Littman's Udacity course on Reinforcement Learning.