Class Meeting 06: Exercise Solutions


On this page you can find solutions to the exercises described in class meeting 06 on Markov decision processes.


Class Exercise #1: Determining MDP Policies with Different Reward Functions


Below you can find the optimal policy for each state under each of the two specified reward functions. A small value-iteration sketch for computing such policies appears after the second reward function.

[Figure: grid world showing the optimal policy for the reward function below]

Reward function:

  • R(green)=+1
  • R(red)=-1
  • R(everywhere else)=+2

[Figure: grid world showing the optimal policy for the reward function below]

Reward function:

  • R(green)=+1
  • R(red)=-1
  • R(everywhere else)=-2
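
For reference, here is a minimal value-iteration sketch in Python for computing an optimal grid-world policy like the ones shown above. The grid layout (a wall at [1,1], the red terminal at [1,3], the green terminal at [2,3]), the discount factor gamma = 0.8, and the 0.8/0.1/0.1 motion noise are assumptions rather than values stated in the exercise; substitute the parameters of your own grid world.

ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINALS = {(2, 3): +1.0, (1, 3): -1.0}   # green exit, red exit
ACTIONS = {"UP": (1, 0), "DOWN": (-1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
GAMMA = 0.8   # assumed discount factor

def step(state, action):
    """Deterministic move; bumping into a wall or the grid edge leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALL:
        return state
    return nxt

def expected_value(V, state, action):
    """Expected next-state value: 0.8 intended direction, 0.1 each perpendicular slip."""
    return (0.8 * V[step(state, action)]
            + sum(0.1 * V[step(state, slip)] for slip in PERPENDICULAR[action]))

def value_iteration(living_reward, tol=1e-6):
    """Sweep Bellman backups over all states until the values stop changing."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in WALL]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINALS:
                new_v = TERMINALS[s]
            else:
                new_v = living_reward + GAMMA * max(expected_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def greedy_policy(V):
    """In each non-terminal state, pick the action with the best expected value."""
    return {s: max(ACTIONS, key=lambda a: expected_value(V, s, a))
            for s in V if s not in TERMINALS}

# Compare the two reward functions from the exercise: +2 vs. -2 everywhere else.
for living_reward in (+2.0, -2.0):
    V = value_iteration(living_reward)
    print(f"R(everywhere else) = {living_reward:+g}")
    print(greedy_policy(V))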


Class Exercise #2: Value Iteration Computation


The table below shows the value computed for each state visited along each of the specified trajectories, updated according to the Bellman equation V(s) = R(s) + γ · max_a Σ_s' P(s'|s,a) V(s').

[Figure: grid world]

Reward function:

  • R(green)=+1
  • R(red)=-1
  • R(everywhere else)=-2

      Trajectory 1     Trajectory 2     Trajectory 3     Trajectory 4     Trajectory 5     Trajectory 6
s0    [0,0]            [0,0]            [0,0]            [0,0]            [0,0]            [0,0]
V     V([0,0])=-2      V([0,0])=-2.32   V([0,0])=-2.47   V([0,0])=-3.77   V([0,0])=-4.89   V([0,0])=-5.09
a0    UP               UP               RIGHT            RIGHT            UP               RIGHT
s1    [1,0]            [1,0]            [0,1]            [0,1]            [1,0]            [0,1]
V     V([1,0])=-2      V([1,0])=-3.6    V([0,1])=-2      V([0,1])=-3.6    V([1,0])=-4.88   V([0,1])=-4.07
a1    UP               UP               RIGHT            RIGHT            UP               RIGHT
s2    [2,0]            [2,0]            [0,2]            [0,2]            [2,0]            [0,2]
V     V([2,0])=-2      V([2,0])=-3.6    V([0,2])=-2      V([0,2])=-2.34   V([2,0])=-4.88   V([0,2])=-3.70
a2    RIGHT            RIGHT            UP               RIGHT            RIGHT            UP
s3    [2,1]            [2,1]            [1,2]            [0,3]            [2,1]            [1,2]
V     V([2,1])=-2      V([2,1])=-3.6    V([1,2])=-2.28   V([0,3])=-2.08   V([2,1])=-3.54   V([1,2])=-3.06
a3    RIGHT            RIGHT            RIGHT            UP               RIGHT            UP
s4    [2,2]            [2,2]            [1,3]            [1,3]            [2,2]            [2,2]
V     V([2,2])=-2      V([2,2])=-1.52   V([1,3])=-1      V([1,3])=-1      V([2,2])=-1.66   V([2,2])=-1.74
a4    RIGHT            RIGHT                                              RIGHT            RIGHT
s5    [2,3]            [2,3]                                              [2,3]            [2,3]
V     V([2,3])=1       V([2,3])=1                                         V([2,3])=1       V([2,3])=1

Trajectories 3 and 4 terminate at [1,3] (the red square) after four actions, so they have no entries for a4 and s5.
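
These numbers can be reproduced by applying one Bellman backup per visited state, in the order the states appear along each trajectory, with all values initialized to 0 and a terminal state's value set to its reward when it is reached. The Python sketch below does exactly that; the grid layout (wall at [1,1], red terminal at [1,3], green terminal at [2,3]), the discount factor gamma = 0.8, and the 0.8/0.1/0.1 motion noise are assumptions consistent with the table rather than parameters stated in the exercise.

ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINALS = {(2, 3): +1.0, (1, 3): -1.0}   # green exit, red exit
LIVING_REWARD = -2.0
GAMMA = 0.8                                 # assumed discount factor
ACTIONS = {"UP": (1, 0), "DOWN": (-1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, action):
    """Deterministic move; bumping into a wall or the grid edge leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    return state if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALL else nxt

def expected_value(V, state, action):
    """Expected next-state value: 0.8 intended direction, 0.1 each perpendicular slip."""
    return (0.8 * V[step(state, action)]
            + sum(0.1 * V[step(state, slip)] for slip in PERPENDICULAR[action]))

# The six trajectories from the table (visited states only).
trajectories = [
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],  # Trajectory 1
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],  # Trajectory 2
    [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3)],          # Trajectory 3
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3)],          # Trajectory 4
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],  # Trajectory 5
    [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)],  # Trajectory 6
]

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if (r, c) not in WALL}
for i, traj in enumerate(trajectories, start=1):
    print(f"Trajectory {i}")
    for s in traj:
        if s in TERMINALS:
            V[s] = TERMINALS[s]   # a terminal state's value is just its reward
        else:
            # One Bellman backup: V(s) = R(s) + gamma * max_a E[V(s') | s, a]
            V[s] = LIVING_REWARD + GAMMA * max(expected_value(V, s, a) for a in ACTIONS)
        print(f"  V({list(s)}) = {V[s]:.2f}")

Running this prints the value of each state as it is visited, matching the table above up to rounding in the second decimal place. Note that each backup maximizes over all four actions, not just the action the trajectory happened to take.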