Class Meeting 06: Exercise Solutions


On this page you can find solutions to the exercises described in class meeting 06 on Markov decision processes.


Class Exercise #1: Determining MDP Policies with Different Reward Functions


Below you can find the optimal policy for each state under each of the two specified reward functions. A small value-iteration sketch for computing such policies appears after the second reward function.

[Figure: grid world showing the optimal policy for the reward function below]

Reward function:

  • R(green)=+1
  • R(red)=-1
  • R(everywhere else)=+2

[Figure: grid world showing the optimal policy for the reward function below]

Reward function:

  • R(green)=+1
  • R(red)=-1
  • R(everywhere else)=-2
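
For reference, here is a minimal value-iteration sketch in Python for computing an optimal grid-world policy like the ones shown above. The grid layout (a wall at [1,1], the red terminal at [1,3], the green terminal at [2,3]), the discount factor gamma = 0.8, and the 0.8/0.1/0.1 motion noise are assumptions rather than values stated in the exercise; substitute the parameters of your own grid world.

ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINALS = {(2, 3): +1.0, (1, 3): -1.0}   # green exit, red exit
ACTIONS = {"UP": (1, 0), "DOWN": (-1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
GAMMA = 0.8   # assumed discount factor

def step(state, action):
    """Deterministic move; bumping into a wall or the grid edge leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALL:
        return state
    return nxt

def expected_value(V, state, action):
    """Expected next-state value: 0.8 intended direction, 0.1 each perpendicular slip."""
    return (0.8 * V[step(state, action)]
            + sum(0.1 * V[step(state, slip)] for slip in PERPENDICULAR[action]))

def value_iteration(living_reward, tol=1e-6):
    """Sweep Bellman backups over all states until the values stop changing."""
    states = [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in WALL]
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in TERMINALS:
                new_v = TERMINALS[s]
            else:
                new_v = living_reward + GAMMA * max(expected_value(V, s, a) for a in ACTIONS)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            return V

def greedy_policy(V):
    """In each non-terminal state, pick the action with the best expected value."""
    return {s: max(ACTIONS, key=lambda a: expected_value(V, s, a))
            for s in V if s not in TERMINALS}

# Compare the two reward functions from the exercise: +2 vs. -2 everywhere else.
for living_reward in (+2.0, -2.0):
    V = value_iteration(living_reward)
    print(f"R(everywhere else) = {living_reward:+g}")
    print(greedy_policy(V))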


Class Exercise #2: Value Iteration Computation


The table below shows the value computed for each state visited along each of the specified trajectories, updated according to the Bellman equation V(s) = R(s) + γ · max_a Σ_s' P(s'|s,a) V(s').

[Figure: grid world]

Reward function:

  • R(green)=+1
  • R(red)=-1
  • R(everywhere else)=-2

      Trajectory 1     Trajectory 2     Trajectory 3     Trajectory 4     Trajectory 5     Trajectory 6
s0    [0,0]            [0,0]            [0,0]            [0,0]            [0,0]            [0,0]
V     V([0,0])=-2      V([0,0])=-2.32   V([0,0])=-2.47   V([0,0])=-3.77   V([0,0])=-4.89   V([0,0])=-5.09
a0    UP               UP               RIGHT            RIGHT            UP               RIGHT
s1    [1,0]            [1,0]            [0,1]            [0,1]            [1,0]            [0,1]
V     V([1,0])=-2      V([1,0])=-3.6    V([0,1])=-2      V([0,1])=-3.6    V([1,0])=-4.88   V([0,1])=-4.07
a1    UP               UP               RIGHT            RIGHT            UP               RIGHT
s2    [2,0]            [2,0]            [0,2]            [0,2]            [2,0]            [0,2]
V     V([2,0])=-2      V([2,0])=-3.6    V([0,2])=-2      V([0,2])=-2.34   V([2,0])=-4.88   V([0,2])=-3.70
a2    RIGHT            RIGHT            UP               RIGHT            RIGHT            UP
s3    [2,1]            [2,1]            [1,2]            [0,3]            [2,1]            [1,2]
V     V([2,1])=-2      V([2,1])=-3.6    V([1,2])=-2.28   V([0,3])=-2.08   V([2,1])=-3.54   V([1,2])=-3.06
a3    RIGHT            RIGHT            RIGHT            UP               RIGHT            UP
s4    [2,2]            [2,2]            [1,3]            [1,3]            [2,2]            [2,2]
V     V([2,2])=-2      V([2,2])=-1.52   V([1,3])=-1      V([1,3])=-1      V([2,2])=-1.66   V([2,2])=-1.74
a4    RIGHT            RIGHT                                              RIGHT            RIGHT
s5    [2,3]            [2,3]                                              [2,3]            [2,3]
V     V([2,3])=1       V([2,3])=1                                         V([2,3])=1       V([2,3])=1

Trajectories 3 and 4 terminate at [1,3] (the red square) after four actions, so they have no entries for a4 and s5.
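
These numbers can be reproduced by applying one Bellman backup per visited state, in the order the states appear along each trajectory, with all values initialized to 0 and a terminal state's value set to its reward when it is reached. The Python sketch below does exactly that; the grid layout (wall at [1,1], red terminal at [1,3], green terminal at [2,3]), the discount factor gamma = 0.8, and the 0.8/0.1/0.1 motion noise are assumptions consistent with the table rather than parameters stated in the exercise.

ROWS, COLS = 3, 4
WALL = {(1, 1)}
TERMINALS = {(2, 3): +1.0, (1, 3): -1.0}   # green exit, red exit
LIVING_REWARD = -2.0
GAMMA = 0.8                                 # assumed discount factor
ACTIONS = {"UP": (1, 0), "DOWN": (-1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}
PERPENDICULAR = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
                 "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}

def step(state, action):
    """Deterministic move; bumping into a wall or the grid edge leaves the agent in place."""
    r, c = state
    dr, dc = ACTIONS[action]
    nxt = (r + dr, c + dc)
    return state if not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS) or nxt in WALL else nxt

def expected_value(V, state, action):
    """Expected next-state value: 0.8 intended direction, 0.1 each perpendicular slip."""
    return (0.8 * V[step(state, action)]
            + sum(0.1 * V[step(state, slip)] for slip in PERPENDICULAR[action]))

# The six trajectories from the table (visited states only).
trajectories = [
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],  # Trajectory 1
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],  # Trajectory 2
    [(0, 0), (0, 1), (0, 2), (1, 2), (1, 3)],          # Trajectory 3
    [(0, 0), (0, 1), (0, 2), (0, 3), (1, 3)],          # Trajectory 4
    [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)],  # Trajectory 5
    [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 3)],  # Trajectory 6
]

V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS) if (r, c) not in WALL}
for i, traj in enumerate(trajectories, start=1):
    print(f"Trajectory {i}")
    for s in traj:
        if s in TERMINALS:
            V[s] = TERMINALS[s]   # a terminal state's value is just its reward
        else:
            # One Bellman backup: V(s) = R(s) + gamma * max_a E[V(s') | s, a]
            V[s] = LIVING_REWARD + GAMMA * max(expected_value(V, s, a) for a in ACTIONS)
        print(f"  V({list(s)}) = {V[s]:.2f}")

Running this prints the value of each state as it is visited, matching the table above up to rounding in the second decimal place. Note that each backup maximizes over all four actions, not just the action the trajectory happened to take.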