Class Meeting 08: Reinforcement Learning


Today's Class Meeting



What You'll Need for Today's Class



Q-Learning Algorithm


\(\textrm{Algorithm Q\_Learning}:\)
\( \qquad \textrm{initialize} \: Q \)
\( \qquad t = 0 \)
\( \qquad \textrm{while} \: Q \: \textrm{has not converged:} \)
\( \qquad \qquad \textrm{select} \: a_t \: \textrm{at random} \)
\( \qquad \qquad \textrm{perform} \: a_t \)
\( \qquad \qquad \textrm{receive} \: r_t \)
\( \qquad \qquad \textrm{observe} \: s_{t+1} \)
\( \qquad \qquad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \Big( r_t + \gamma \cdot \: \textrm{max}_a Q(s_{t+1}, a) - Q(s_t, a_t) \Big)\)
\( \qquad \qquad t = t + 1\)

Where:

Finally, once \(Q\) has converged (to \(Q^*\)), we can use the following to extract the optimal policy: $$\pi^*(s) = \textrm{arg max}_a \: Q^*(s,a)$$
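For reference, here is a minimal Python sketch of this algorithm in the tabular setting. It assumes a hypothetical environment object `env` with `reset()` and `step(action)` methods (not part of the class notes), and the values of alpha and gamma are placeholders for illustration.

import numpy as np

# Minimal tabular Q-learning sketch following the pseudocode above.
# `env` is a hypothetical environment with reset() -> state and
# step(action) -> (next_state, reward); alpha and gamma are example values.
def q_learning(env, num_states, num_actions, alpha=0.1, gamma=0.8, num_steps=10_000):
    Q = np.zeros((num_states, num_actions))     # initialize Q
    s = env.reset()
    for t in range(num_steps):                  # stand-in for "while Q has not converged"
        a = np.random.randint(num_actions)      # select a_t at random
        s_next, r = env.step(a)                 # perform a_t, receive r_t, observe s_{t+1}
        # Q(s_t, a_t) <- Q(s_t, a_t) + alpha * (r_t + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t))
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q

# Once Q has converged (Q*), the optimal policy is the argmax over actions:
# pi_star = np.argmax(Q, axis=1)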


Class Exercise: Q-Learning in a 5 Room Building


For today's class exercise, please get together in groups of 2-3 students.


Specifying the MDP Model


Let's consider a situation in which our Turtlebot3 robot is placed inside a building that has a floorplan like that shown in the following image.

We can characterize this space as an MDP, where each state represents one room in the building (or outside, e.g., room 5) and where the agent can transition between rooms by moving either north, south, east, or west. The agent cannot stay in the same state from time step to time step, except once it is outside.

[image: floorplan of the building showing rooms 0-5 and the north/south/east/west transitions between them]

We can specify the reward function as depicted in the following diagram, where the agent receives a reward when it transitions outside from either room 1 or room 4. Additionally, once the agent is outside, it continues to receive a reward at every time step.

[image: reward diagram showing the rewards for transitioning outside from room 1 or room 4, and for remaining outside]
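As a concrete sketch of this reward structure, the reward function could be written as follows. The magnitude of 100 is a hypothetical placeholder; the actual values are given in the diagram.

OUTSIDE = 5

# Sketch of the reward structure described above: the agent is rewarded for
# transitioning outside (room 5) from room 1 or room 4, and for every time
# step it remains outside. The value 100 is a hypothetical placeholder.
def reward(state, next_state):
    if next_state == OUTSIDE and state in (1, 4, OUTSIDE):
        return 100
    return 0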

Our robot's goal in this world is to get outside (room 5) from its starting position in room 2.

[image: the robot's start in room 2 and its goal of reaching room 5 (outside)]

Applying the Q-Learning Algorithm in this World


The following table / matrix can be used to represent \(Q(s_t, a_t)\), where the rows represent the world states (rooms) and the columns represent the actions the robot can take. The table is initialized with all state-action pairs set to 0; a short code sketch of this representation follows the table.

       North   South   East   West
rm0    0       0       0      0
rm1    0       0       0      0
rm2    0       0       0      0
rm3    0       0       0      0
rm4    0       0       0      0
rm5    0       0       0      0
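If you'd like to check your hand calculations as you go, one way to represent this table in code is a 6 x 4 NumPy array, along with a helper that applies a single Q-learning update. Plug in the alpha and gamma values from the exercise assumptions; they are not hard-coded here.

import numpy as np

ACTIONS = ["North", "South", "East", "West"]   # columns of the table above
Q = np.zeros((6, len(ACTIONS)))                # rows = rooms 0-5, all zeros to start

# Apply one Q-learning update for a single (state, action, reward, next state)
# step; alpha and gamma come from the exercise assumptions.
def q_update(Q, s, a, r, s_next, alpha, gamma):
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    return Q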

Your goal in this exercise is to update \(Q(s_t, a_t)\) according to the Q-learning algorithm. We will make the following assumptions:

Here are the action trajectories that you'll have your robot agent follow through the environment (assume that your agent always starts in room 2 at the beginning of each trajectory). For each action within each trajectory, update \(Q(s_t, a_t)\) according to the Q-learning algorithm. You can check your answers on this page, which provides \(Q(s_t, a_t)\) after each trajectory.


Extracting a Policy from Your Q-matrix


Now that we've executed 4 trajectories through our world, determine the best policy for our robot, given the information we've stored in our Q-matrix. As a reminder, you can use this formula to determine the policy, which represents the best action the robot should take in each possible state: $$\pi(s) = \textrm{arg max}_a \: Q(s,a)$$
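Here is a small sketch of this extraction step, assuming the Q-matrix is stored as the 6 x 4 array from the earlier sketch. Note that ties (such as an all-zero row) default to the first action.

import numpy as np

ACTIONS = ["North", "South", "East", "West"]
Q = np.zeros((6, len(ACTIONS)))    # replace with your filled-in Q-matrix

# For each room (row), pick the action (column) with the largest Q-value.
policy = {room: ACTIONS[int(np.argmax(Q[room]))] for room in range(Q.shape[0])}
print(policy)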

As with the last exercise, you can check your answers on this page.


More on Reinforcement Learning


If you want to learn more about reinforcement learning or go deeper into the topics introduced in class, please check out:


Deep Reinforcement Learning


Deep reinforcement learning combines reinforcement learning with deep neural networks: observations about the environment (similar to states) are fed as inputs to a neural network, and the network's outputs represent the Q-values of each possible action the robot can take. Deep RL can handle states that are higher-dimensional than those traditional Q-learning can manage, such as raw images from a camera or raw sensor data from a robot's LiDAR. The neural network that learns the mapping from states to Q-values for each action represents the learned policy.
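As a rough illustration (not the implementation used in the papers below), a Q-network might look like the following PyTorch sketch, where the observation size and layer widths are hypothetical.

import torch
import torch.nn as nn

# Maps an observation (e.g., a flattened 360-reading LiDAR scan) to one
# Q-value per discrete action; sizes here are hypothetical.
class QNetwork(nn.Module):
    def __init__(self, obs_dim=360, num_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, obs):
        return self.net(obs)

# The greedy policy picks the action with the highest predicted Q-value.
q_net = QNetwork()
obs = torch.zeros(1, 360)                  # placeholder observation
best_action = q_net(obs).argmax(dim=1)     # index of the best action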

The following are two examples of cutting-edge research in robotics that leverage deep reinforcement learning to teach robots how to walk and how to grasp objects:

Paper Citation: Rudin, N., Hoeller, D., Reist, P., & Hutter, M. (2022, January). Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning (pp. 91-100). PMLR.

Paper Citation: Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., & Levine, S. (2018). Qt-opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293.


Acknowledgments


The lecture for today's class was informed by Kaelbling et al. (1996), Charles Isbell and Michael Littman's Udacity course on Reinforcement Learning, and geeksforgeeks.org's article on the upper confidence bound for reinforcement learning. The class exercise for today was inspired by the Q-learning tutorial on mnemstudio.org.