Q-Learning Project

Objectives

Your goal in this project is to implement a Q-Learning algorithm to give your robot the ability to learn how to organize items in its environment using reinforcement learning. This project will also involve components of both 1) perception - to detect the items to organize and the locations for drop-off and 2) control - to make the robot arm pick up the items and then navigate to the locations where they are to be dropped off. Like before, if you have questions or find yourself getting stuck, please post on the course Slack or send an email to the teaching team (uchicago.intro.robotics@gmail.com). Also, feel free to share with your peers on Slack what's working well for you.

Learning Goals

Continue gaining experience with ROS and robot programming

Gain experience with robot perception using the robot's RGB camera to identify objects and drop-off locations

Learn about robot manipulation through programming a multiple degree-of-freedom robot arm

Gain experience and intuition with reinforcement learning through a hands-on implementation of the Q-Learning algorithm

Teaming & Logistics

You are expected to work with 1 other student for this project, who is different than your particle filter project partner. If you strongly prefer working by yourself, please reach out to the teaching team to discuss your individual case. A team of 3 will only be allowed if there is an odd number of students. Your team will submit your code and writeup together (in 1 Github repo).

Deliverables

Like before, you'll submit this project on Github (both the code and the writeup). Have one team member fork our starter git repo to get our starter code and so that we can track your project. Both of your team members will contribute code to this one forked repo.

Code

The code that you develop for this project should be in new Python ROS nodes that you create within the scripts folder. DO NOT EDIT THE FILES WE PROVIDE IN THE STARTER GIT REPO (e.g., scripts/reset_world.py, scripts/phantom_robot_movement.py, any of the custom message files in the /msg directory). If you wish to create your own launch files or additional custom ROS messages, you're welcome to do that. We just ask that you don't edit the launch files, ROS nodes, and custom messages that we've provided in the starter code.

Note: If you're interested in learning how to make your own custom launch files (which can run multiple ROS nodes in the same terminal), I'd recommend checking out this Roslaunch tips for larger projects tutorial and this roslaunch XML page.
Additionally, if you're interested in learning how to define custom ROS messages to use in ROS topics you create I'd recommend looking at this Creating a ROS msg and srv tutorial.

Implementation Plan

Please put your implementation plan within your README.md file. Your implementation plan should contain the following:

The names of your team members
A 1-2 sentence description of how your team plans to implement each of the following components of this project as well as a 1-2 sentence description of how you will test each component:
- Q-learning algorithm
  - Executing the Q-learning algorithm
  - Determining when the Q-matrix has converged
  - Once the Q-matrix has converged, how to determine which actions the robot should take to maximize expected reward
- Robot perception
  - Determining the identities and locations of the three colored dumbbells
  - Determining the identities and locations of the three numbered blocks
- Robot manipulation & movement
  - Picking up and putting down the dumbbells with the OpenMANIPULATOR arm
  - Navigating to the appropriate locations to pick up and put down the dumbbells
A brief timeline sketching out when you would like to have accomplished each of the components listed above.

Writeup

Modify the README.md file as your writeup for this project. Please add pictures, Youtube videos, and/or embedded animated gifs to showcase and describe your work. Your writeup should contain:

Objectives description (2-3 sentences): Describe the goal of this project.
High-level description (1 paragraph): At a high-level, describe how you used reinforcement learning to solve the task of determining which dumbbells belong in front of each numbered block.
Q-learning algorithm description: Describe how you accomplished each of the following components of the Q-learning algorithm in 1-3 sentences, and also describe what functions / sections of the code executed each of these components(1-3 sentences per function / portion of code):
- Selecting and executing actions for the robot (or phantom robot) to take
- Updating the Q-matrix
- Determining when to stop iterating through the Q-learning algorithm
- Executing the path most likely to lead to receiving a reward after the Q-matrix has converged on the simulated Turtlebot3 robot
Robot perception description: Describe how you accomplished each of the following components of the perception elements of this project in 1-3 sentences, any online sources of information/code that helped you to recognize the objects, and also describe what functions / sections of the code executed each of these components (1-3 sentences per function / portion of code):
- Identifying the locations and identities of each of the colored dumbbells
- Identifying the locations and identities of each of the numbered blocks
Robot manipulation and movement: Describe how you accomplished each of the following components of the robot manipulation and movement elements of this project in 1-3 sentences, and also describe what functions / sections of the code executed each of these components (1-3 sentences per function / portion of code):
- Moving to the right spot in order to pick up a dumbbell
- Picking up the dumbbell
- Moving to the desired destination (numbered block) with the dumbbell
- Putting the dumbbell back down at the desired destination
Challenges (1 paragraph): Describe the challenges you faced and how you overcame them.
Future work (1 paragraph): If you had more time, how would you improve your implementation?
Takeaways (at least 2 bullet points with 2-3 sentences per bullet point): What are your key takeaways from this project that would help you/others in future robot programming assignments working in pairs? For each takeaway, provide a few sentences of elaboration.

gif

In your writeup, include a gif of your robot successfully executing the task once your Q matrix has converged.

rosbag

Record a run of your q-learning algorithm in a rosbag. Please record the following topics: /cmd_vel, /gazebo/set_model_state, /q_learning/q_matrix, /q_learning/reward, /q_learning/robot_action, /scan, and any other topics you generate and use in your particle filter project. Please do not record all of the topics, since the camera topics make the rosbags very large. For ease of use, here's how to record a rosbag:

$ rosbag record -O filename.bag topic-names

Please refer to the ROS Resources page for further details on how to record a rosbag.

Partner Contributions Survey

The final deliverable is ensuring that each team member completes the Partner Contributions Google Survey. The purpose of this survey is to accurately capture the contributions of each partner to your combined particle filter project deliverables.

Deadlines & Submission

Wednesday, February 17 11:00am CST - Implementation Plan
Monday, March 1 11:00am CST - Code, Writeup, gif, rosbag, Partner Contributions Survey

Submission

As was true with the prior projects, we will consider your latest commit before 11:00 AM CST as your submission for each deadline. Do not forget to push your changes to your forked github repos. You do not need to email us your repo link, since we will be able to track your repo via your fork. If you want to use any of your flex late hours for this assignment, please email us and let us know (so we know to clone your code at the appropriate commit for grading).

Your Objective

Your goal in this project is to computationally determine what actions the robot should take in order to achieve the goal state (where each colored dumbbell is placed in front of the correct numbered block) using reinforcement learning. Additionally, you will program the robot to pick up, move, and place each colored dumbbell in accordance with the learned goal state.

To launch the Gazebo world for this project, run:

$ roslaunch q_learning_project turtlebot3_intro_robo_manipulation.launch

You should see the world pictured below. Our Turtlebot3 is now equipped with an OpenMANIPULATOR arm.

One important feature to make you aware of is that this Gazebo world file is launched with the parameter paused set to true (turtlebot3_intro_robo_manipulation.launch lines 4 and 12). Whenever you want to start running, you'll need to press the play button the bottom left hand corner of your Gazebo window (circled in light blue in the picture above).

Robot Actions: Picking Up Dumbbells and Placing them in Front of Numbered Blocks

One key component to this project is building a ROS node that can execute actions published to the /q_learning/robot_action ROS topic (with a custom message type of q_learning/RobotMoveDBToBlock). When your ROS node receives a message on this topic, it should:

Move to and pick up the dumbbell specified by the robot_db attribute of the q_learning/RobotMoveDBToBlock message
Carry the dumbbell to the block specified by the block_id attribute of the q_learning/RobotMoveDBToBlock message, and
Put the dumbbell down.

Write your code for this node in new Python file(s) within the /scripts directory. The following subsections will give you some more details and helpful tips on the perception and robot manipulator control components to programming these robot actions.

Perception

In order to pick up the dumbbells and carry them to the numbered block locations, your robot will need to be able to perceive:

Each of the three colored dumbbells and their locations, and
Each of the three numbered blocks and their locations

At the end of each iteration of the Q-learning algorithm, the blocks and dumbbells will randomly switch their order (see the gif in the "Learning Task" section). So your robot will need to be able to discern the difference, for example, between the green and red dumbbell as well as between block 1 and block 3.

Note: For the perception part of this Q-learning project it is OK, and actually encouraged, to seek implementations on the internet for number/digit recognition and color recognition of objects from the RGB camera feed of the robot. If you do use code or information from online sources outside of this class website, cite them appropriately both in your code and in your writeup.

To launch the Turtlebot3 RViz window, run:

$ roslaunch turtlebot3_gazebo turtlebot3_gazebo_rviz.launch

This should bring up an RViz window like the one pictured below.

One important thing to note is that you can visualize what the robot sees through it's RGB camera by checking the check box next to "Camera" (see the image above). Once you do, you can see "through the eyes of the robot" (see the image below).

You'll likely want to use a combination of the /scan and /camera/rgb/image_raw ROS topics to identify and locate the objects in the environment. For the detection of the dumbbells, you're more than welcome to leverage the code that we used for the line follower in class meeting 03.

Digit Recognition

While you are free to use any method you can find online recognizing the digits on the blocks, we recommend keras_ocr, which provides pre-trained and an end-to-end training pipeline for character recognition. On a high level, you can input an image and expect an output that details the characters found in the image and their location. For this project, you need to only use the pre-trained models.

To use keras_ocr, you need to first install it via

$ pip install keras-ocr

Next, you need to set up the pipeline in your python script,

import keras_ocr
.
.
.
# download pre-trained model
pipeline = keras_ocr.pipeline.Pipeline()

# Once you have the pipeline, you can use it to recognize characters,

# images is a list of images in the cv2 format
images = [img1, img2, ...]

# call the recognizer on the list of images
prediction_groups = pipline.recognize(images)

# prediction_groups is a list of predictions for each image
# prediction_groups[0] is a list of tuples for recognized characters for img1
# the tuples are of the formate (word, box), where word is the word
# recognized by the recognizer and box is a rectangle in the image where the recognized words reside

For more information, please visit the documentation.

Robot Manipulator Control

In order to enable your robot to pick up the dumbbells, you'll need to get familiar with programming the Turtlebot3's OpenMANIPULATOR arm. Here's a list of resources to help you get up and running:

Class Exercise #2 from Class Meeting 09: Robot Manipulator Kinematics
The Simulation section on the Manipulation page of the Turtlebot3 ROBOTIS e-Manaual
The MoveIt tutorials, and specifically the Python interface for MoveIt tutorial and the Python code referenced in that tutorial
- We encourage you to read through this Python code and apply what you learn to your implementation

The following gif shows an example of the Turtlebot3 OpenMANIPULATOR arm picking up one of the dumbbells (note: this was in a prior iteration of the development of this project before the dumbbells had colors).

Learning Task

As we mentioned above, your goal in this project is to computationally determine what actions the robot should take in order to achieve the goal state (where each colored dumbbell is placed in front of the correct numbered block) using reinforcement learning, and specifically, Q-learning. You will implement your Q-learning algorithm in a new ROS node that you'll compose in a new Python file(s) within the /scripts directory. Your Q-learning algorithm should proceed as follows:

Q-Learning Algorithm

\(\textrm{Algorithm Q_Learning}:\)
\( \qquad \textrm{initialize} \: Q \)
\( \qquad t = 0 \)
\( \qquad \textrm{while} \: Q \: \textrm{has not converged:} \)
\( \qquad \qquad \textrm{select} \: a_t \: \textrm{at random} \)
\( \qquad \qquad \textrm{perform} \: a_t \)
\( \qquad \qquad \textrm{receive} \: r_t \)
\( \qquad \qquad Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \Big( r_t + \gamma \cdot \: \textrm{max}_a Q(s_{t+1}, a) - Q(s_t, a_t) \Big)\)
\( \qquad \qquad t = t + 1\)

Where:

\(Q: S \times A \rightarrow \mathbb{R} \), meaning that for a given state \( (s_t) \) and action \( (a_t) \), \(Q(s_t, a_t)\) is a real number \((\mathbb{R})\)
\(s_t \in S\) represents the state space - all possible states of the world
\(a_t \in A\) represents the action space - all possible actions the robot could take
\(r_t \) is the reward received when action \(a_t\) is taken from state \(s_t\) according to the reward function \(r(s_t, a_t)\)
\(\alpha\) represents the learning rate \((0 \le \alpha \le 1) \), the closer to 1 the learning rate is, the more rapidly your Q-matrix will update to observed rewards; for this project it is recommended that you start out with \( \alpha = 1\), if you want to experiment with the learning rate, we'd recommend doing that only after have a fully working Q-learning algorithm with \( \alpha = 1\)
\(\gamma\) is the discount factor where \((0 \le \gamma \le 1) \), which indicates how much future rewards are considered in the actions of a system, \(\gamma = 0\) would indicate a scenario in which a robot only cares about the rewards it could get in a single moment, as where \(\gamma = 1\) indicates a scenario where the robot doesn't care at all about the short term and pursues whatever actions will produce the best long-term reward (regardless of how long that takes), for this project, we recommend choosing a value for \(\gamma\) somewhere in between 0 and 1

ROS Topics & Messages

You will need to publish and subscribe to several ROS topics to complete your Q-learning task:

ROS Topic	ROS msg Type	Notes
`/q_learning/q_matrix`	`q_learning/QMatrix`	Each time you update your Q-matrix in your Q-learning algorithm, publish your Q-matrix to this topic.
`/q_learning/reward`	`q_learning/QLearningReward`	You will subscribe to this topic to receive from the environment the reward after each action you take.
`/q_learning/robot_action`	`q_learning/RobotMoveDBToBlock`	Every time you want to execute an action, publish a message to this topic (this is the same topic you'll be subscribing to in the node you write to have your robot execute the actions).

Setting Up the Q-matrix and the Action Matrix

For this project, we will represent \(Q\) as a matrix, where the rows correspond with the possible world states \(s_t\) and the columns represent actions the robot can take \(a_t\). The actions that the robot can take should be organized as follows:

action number (column)	move dumbbell	to block number
0	red	1
1	red	2
2	red	3
3	green	1
4	green	2
5	green	3
6	blue	1
7	blue	2
8	blue	3

There are 64 possible states:

state number (row)	red dumbbell location	green dumbbell location	blue dumbbell location
0	origin	origin	origin
1	block 1	origin	origin
2	block 2	origin	origin
3	block 3	origin	origin
4	origin	block 1	origin
5	block 1	block 1	origin
6	block 2	block 1	origin
7	block 3	block 1	origin
8	origin	block 2	origin
9	block 1	block 2	origin
10	block 2	block 2	origin
11	block 3	block 2	origin
12	origin	block 3	origin
13	block 1	block 3	origin
14	block 2	block 3	origin
15	block 3	block 3	origin
...	...	...	...
63	block 3	block 3	block 3

Where

states 16 - 31 repeat states 0-15, except with the blue dumbbell location at block 1,
states 32 - 47 repeat states 0-15 except with the blue dumbbell location at block 2, and
states 48 - 63 repeat states 0-15 except with the blue dumbbell location at block 3.

In addition to a Q-matrix, you will also construct an action matrix. The rows of the action matrix will represent a starting state \((s_t)\) and the columns of the action matrix will represent the next state \((s_{t+1})\). You will set up this matrix such that \(\textrm{action_matrix}[s_t][s_{t+1}] = a_t \). Let's examine the example of \(\textrm{action_matrix}[0][12] = 5 \). In this case \(s_t = 0\), where all three dumbbells are at the origin, and \(s_{t+1} = 12\), where the red and blue dumbbells are at the origin and the green dumbbell is at block three, and \(a_t = 5\) which is the action corresponding with the robot taking the green dumbbell to block number 3.

All transitions from \(s_t\) to \(s_{t+1}\) that are impossible or invalid should be assigned the value -1. For example, since the robot can only carry one dumbbell at a time, the transition from state 0 to 6 is impossible. Additionally, only one dumbbell can sit in front of one numbered block at a time, so any transition to state 5 (where both the red and green dumbbells are at block 1) is also impossible and should be given the value -1.

The value of setting up this action matrix comes into play when we're executing the \( \textrm{select} \: a_t \: \textrm{at random} \) step of the Q-learning algorithm. In order to select a random action, we take our current state \((s_t)\), and look up the row corresponding with that state in the action matrix. All the values that are not -1 represent valid actions that the robot can take from state \(s_t\). You can then pick one of these at random.

Q-Matrix Convergence

As highlighted in the Q-learning algorithm, you will iterate through the while loop, updating your Q-matrix, until your Q-matrix has converged. What we mean by "convergence" in this context is that your Q-matrix has reached its final form and no more updates or changes will occur to it. It's up to you to determine how to ascertain when your Q-matrix has converged.

Iterating through the Q-Learning Algorithm Quickly

In order to reach convergence of your Q-matrix, you're going to have to run many iterations of having the robot place the dumbbells in front of the numbered blocks. To make this more efficient, we've created a "phantom robot movement" node that moves the dumbbells as if a "phantom robot" were doing it:

$ rosrun q_learning_project phantom_robot_movement.py

This phantom robot movement ROS node subscribes to robot actions given on the /q_learning/robot_action ROS topic, so as long as you're sending robot actions on this topic and have the phantom robot movement node running, you should see something like what's pictured in the following gif.

using phantom robot movement to do q-learning

This phantom robot movement node is designed for your use while you're waiting for your Q-matrix to converge. Once your Q-matrix has converged, you'll need to shut down the phantom robot movement ROS node and get your own ROS node running to have the robot execute the moments by actually picking up and moving the blocks itself.

Extracting Action Commands that Maximize Future Reward

Once your Q-matrix converges, you now have a Q-matrix that contains information about future expected reward for robot actions. You can now use the Q-matrix to make decisions about actions to take that will lead to the highest expected future reward. To do this, take your current state \(s_t\) and look up the corresponding row in your Q-matrix. In this row, find the action (column) that corresponds with the highest Q-value. This is the action that will lead to the highest expected future reward.

The Big Picture

Once you have implemented everything that's outlined above, this is how your program should work:

Launch turtlebot3_intro_robo_manipulation.launch
Run the phantom robot movement ROS node
Run your Q-learning ROS node and keep it running until your Q-matrix converges
Once your Q-matrix has converged, shut down the phantom robot movement ROS node and start up your own ROS node that executes the robot action commands to pick up the dumbbells and place them in front of the numbered blocks
Extract an action sequence from your Q-matrix that will maximize the reward that the robot expects to receive in the future and send the action commands to have your robot execute these actions until the goal state is reached

Acknowledgments

The design of this course project was influenced by Brian Scassellati and his Intelligent Robotics course taught at Yale University. Also, I want to thank my sister, Rachel Strohkorb, for creating the custom dumbbell model for our use in the Gazebo simulator.