Lab 5: LLM-Based Human-Robot Dialogue


Learning Goals


To Complete Before Lab


Preparing Your Development Environment

Beyond the steps you've already taken in Lab 3, this lab requires a few additional setup steps. Follow these in order.

  1. Python version: Ensure Python >= 3.10. Check with:
    python3 --version
    If needed, install a newer Python and recreate your virtual environment.
  2. Create your Lab 5 folder: Inside hri_course_misty_programming, create a folder named lab_5_LLM_based_human_robot_dialogue for the lab code:
    cd hri_course_misty_programming
    mkdir lab_5_LLM_based_human_robot_dialogue
  3. Choose your speech-to-text provider: This lab provides two versions of the main code file. Try the Rev.ai version first; if it does not work on your machine, switch to the Whisper version.
    • llm_based_human_robot_dialogue_revai.py — streams mic audio to Rev.ai via WebSocket. Lower latency, but the WebSocket SDK can have SSL handshake issues on some macOS setups.
    • llm_based_human_robot_dialogue_whisper.py — records a full utterance locally, then sends a WAV file to OpenAI Whisper. Simpler and more portable, with slightly more latency at the end of each utterance.
    If the Rev.ai version crashes with SystemError: new style getargs format but argument is not a tuple on your machine, switch to the Whisper version.
  4. Install additional Python packages: Activate your virtual environment and run:
    source venv/bin/activate
    pip install pyaudio
    pip install google-genai
    pip install openai
    pip install mutagen
    If you're using the Rev.ai version of the code, also run:
    pip install rev-ai
  5. Store API keys in a .env file: Create a file named .env inside hri_course_misty_programming and paste the Gemini and OpenAI API keys you were provided via email/Canvas. If you're using the Rev.ai version, also add your REVAI_ACCESS_TOKEN.
  6. Add the starter code: Inside hri_course_misty_programming either use our template repo or manually copy the starter files into lab_5_LLM_based_human_robot_dialogue.
  7. Use this folder structure:
    hri_course_misty_programming/
    ├── venv/
    ├── Python-SDK/
    ├── lab_3_misty_introduction/
    │   └── misty_introduction.py
    ├── lab_4_misty_woz_gui/
    │   └── lab_4_misty_woz_gui.py
    └── lab_5_LLM_based_human_robot_dialogue/
        ├── llm_based_human_robot_dialogue_revai.py      # Rev.ai version
        ├── llm_based_human_robot_dialogue_whisper.py    # Whisper version
        ├── three_good_things_system_instruction.txt
        └── test_dependencies.py
    Important: Python-SDK and your lab folders should be directly inside hri_course_misty_programming. Do not put the SDK inside your lab folder.
  8. Test your setup: From within your lab folder (with the virtualenv activated) run:
    python3 test_dependencies.py
    If everything is configured correctly, the script should exit without errors.
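
For a sense of what a setup check like this involves, here is a minimal sketch. It is not the contents of test_dependencies.py. The environment variable names GEMINI_API_KEY and OPENAI_API_KEY are assumptions (only REVAI_ACCESS_TOKEN is named in this handout), and the sketch assumes the values from your .env file have already been loaded into the environment.

```python
import importlib.util
import os
import sys

# Import names for the packages installed in step 4
# (add "rev_ai" if you are using the Rev.ai version).
REQUIRED_MODULES = ["pyaudio", "google.genai", "openai", "mutagen"]
# Assumed variable names; check your .env file for the real ones.
REQUIRED_ENV_VARS = ["GEMINI_API_KEY", "OPENAI_API_KEY"]


def module_available(name):
    """Return True if `name` can be located on the import path."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `google`) is itself missing.
        return False


def find_setup_problems(modules, env_vars, environ=None, version=None):
    """Collect human-readable descriptions of anything that looks misconfigured."""
    environ = os.environ if environ is None else environ
    version = sys.version_info if version is None else version
    problems = []
    if tuple(version[:2]) < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    problems += [f"missing package: {m}" for m in modules if not module_available(m)]
    problems += [f"missing environment variable: {v}" for v in env_vars if not environ.get(v)]
    return problems


if __name__ == "__main__":
    for problem in find_setup_problems(REQUIRED_MODULES, REQUIRED_ENV_VARS):
        print(problem)
```

An empty list means nothing obvious is missing; it does not confirm that the API keys themselves are valid.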

Potential Issues


Working in Groups


During this lab, you will work with the same group that you worked with for Lab 4. Similar to Lab 4, each group will turn in one piece of code / set of deliverables.

Lab 5 Deliverables & Submission


With the starter code we've provided, in Lab 5 you are expected to:

You are expected to upload the following to Canvas after you have completed the lab:

To receive credit for this lab, you will need to submit your video and code to Canvas by Thursday, April 23, 2026 at 11:59pm.

Running the Code


  1. Activate your virtual environment: source venv/bin/activate
  2. Run an HTTP server from your hri_course_misty_programming directory: python3 -m http.server
    • This is required because the Misty robot needs to access the speech files generated using OpenAI in order to play them on the robot.
    • If port 8000 is already in use, you can specify a different port (e.g., python3 -m http.server 8080) and update the HTTP_SERVER_PORT variable at the top of your dialogue file to match.
  3. Run the code — pick one based on which speech-to-text provider you chose:
    • python3 llm_based_human_robot_dialogue_revai.py MISTY_IP_ADDRESS
    • python3 llm_based_human_robot_dialogue_whisper.py MISTY_IP_ADDRESS
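
To see why the HTTP server and the HTTP_SERVER_PORT variable have to agree, it may help to picture how a URL for a generated speech file could be assembled. The sketch below is illustrative only: build_audio_url, the IP address, and the file name speech.mp3 are hypothetical and are not taken from the starter code.

```python
from urllib.parse import quote

HTTP_SERVER_PORT = 8000  # must match the port passed to `python3 -m http.server`


def build_audio_url(host_ip, relative_path, port=HTTP_SERVER_PORT):
    """Build the URL at which a file under the served directory is reachable."""
    # Normalize Windows-style separators and drop any leading slash.
    clean_path = relative_path.replace("\\", "/").lstrip("/")
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return f"http://{host_ip}:{port}/{quote(clean_path)}"


# Hypothetical laptop IP and file name, relative to hri_course_misty_programming.
print(build_audio_url("192.168.1.20", "lab_5_LLM_based_human_robot_dialogue/speech.mp3"))
```

Because the server is started from hri_course_misty_programming, paths in the URL are relative to that directory, which is why the server must be launched from there.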

An Overview of the Starter Code


The starter code contains several files:

Talking Back-and-Forth with Misty: Speech-to-Text, Text Generation, Text-to-Speech

You do not need to understand the dialogue code in detail to complete this lab, but for those interested, here is a brief overview of how it enables Misty to hold a back-and-forth conversation with a person. Each conversational turn consists of three main steps: speech-to-text, text generation, and text-to-speech.

Speech-to-text: This lab provides two interchangeable implementations for transcribing the human participant's speech to text. Both begin by turning Misty's LED blue and opening a local microphone stream using PyAudio in start_listening(); they differ in how the audio reaches the transcription service.
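
As a rough illustration of the Whisper-style flow (record first, then upload a WAV file), the sketch below packs raw microphone chunks into a WAV container and sends it for transcription. It is not the starter code: the function names, the 16 kHz mono 16-bit format, and the file name utterance.wav are assumptions made for this example.

```python
import io
import wave

SAMPLE_RATE = 16000   # assumed capture rate in Hz
SAMPLE_WIDTH = 2      # bytes per sample (16-bit PCM)
CHANNELS = 1          # mono


def frames_to_wav_bytes(frames):
    """Pack raw PCM chunks (as read from a microphone stream) into WAV bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(CHANNELS)
        wav_file.setsampwidth(SAMPLE_WIDTH)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(b"".join(frames))
    return buffer.getvalue()


def transcribe(wav_bytes):
    """Send a WAV recording to OpenAI's transcription endpoint and return the text."""
    from openai import OpenAI  # imported lazily so the helper above works without it

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=("utterance.wav", wav_bytes),  # (file name, content) pair
    )
    return result.text
```

The whole utterance has to be recorded before transcribe() can be called, which is the source of the extra end-of-utterance latency compared with streaming.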

Text generation: The code in this lab uses Gemini's text generation chat model via the new google-genai SDK, which supports multi-turn conversations. The chat session is initialized using client.chats.create() with gemini-2.5-flash in the __init__ method of the starter code. Text generation then occurs inside execute_human_robot_dialogue() via chat.send_message().
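
The two calls named above can be sketched as follows. This is a minimal illustration rather than the starter code: create_chat and converse are hypothetical helper names, and GEMINI_API_KEY is an assumed environment variable name.

```python
import os


def create_chat(system_instruction, model="gemini-2.5-flash"):
    """Open a multi-turn Gemini chat session configured with a system instruction."""
    from google import genai          # imported lazily; needs `pip install google-genai`
    from google.genai import types

    # GEMINI_API_KEY is an assumed variable name; use the one in your .env file.
    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    return client.chats.create(
        model=model,
        config=types.GenerateContentConfig(system_instruction=system_instruction),
    )


def converse(chat, user_turns):
    """Send each user turn to the chat in order and collect the model's replies."""
    replies = []
    for turn in user_turns:
        response = chat.send_message(turn)  # the chat object keeps the history
        replies.append(response.text)
    return replies
```

Because the chat object stores earlier turns, each call to send_message() is answered in the context of the whole conversation so far.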

Text-to-speech: The text generated by the Gemini model is then converted to speech using OpenAI's text-to-speech API. This conversion occurs inside execute_human_robot_dialogue() in the starter code, and the resulting audio file is then played on the robot.

Prompt Engineering


The primary focus of this lab will be on prompt engineering. In the three_good_things_system_instruction.txt file, you will find a system instruction that is used to prompt the Gemini model to generate text for Misty. Right now, the system instruction guides the behavior of a robot receptionist in the CS department at UChicago. You will need to modify this system instruction to enable Misty to guide a human participant through the "Three Good Things" exercise.

If you want to test your system prompt independently of the Misty robot, you can do so by running gen_ai_test.py from the starter code in your terminal. This lets you interact with the model through text only, so you can iterate on your prompt more quickly.

As a reminder, here is the desired interaction flow for the "Three Good Things" positive psychology exercise:

Robot Expressions


For this lab, you are asked to develop 5 additional custom actions for the robot. To develop these custom actions, we recommend you check out the following resources:

Your new robot expressions should be added both to the custom_actions dictionary in your dialogue file and to the list inside the <your_expression> tag in three_good_things_system_instruction.txt. The rest of this section explains how the robot expressions are executed within the starter code.

How the Robot Expressions Work in the Starter Code

In the starter code, we have defined four robot expressions, called actions in the Misty SDK, in the custom_actions dictionary at the top of your dialogue file:

custom_actions = {
    "reset": "IMAGE:e_DefaultContent.jpg; ARMS:40,40,1000; HEAD:-5,0,0,1000;",
    "head-up-down-nod": "IMAGE:e_DefaultContent.jpg; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-5,0,0,500; PAUSE:500;",
    "hi": "IMAGE:e_Admiration.jpg; ARMS:-80,40,100;",
    "listen": "IMAGE:e_Surprise.jpg; HEAD:-6,30,0,1000; PAUSE:2500; HEAD:-5,0,0,500; IMAGE:e_DefaultContent.jpg;"
}
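
When writing your own action strings, a small checker can catch typos before you try them on the robot. The sketch below is a hypothetical helper, not part of the starter code or the Misty SDK. The argument counts are inferred only from the four starter actions above, so treat them as assumptions; commands not listed are passed through unchecked.

```python
# Number of comma-separated arguments each command takes, inferred from the
# four starter actions only; the Misty SDK supports more commands than these.
EXPECTED_ARG_COUNTS = {"IMAGE": 1, "ARMS": 3, "HEAD": 4, "PAUSE": 1}


def parse_action(script):
    """Split an action string into (command, [args]) pairs, checking arg counts."""
    steps = []
    for chunk in script.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue  # skip the empty piece after the final ';'
        command, _, raw_args = chunk.partition(":")
        args = [arg.strip() for arg in raw_args.split(",")]
        expected = EXPECTED_ARG_COUNTS.get(command)
        if expected is not None and len(args) != expected:
            raise ValueError(
                f"{command} expects {expected} argument(s), got {len(args)}: {chunk!r}"
            )
        steps.append((command, args))
    return steps


print(parse_action("IMAGE:e_Admiration.jpg; ARMS:-80,40,100;"))
# [('IMAGE', ['e_Admiration.jpg']), ('ARMS', ['-80', '40', '100'])]
```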

While the actions are defined in string format in the custom_actions dictionary, they are registered on the Misty robot inside MistyRobot.__init__() in the starter code. When the Gemini model (self.chat) generates a text response for Misty to speak, it also generates the name of a matching expression for the robot (e.g., "hi", "listen"), which is then parsed from the JSON response inside execute_human_robot_dialogue().

The Gemini model can generate these expressions because the list of expressions the robot can execute is provided in the system instruction (three_good_things_system_instruction.txt):

<your_expression>
Your expression should be one of the ones from this list. 
These expressions can represent how you are feeling or be a reaction to what the student has said.
Please refrain from choosing an expression multiple times in a row: [
'head-up-down-nod',
'hi',
'listen'
]
</your_expression>
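
Since every new expression has to appear in both places, one way to avoid mismatches is to generate the bracketed list from the dictionary keys and paste it into the system instruction. The helper below is a hypothetical convenience, not part of the starter code. It leaves out "reset" because the starter list omits it, on the assumption that this omission is intentional.

```python
def render_expression_list(custom_actions, exclude=("reset",)):
    """Format the action names as the bracketed list used in the system instruction."""
    names = [name for name in custom_actions if name not in exclude]
    body = ",\n".join(f"'{name}'" for name in names)
    return f"[\n{body}\n]"


# Action bodies are irrelevant here, so empty strings stand in for them.
starter_actions = {"reset": "", "head-up-down-nod": "", "hi": "", "listen": ""}
print(render_expression_list(starter_actions))
```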

After the expression is generated by the Gemini chat model, it is looked up in custom_actions and executed on the robot via self.misty.start_action() inside execute_human_robot_dialogue() in the starter code.
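
The parse-then-look-up step can be sketched as below. This is an illustration under assumptions, not the starter code: choose_action is a hypothetical name, and the JSON keys "msg" and "expression" are placeholders, so check your system instruction for the keys the model is actually asked to produce. Falling back to "reset" for unknown expressions is also a design choice made for this example.

```python
import json


def choose_action(raw_response, custom_actions, fallback="reset"):
    """Return (speech_text, action_name) from a JSON model response."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        # Treat unparseable output as plain speech with the fallback action.
        return raw_response.strip(), fallback
    if not isinstance(data, dict):
        return raw_response.strip(), fallback

    text = str(data.get("msg", "")).strip()          # "msg" is an assumed key
    expression = data.get("expression")              # "expression" is an assumed key
    # Guard against names the model invented that were never registered.
    if not isinstance(expression, str) or expression not in custom_actions:
        expression = fallback
    return text, expression
```

Validating the name before calling self.misty.start_action() means a misspelled or invented expression degrades to a neutral pose instead of raising an error mid-conversation.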

OpenAI Voices


The final component of your assignment is exploring the voice options from OpenAI. In your dialogue file, the text-to-speech call inside execute_human_robot_dialogue() in the starter code looks like this:

# OpenAI text-to-speech: generating speech and saving to a file
with self.openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    instructions="Speak with a calm and encouraging tone.",
) as response:
    response.stream_to_file(self.speech_file_path_local)

You will need to replace the voice and instructions parameters with your own selection. You can play around with the available voices and instructions for the voices at https://www.openai.fm/.
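
If you would like to compare several voices offline before settling on one, a small script along these lines could generate one sample file per voice. It is a sketch, not starter code: synthesize_speech, sample_paths, and the voice_samples directory are hypothetical, and the text to speak is passed through the API's input parameter.

```python
def synthesize_speech(client, text, out_path, voice="alloy",
                      instructions="Speak with a calm and encouraging tone."):
    """Generate speech for `text` with the given voice and save it to `out_path`."""
    with client.audio.speech.with_streaming_response.create(
        model="gpt-4o-mini-tts",
        voice=voice,
        input=text,
        instructions=instructions,
    ) as response:
        response.stream_to_file(out_path)
    return out_path


def sample_paths(voices, directory="voice_samples"):
    """Map each voice name to the file its sample would be written to."""
    return {voice: f"{directory}/{voice}.mp3" for voice in voices}


if __name__ == "__main__":
    import os
    from openai import OpenAI  # reads OPENAI_API_KEY from the environment

    if os.environ.get("OPENAI_API_KEY"):
        os.makedirs("voice_samples", exist_ok=True)  # output directory must exist
        client = OpenAI()
        for voice, path in sample_paths(["alloy", "coral", "sage"]).items():
            synthesize_speech(client, "Tell me one good thing from today.", path, voice=voice)
```

Listening to the same sentence in each voice makes it easier to judge which voice and instructions suit the "Three Good Things" exercise.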