Lab 5: LLM-Based Human-Robot Dialogue

Learning Goals

Students will gain exposure to programming Misty to talk adaptively with a human participant leveraging Rev.ai or OpenAI Whisper (speech-to-text), Gemini (text generation), and OpenAI (text-to-speech).
Students will learn prompt-engineering techniques for the text-generation model so Misty can guide the "Three Good Things" exercise.
Students will gain experience programming Misty expressions to make the robot appear dynamic during conversation.
Students will explore OpenAI voice options and TTS settings for the robot.

To Complete Before Lab

Preparing Your Development Environment

Beyond the steps you've already taken in Lab 3, this lab requires a few additional setup steps. Follow these in order.

Python version: Ensure Python >= 3.10. Check with:
```
python3 --version
```
If needed, install a newer Python and recreate your virtual environment.
Create your Lab 5 folder: Inside hri_course_misty_programming, create a folder named lab_5_LLM_based_human_robot_dialogue for the lab code:
```
cd hri_course_misty_programming
mkdir lab_5_LLM_based_human_robot_dialogue
```
Choose your speech-to-text provider: This lab provides two versions of the main code file. Try the Rev.ai version first; if it does not work on your machine, use the Whisper version.
- llm_based_human_robot_dialogue_revai.py — streams mic audio to Rev.ai via WebSocket. Lower latency, but the WebSocket SDK can have SSL handshake issues on some macOS setups.
- llm_based_human_robot_dialogue_whisper.py — records a full utterance locally, then sends a WAV file to OpenAI Whisper. Simpler and more portable, with slightly more latency at the end of each utterance.
If the Rev.ai version crashes with SystemError: new style getargs format but argument is not a tuple on your machine, switch to the Whisper version.

Install additional Python packages: Activate your virtual environment and run:

source venv/bin/activate
pip install pyaudio
pip install google-genai
pip install openai
pip install mutagen

If you're using the Rev.ai version of the code, also run:

pip install rev-ai

Store API keys in a .env file: Create a file named .env inside hri_course_misty_programming and paste the Gemini and OpenAI API keys you were provided via email/Canvas. If you're using the Rev.ai version, also add your REVAI_ACCESS_TOKEN.
Add the starter code: Inside hri_course_misty_programming, either use our template repo or manually copy the starter files into lab_5_LLM_based_human_robot_dialogue.

Use this folder structure:

hri_course_misty_programming/
  ├── venv/
  ├── Python-SDK/
  ├── lab_3_misty_introduction/
  │   └── misty_introduction.py
  ├── lab_4_misty_woz_gui/
  │   └── lab_4_misty_woz_gui.py
  └── lab_5_LLM_based_human_robot_dialogue/
      ├── llm_based_human_robot_dialogue_revai.py      # Rev.ai version
      ├── llm_based_human_robot_dialogue_whisper.py    # Whisper version
      ├── three_good_things_system_instruction.txt
      └── test_dependencies.py

Important: Python-SDK and your lab folders should be directly inside hri_course_misty_programming. Do not put the SDK inside your lab folder.

Test your setup: From within your lab folder (with the virtualenv activated) run:
```
python3 test_dependencies.py
```
If everything is configured correctly, the script should exit without errors.
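
If the dependency test complains about a missing API key, it can help to inspect the .env file directly. The sketch below is not part of the starter code: it parses simple KEY=VALUE lines with the standard library and reports which expected names are absent or empty. The names GEMINI_API_KEY and OPENAI_API_KEY are assumptions, so match them to whatever your dialogue file actually reads; add REVAI_ACCESS_TOKEN to the list if you use the Rev.ai version.

```python
from pathlib import Path

# Assumed variable names; adjust to the names your dialogue file reads.
EXPECTED_KEYS = ["GEMINI_API_KEY", "OPENAI_API_KEY"]


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env_values, expected=EXPECTED_KEYS):
    """Return the expected names that are absent or have an empty value."""
    return [name for name in expected if not env_values.get(name)]


if __name__ == "__main__":
    env_path = Path(".env")  # run this from hri_course_misty_programming
    if env_path.exists():
        print("Missing keys:", missing_keys(parse_env_text(env_path.read_text())))
    else:
        print("No .env file found in", Path.cwd())
```

An empty list means every expected name has a value; it does not confirm that the keys themselves are valid.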

Potential Issues

PyAudio fails to install on Mac (portaudio.h file not found): PyAudio requires PortAudio to be installed at the system level before pip can build it. Install PortAudio via Homebrew first, then retry:
```
brew install portaudio
pip install pyaudio
```
If you don't have Homebrew installed, install it first:
```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```

Python version is 3.9: This lab is designed for Python >= 3.10. If you hit unexplained errors installing dependencies, upgrade to Python 3.11 and recreate your virtual environment:

deactivate
brew install python@3.11
cd ~/hri_course_misty_programming
rm -rf venv
python3.11 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install google-genai openai python-dotenv mutagen pyaudio requests

If the Misty Python-SDK folder contains a requirements.txt, reinstall its dependencies too:

cd Python-SDK
pip install -r requirements.txt
cd ..

Microphone not picking up audio: PyAudio's default sample rate may not match your system's input device. You can check what your system expects by running the following in a Python shell:
```
import pyaudio
p = pyaudio.PyAudio()
print(p.get_default_input_device_info())
```
If the reported defaultSampleRate is different from 16000 (e.g., 44100 on most Macs), update the AUDIO_RATE variable at the top of your dialogue file to match.
Misty never stops listening, or cuts you off mid-sentence (Whisper version only): The Whisper version uses a volume-based threshold to detect the end of your speech. If Misty never stops listening, the microphone may be picking up background noise — try raising SILENCE_THRESHOLD at the top of llm_based_human_robot_dialogue_whisper.py from 500 to 1000 or 1500. If it cuts you off before you're done talking, either lower SILENCE_THRESHOLD (to 200 or 300) or raise SILENCE_DURATION from 2.0 to 3.0.
Rev.ai crashes with SystemError: new style getargs format but argument is not a tuple: This is a known SSL handshake bug in the websocket-client library that rev-ai depends on. It affects some macOS Python installations regardless of Python version. The Rev.ai version of the starter code includes a monkeypatch that resolves this for most students, but if you still hit the error after running it, switch to llm_based_human_robot_dialogue_whisper.py instead — the Whisper version does not use WebSockets and is not affected by this bug.
Port 8000 is already in use: If the HTTP server fails to start because port 8000 is busy, specify a different port and update the code to match:
```
python3 -m http.server 8080
```
Then set HTTP_SERVER_PORT = 8080 at the top of your dialogue file.
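
For the port conflict described above, one way to find a usable port before starting the HTTP server is to check whether anything is already listening. This is a standalone sketch using only the standard library; the candidate ports are arbitrary examples, and the helper names are illustrative rather than part of the starter code.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 when a server accepts the connection.
        return sock.connect_ex((host, port)) != 0


def first_free_port(candidates=(8000, 8080, 8888)):
    """Return the first candidate port with no listener, or None."""
    for port in candidates:
        if port_is_free(port):
            return port
    return None


if __name__ == "__main__":
    print("Suggested port:", first_free_port())
```

Whatever port you settle on, pass it to python3 -m http.server and set HTTP_SERVER_PORT to the same value.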

Working in Groups

During this lab, you will work with the same group that you worked with for Lab 4. Similar to Lab 4, each group will turn in one piece of code / set of deliverables.

Lab 5 Deliverables & Submission

With the starter code we've provided, in Lab 5 you are expected to:

Prompt Engineering: Modify three_good_things_system_instruction.txt to enable the Misty robot to guide a human participant through the "Three Good Things" exercise.
Misty expressions: Define 5 additional Misty robot expressions in the custom_actions dictionary of your dialogue file.
OpenAI voice: Select a different OpenAI voice for Misty and set it in the OpenAI text-to-speech call in your dialogue file.

You are expected to upload the following to Canvas after you have completed the lab:

A 30-60 second video of Misty engaging in the "Three Good Things" exercise
Your source code: your chosen dialogue file (either llm_based_human_robot_dialogue_revai.py or llm_based_human_robot_dialogue_whisper.py)
Your system instruction: three_good_things_system_instruction.txt

To receive credit for this lab, you will need to submit your video and code to Canvas by Thursday, April 23, 2026 at 11:59pm.

Running the Code

Activate your virtual environment: source venv/bin/activate
Run an HTTP server from your hri_course_misty_programming directory: python3 -m http.server
- This is required because the Misty robot needs to access the speech files generated using OpenAI in order to play them on the robot.
- If port 8000 is already in use, you can specify a different port (e.g., python3 -m http.server 8080) and update the HTTP_SERVER_PORT variable at the top of your dialogue file to match.
Run the code — pick one based on which speech-to-text provider you chose:
- python3 llm_based_human_robot_dialogue_revai.py MISTY_IP_ADDRESS
- python3 llm_based_human_robot_dialogue_whisper.py MISTY_IP_ADDRESS

An Overview of the Starter Code

The starter code contains several files:

The main code required to run the lab (pick one):
- llm_based_human_robot_dialogue_revai.py - Rev.ai streaming version
- llm_based_human_robot_dialogue_whisper.py - OpenAI Whisper record-and-transcribe version
- three_good_things_system_instruction.txt - the system instruction for the Gemini generative text model
Code for testing separate lab components:
- test_dependencies.py - used to test the dependency packages and API keys required for this lab
- test_custom_actions.py - used to test the custom actions you will develop for Misty
- gen_ai_test.py - used to test the Gemini generative text model based on the system instruction (three_good_things_system_instruction.txt) without needing to be connected to or run anything on the robot

Talking Back-and-Forth with Misty: Speech-to-Text, Text Generation, Text-to-Speech

While it is not required to know how the dialogue code works in detail for the purposes of completing this lab, I want to provide a brief overview for those interested in how it enables Misty to have a back-and-forth conversation with a person. This conversation consists of three main steps: speech-to-text, text generation, and text-to-speech.

Speech-to-text: This lab provides two interchangeable implementations for transcribing the human participant's speech to text. Both begin by turning Misty's LED blue and opening a local microphone stream using PyAudio in start_listening(); they differ in how the audio reaches the transcription service.

The Rev.ai version (llm_based_human_robot_dialogue_revai.py) streams mic audio chunks to Rev.ai's streaming API over a WebSocket as you speak, and Rev.ai returns partial and final hypotheses in real time. Once a final hypothesis is followed by a silence timeout, the transcript is stored in self.current_transcript.
The Whisper version (llm_based_human_robot_dialogue_whisper.py) records locally, watching the volume of each chunk: once it detects speech followed by a configurable duration of silence, it stops recording, saves the audio as a WAV file, and sends it to OpenAI's Whisper API in a single HTTP request. The returned transcript is stored in self.current_transcript.
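
The end-of-speech logic described for the Whisper version can be sketched as follows. This is an illustration under stated assumptions, not the starter code: it assumes 16-bit signed PCM chunks and measures volume as RMS amplitude, while the starter code may measure volume differently. SILENCE_THRESHOLD and SILENCE_DURATION mirror the defaults mentioned in Potential Issues; chunk_rms and utterance_finished are hypothetical helper names.

```python
import array
import math

SILENCE_THRESHOLD = 500   # volume below this counts as silence
SILENCE_DURATION = 2.0    # seconds of trailing silence that end the utterance


def chunk_rms(chunk):
    """Root-mean-square amplitude of a chunk of 16-bit signed PCM bytes."""
    samples = array.array("h", chunk)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def utterance_finished(chunks, chunk_seconds,
                       threshold=SILENCE_THRESHOLD,
                       silence_duration=SILENCE_DURATION):
    """True once speech has been heard and is followed by enough silence."""
    heard_speech = False
    silent_time = 0.0
    for chunk in chunks:
        if chunk_rms(chunk) >= threshold:
            heard_speech = True
            silent_time = 0.0          # any speech resets the silence timer
        elif heard_speech:
            silent_time += chunk_seconds
            if silent_time >= silence_duration:
                return True
    return False
```

Silence before the first loud chunk is ignored, which is why background noise above the threshold keeps a recorder like this listening indefinitely, and why raising the threshold helps in that case.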

Text generation: The code in this lab uses Gemini's text generation chat model via the new google-genai SDK, allowing for multi-turn conversations. The chat session is initialized using client.chats.create() with gemini-2.5-flash in the __init__ method of the starter code. Text generation occurs inside execute_human_robot_dialogue() via chat.send_message().

Text-to-Speech: The text generated by the Gemini model is then converted to speech using OpenAI's text-to-speech API. This conversion occurs inside execute_human_robot_dialogue() in the starter code, and the resulting audio file is then played on the robot.
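
Because Misty fetches the generated audio from the HTTP server you start in your hri_course_misty_programming directory, the robot needs a URL rather than a local file path. The sketch below shows one way such a URL could be formed. It is an assumption about the general approach, not a copy of the starter code: speech_file_url is a hypothetical helper, and the laptop IP address must be one that Misty can reach on your network.

```python
from pathlib import Path
from urllib.parse import quote

HTTP_SERVER_PORT = 8000  # default port of `python3 -m http.server`


def speech_file_url(speech_file, server_root, laptop_ip, port=HTTP_SERVER_PORT):
    """Build the URL at which the HTTP server exposes a local file.

    Raises ValueError if the file is not inside the served directory.
    """
    root = Path(server_root).resolve()
    relative = Path(speech_file).resolve().relative_to(root)
    # Percent-encode spaces and other unsafe characters in the path.
    return f"http://{laptop_ip}:{port}/{quote(relative.as_posix())}"
```

The ValueError is a useful signal: a speech file saved outside the directory where the server was started cannot be reached by the robot.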

Prompt Engineering

The primary focus of this lab will be on prompt engineering. In the three_good_things_system_instruction.txt file, you will find a system instruction that is used to prompt the Gemini model to generate text for Misty. Right now, the system instruction guides the behavior of a robot receptionist in the CS department at UChicago. You will need to modify this system instruction to enable Misty to guide a human participant through the "Three Good Things" exercise.

If you want to test your system prompt independently from the Misty robot, you can do so by running gen_ai_test.py from the starter code in your terminal. This will allow you to communicate with the model only with text, enabling you to develop more quickly.

As a reminder, here is the desired interaction flow for the "Three Good Things" positive psychology exercise:

Introduction: The robot introduces itself and the "Three Good Things" exercise to the human participant. Your introduction should include a bit of informal chit-chat where, for example, the robot may ask for the participant's name, ask them how they're doing, etc.
Robot Disclosure #1: The robot will start the exercise by sharing one thing that it is grateful for. Then, it will prompt the participant to share one thing they're grateful for.
Participant Disclosure #1: The human participant shares one thing they are grateful for.
Robot Response to Participant Disclosure #1: The robot responds in 1 sentence (or so) to what the human participant has shared.
Robot Disclosure #2: The robot shares a second thing it is grateful for.
Participant Disclosure #2: The human participant shares a second thing they are grateful for.
Robot Response to Participant Disclosure #2: The robot responds to what the human participant has shared.
Robot Disclosure #3: The robot shares a third thing it is grateful for.
Participant Disclosure #3: The human participant shares a third thing they are grateful for.
Robot Response to Participant Disclosure #3: The robot responds to what the human participant has shared.
Robot Conclusion: The robot concludes the interaction, thanks the participant for their participation, and says goodbye.

Robot Expressions

For this lab, you are asked to develop 5 additional custom actions for the robot. To develop these custom actions, we recommend you check out the following resources:

The list of possible action commands found in the Misty SDK documentation.
The test_custom_actions.py file in the starter code. This file will allow you to test just your custom actions without needing to run the whole robot "Three Good Things" exercise.

Your new robot expressions should be added to the custom_actions dictionary in your dialogue file and in the <your_expression> tag within three_good_things_system_instruction.txt. The rest of this section delves into how the robot expressions are executed within the starter code.

How the Robot Expressions Work in the Starter Code

In the starter code, we have defined four robot expressions, called actions in the Misty SDK, in the custom_actions dictionary at the top of your dialogue file:

custom_actions = {
    "reset": "IMAGE:e_DefaultContent.jpg; ARMS:40,40,1000; HEAD:-5,0,0,1000;",
    "head-up-down-nod": "IMAGE:e_DefaultContent.jpg; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-15,0,0,500; PAUSE:500; HEAD:5,0,0,500; PAUSE:500; HEAD:-5,0,0,500; PAUSE:500;",
    "hi": "IMAGE:e_Admiration.jpg; ARMS:-80,40,100;",
    "listen": "IMAGE:e_Surprise.jpg; HEAD:-6,30,0,1000; PAUSE:2500; HEAD:-5,0,0,500; IMAGE:e_DefaultContent.jpg;"
}

While the actions are defined in string format in the custom_actions dictionary, they are registered on the Misty robot inside MistyRobot.__init__() in the starter code. When the Gemini model (self.chat) generates a text response for Misty to speak, it also generates an action expression for the robot that corresponds with that text (e.g., "hi", "listen"), which is then parsed from the JSON response inside execute_human_robot_dialogue().
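
When writing your own action strings, a small checker can catch formatting slips before you try them on the robot. The sketch below is a hypothetical helper, not part of the Misty SDK or the starter code. It relies only on the pattern visible in the examples above (steps separated by semicolons, each written as COMMAND:arguments) and assumes PAUSE values are in milliseconds.

```python
def parse_action_script(script):
    """Split an action string into (command, [arguments]) pairs."""
    steps = []
    for part in script.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the trailing ';'
        command, sep, args = part.partition(":")
        if not sep:
            raise ValueError(f"Missing ':' in step {part!r}")
        steps.append((command.strip(), [a.strip() for a in args.split(",")]))
    return steps


def total_pause_ms(script):
    """Sum the PAUSE durations in an action string (assumed milliseconds)."""
    return sum(
        int(args[0])
        for command, args in parse_action_script(script)
        if command == "PAUSE"
    )


if __name__ == "__main__":
    listen = ("IMAGE:e_Surprise.jpg; HEAD:-6,30,0,1000; PAUSE:2500; "
              "HEAD:-5,0,0,500; IMAGE:e_DefaultContent.jpg;")
    for command, args in parse_action_script(listen):
        print(command, args)
```

This only checks the shape of the string; whether a command name or argument range is valid still has to be confirmed against the Misty SDK documentation or by running test_custom_actions.py.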

These expressions can be generated by the Gemini model because the list of expressions the robot can execute is provided in the system instruction (three_good_things_system_instruction.txt):

<your_expression>
Your expression should be one of the ones from this list. 
These expressions can represent how you are feeling or be a reaction to what the student has said.
Please refrain from choosing an expression multiple times in a row: [
'head-up-down-nod',
'hi',
'listen'
]
</your_expression>

After the expression is generated by the Gemini chat model, it is looked up in custom_actions and executed on the robot via self.misty.start_action() inside execute_human_robot_dialogue() in the starter code.
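
The parse-and-look-up step can be sketched as below. The JSON field names "text" and "expression" are assumptions for illustration, as is the choice to fall back to "reset" when the model returns something unexpected; check your dialogue file for the names and behavior it actually uses.

```python
import json

# Keys of the starter custom_actions dictionary; extend with your own.
KNOWN_EXPRESSIONS = {"reset", "head-up-down-nod", "hi", "listen"}


def parse_model_reply(raw_reply, known=KNOWN_EXPRESSIONS, fallback="reset"):
    """Return (text, expression) from a JSON reply, with safe fallbacks."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        # Not valid JSON: speak the raw text and use the fallback expression.
        return raw_reply.strip(), fallback
    if not isinstance(data, dict):
        return str(data), fallback
    text = str(data.get("text", "")).strip()
    expression = data.get("expression")
    # Guard against expressions that were never defined in custom_actions.
    if not isinstance(expression, str) or expression not in known:
        expression = fallback
    return text, expression
```

A guard like this matters once you add your own expressions: if a name appears in the system instruction but not in custom_actions (or is misspelled in either place), the lookup would otherwise fail in the middle of a conversation.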

OpenAI Voices

The final component for your assignment is exploring the voice options from OpenAI. In your dialogue file, the text-to-speech call inside execute_human_robot_dialogue() in the starter code looks like this:

# OpenAI text-to-speech: generating speech and saving to a file
with self.openai_client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=text_to_speak,  # required: the text for Misty to say (variable name is illustrative)
    instructions="Speak with a calm and encouraging tone.",
) as response:
    response.stream_to_file(self.speech_file_path_local)

You will need to replace the voice and instructions parameters with your own selection. You can play around with the available voices and instructions for the voices at https://www.openai.fm/.