Matt, the Clubhouse bot
Background
In January 2021, nine months after the beginning of the pandemic of our lifetime, people decided that what they really needed was to have more meetings on their phones, and an obscure social network called Clubhouse became super popular for a brief couple of months.
Clubhouse allows users to create and join “rooms”, which are long-running voice-only conference calls. Each room has “the stage”: one or several users who are allowed to speak; everyone else is in “the audience” and can only listen. Audience members can use the “raise hand” feature to request speaking permissions, and room moderator(s) can add and remove people from the stage.
Text chat bots have been a thing for a long time, and I got curious whether it was possible to create a believable voice chat bot using publicly available APIs.
Overview
Please meet Matt: the Clubhouse bot.
Matt is an automatic Clubhouse room moderator. You can join a room managed by Matt, raise your hand and get the stage for one minute to speak. After one minute you will be moved back to the audience, and another speaker will be chosen randomly from the audience members who have their hands raised. Matt will occasionally join the conversation as well.
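To make the rotation concrete, here is a minimal sketch in Go of how the next speaker could be picked. It is an illustration rather than the actual housebot code: the `Participant` type, the `pickNextSpeaker` function and the `stageTime` constant are hypothetical names made up for this example.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Participant is a hypothetical, simplified view of a room member.
type Participant struct {
	Name       string
	HandRaised bool
	OnStage    bool
}

// stageTime is how long each speaker keeps the stage.
const stageTime = time.Minute

// pickNextSpeaker returns one audience member with a raised hand.
// pick must return an index in [0, n); passing rand.Intn gives a
// uniformly random choice. The boolean is false when nobody is waiting.
func pickNextSpeaker(room []Participant, pick func(n int) int) (Participant, bool) {
	var candidates []Participant
	for _, p := range room {
		// People already on stage are not eligible for another turn yet.
		if p.HandRaised && !p.OnStage {
			candidates = append(candidates, p)
		}
	}
	if len(candidates) == 0 {
		return Participant{}, false
	}
	return candidates[pick(len(candidates))], true
}

func main() {
	room := []Participant{
		{Name: "alice", HandRaised: true},
		{Name: "bob"},
		{Name: "carol", HandRaised: true},
		{Name: "dave", HandRaised: true, OnStage: true},
	}
	if next, ok := pickNextSpeaker(room, rand.Intn); ok {
		fmt.Printf("%s gets the stage for %s\n", next.Name, stageTime)
	} else {
		fmt.Println("nobody has a hand raised")
	}
}
```

Passing the random source as a function keeps the selection logic easy to test: a test can supply a fixed picker instead of `rand.Intn`.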
Since Clubhouse has no official API, Matt uses an iPhone with the official Clubhouse app. The phone is connected to a computer running a program that captures audio conversations in the room, plays back responses, and manages participants in the room.
Source code is available here: https://github.com/knyar/housebot
Implementation
- A jailbroken iPhone which has cert pinning disabled to allow sniffing Clubhouse API requests made by the app.
- A separate WiFi access point that passes all web traffic through mitmproxy and keeps a log of all Clubhouse API requests and their content.
- The housebot service written in Go that reads the log and keeps track of all room participants and their status (hands raised, etc). This service can also send requests to the Clubhouse API to add people to “the stage” and remove them.
- The phone has audio input and output connected to the computer, which allows housebot to record audio and play back a response.
- A new speaker gets chosen from the people who have their hands raised. They get the stage for one minute and their audio is streamed to the Speech to Text API. The app maintains a transcript of what has recently been said in the room.
- Occasionally the recorded transcript is sent to the GPT-3 API and an AI-generated response is created.
- The response is fed to the Text to Speech API, and played back to the room.
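The transcript bookkeeping from the steps above can be sketched roughly as follows. This is a guess at one reasonable shape, not the real housebot implementation: the `Utterance` and `Transcript` types, the time window, and the "speaker: text" prompt layout are all assumptions made for illustration, and the calls to the speech and GPT-3 APIs are left out entirely.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Utterance is one chunk of recognized speech (hypothetical type).
type Utterance struct {
	Speaker string
	Text    string
	At      int64 // Unix time in seconds
}

// Transcript keeps only what was said within the last WindowSeconds.
type Transcript struct {
	WindowSeconds int64
	items         []Utterance
}

// Add appends an utterance and drops entries that are older than
// WindowSeconds relative to the newly added one.
func (t *Transcript) Add(u Utterance) {
	t.items = append(t.items, u)
	cutoff := u.At - t.WindowSeconds
	keep := t.items[:0] // filter in place, reusing the backing array
	for _, it := range t.items {
		if it.At >= cutoff {
			keep = append(keep, it)
		}
	}
	t.items = keep
}

// Prompt renders the transcript as "speaker: text" lines and ends with
// the bot's name, so a text completion model continues in the bot's voice.
func (t *Transcript) Prompt(botName string) string {
	var b strings.Builder
	for _, it := range t.items {
		fmt.Fprintf(&b, "%s: %s\n", it.Speaker, it.Text)
	}
	b.WriteString(botName + ":")
	return b.String()
}

func main() {
	now := time.Now().Unix()
	tr := &Transcript{WindowSeconds: 120}
	tr.Add(Utterance{Speaker: "alice", Text: "so I was thinking about voice bots", At: now - 300})
	tr.Add(Utterance{Speaker: "bob", Text: "do you think they could pass for a human", At: now})
	fmt.Println(tr.Prompt("Matt"))
}
```

Bounding the transcript by time keeps the prompt short and focused on the current topic, which matters because the conversation drifts as speakers rotate every minute.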
But does it work?
Mostly!
A Clubhouse-style voice conversation proved to be tricky material for automation. When humans speak with each other we rarely talk in full sentences, so the transcript produced by the Speech to Text API is often a long stream of words rather than several distinct sentences. This stream of words is something that GPT-3 has a harder time interpreting and responding to, probably because it was trained on a data set of written (rather than spoken) text.
Surprisingly, the most difficult part was getting the audio transmitted from the phone to the computer. I spent a few days trying to emulate a Bluetooth headset on Linux, but in the end just used external USB audio cards to capture and play sound.