Matt, the Clubhouse bot

January 2021

Background

In January 2021, nine months after the beginning of the pandemic of our lifetime, people decided that what they really needed was more meetings on their phones, and an obscure social network called Clubhouse became hugely popular for a couple of months.

Clubhouse allows users to create and join “rooms”, which are long-running voice-only conference calls. Each room has “the stage”: one or several users who are allowed to speak; everyone else is in “the audience” and can only listen. Audience members can use the “raise hand” feature to request speaking permissions, and room moderator(s) can add and remove people from the stage.

Text chat bots have been a thing for a long time, and I got curious whether it was possible to create a believable voice chat bot using publicly available APIs.

Overview

Please meet Matt: the Clubhouse bot.

Matt is an automatic Clubhouse room moderator. You can join a room managed by Matt, raise your hand, and get the stage for one minute to speak. After that minute you are moved back to the audience, and another speaker is chosen randomly from the audience members who have their hands raised. Matt will occasionally join the conversation as well.
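To make the rotation rule concrete, here is a minimal sketch of the speaker selection logic in Python. It is only an illustration of the rule described above, not the code from the actual bot: the class name SpeakerRotation, its methods, and the STAGE_SECONDS constant are all made up for this example, and the part that actually moves people on and off the stage in the app is not shown.

```python
import random
from typing import Optional, Set

STAGE_SECONDS = 60  # one minute per speaker, as described above


class SpeakerRotation:
    """Hypothetical model of the rule: one speaker at a time, picked at random."""

    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self.rng = rng or random.Random()
        self.raised_hands: Set[str] = set()
        self.current_speaker: Optional[str] = None
        self.started_at = 0.0

    def raise_hand(self, user: str) -> None:
        # The person already on stage does not need to queue again.
        if user != self.current_speaker:
            self.raised_hands.add(user)

    def lower_hand(self, user: str) -> None:
        self.raised_hands.discard(user)

    def tick(self, now: float) -> Optional[str]:
        """Advance the rotation and return whoever is on stage afterwards."""
        if (self.current_speaker is not None
                and now - self.started_at >= STAGE_SECONDS):
            # Time is up: back to the audience.
            self.current_speaker = None
        if self.current_speaker is None and self.raised_hands:
            # Sort first so a seeded RNG gives a reproducible choice.
            chosen = self.rng.choice(sorted(self.raised_hands))
            self.raised_hands.remove(chosen)
            self.current_speaker = chosen
            self.started_at = now
        return self.current_speaker


if __name__ == "__main__":
    rotation = SpeakerRotation()
    rotation.raise_hand("alice")
    rotation.raise_hand("bob")
    print(rotation.tick(0.0))   # one of the two, chosen at random
    print(rotation.tick(60.0))  # the other one, after the first minute is up
```

In this sketch the caller is expected to call tick() periodically with the current time (for example, from time.monotonic()) and act on any change in the returned speaker.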

Since Clubhouse has no official API, Matt uses an iPhone running the official Clubhouse app. The phone is connected to a computer running a program that captures the audio of conversations in the room, plays back responses, and manages participants in the room.

Source code is available here: https://github.com/knyar/housebot

Implementation

But does it work?

Mostly!

A Clubhouse-style voice conversation proved to be tricky material for automation. When humans speak with each other we rarely talk in full sentences, so the transcript produced by the Speech to Text API is often a long stream of words rather than several distinct sentences. This stream of words is something that GPT-3 has a harder time interpreting and responding to, probably because it was trained on a data set of written (rather than spoken) text.

Surprisingly, the most difficult part was getting the audio transmitted from the phone to the computer. I spent a few days trying to emulate a Bluetooth headset on Linux, but in the end just used external USB audio cards to capture and play sound.