ChatGPT was released just over a year ago (at the end of November 2022), and countless people have already written about their experiences using it in all sorts of settings. (I even contributed my own hot take last year with my O’Reilly Radar article Real-Real-World Programming with ChatGPT.) What more is left to say by now? Well, I bet very few of those people have actually chatted with ChatGPT. And by “chat” I mean the original sense of the word—to hold a back-and-forth verbal conversation with it just like how you would chat with a fellow human being. I recently chatted with ChatGPT, and I want to use that experience to reflect on the usability of voice interfaces for AI tools based on Large Language Models. I’m personally interested in this topic since I am a professor who researches human-computer interaction, user experience design, and cognitive science, so AI voice interfaces are fascinating to me.

Here’s what I did: In December 2023 I installed the official ChatGPT iOS app from OpenAI on my iPhone and used its voice input mode to hold several hour-long conversations with it while driving long-distance on California highways. I wore standard Apple earbuds with a built-in mic and talked with ChatGPT just like how I would be talking to someone on the phone while driving. These long solo drives were the perfect opportunity to test out ChatGPT’s voice feature because I couldn’t interact with the app using my hands for safety reasons.

I had a very clear use case in mind: I wanted a conversation partner to keep me awake and alert while driving long-distance by myself. I’ve found that listening to music or podcasts doesn’t keep me alert when I’m tired because it’s such a passive experience—but what does keep me awake is having someone to talk to, either in the car or remotely on the phone. Could ChatGPT replace a human conversation partner in this role?

The Good: ChatGPT Made Personalized Podcasts to Keep Me Engaged While Driving

To not bury the lede: it turns out that ChatGPT did a remarkable job! As I was driving I was able to engage in several hour-long conversations with ChatGPT that ended only because I had to take a rest stop or hit the usage limit for GPT-4. (I pay for a ChatGPT Plus subscription so I can use the most advanced GPT-4 model, but that comes with a usage limit that I usually hit after about an hour.)

The best way to describe my experience is (borrowing a wonderful term my friend coined) that it felt like listening to a personalized podcast. Since ChatGPT did most of the talking, it was a mostly passive listening experience on my part except for times when I wanted to ask follow-up questions or direct it to change topics. Critically, this meant I could still focus most of my attention on driving safely with a level of distraction on par with listening to a podcast. But it kept me more alert than a regular podcast since I could actively direct the flow of the conversation.

For a concrete example of what such a personalized podcast felt like, I started one conversation by straight-up asking ChatGPT to keep me awake while I was driving in Southern California from Los Angeles to San Diego. It started by making small talk about road trips in general and asking me about various California landmarks that I’ve visited, culminating in asking me more about San Diego (where I live). When it asked me which places I liked visiting the most here, I mentioned the San Diego Zoo, and it started telling me a bit about what makes this particular zoo notable. It mentioned the concept of “naturalistic enclosures”—a term I had not heard before—so I asked it to elaborate on what this meant. ChatGPT’s explanation of this concept got me interested in the history of zoos, especially the progression from keeping animals in cages to today’s cage-less naturalistic enclosures, which aim to be better for animal welfare. During that segment it mentioned the term “menagerie” in passing, which I had not heard in that context before, so I asked it to elaborate. It then went back farther in history to describe how a menagerie refers to ancient rulers keeping exotic animals for display without as much regard for the animals’ well-being.

Listening to that made me realize that I had actually heard the term menagerie in reference to a Star Trek episode of some sort, but I forgot which one, so I asked ChatGPT to jog my memory. It turns out that The Menagerie was a very famous episode of the original Star Trek television series, so after chatting about that episode and other famous Star Trek episodes for a bit, we got onto the topic of why that show was canceled after only three seasons but later found a much larger audience in syndication (i.e., re-runs). That in turn got me curious about the concept of syndication in the television business, so ChatGPT dived more into this topic. A few more conversational twists and turns later, I suddenly realized that the hour had flown by and it was time to pull over for a bathroom break. Success!


Now, I don’t expect you to care at all about the details of the conversation I just described since it wasn’t your conversation—it was mine! But I certainly cared about it at the time since I was genuinely curious to learn more about the topics that ChatGPT mentioned, often offhand in the midst of telling me about something else. It felt a bit like diving down a Wikipedia rabbit hole of following related links, where each follow-up question I asked led it down another meandering path. It was perfect for keeping me from getting bored and sleepy during my long drive.

ChatGPT isn’t just good at this sort of superficial “personalized podcast about Wikipedia-level trivia” … it could also engage me in a more substantive conversation about a task I actually needed help with at the moment. In another hour-long car chat, I prompted ChatGPT to help me design a method for organizing my huge collection of almost 30 years’ worth of personal and work-related files for backup. I’ve been diligent about data backup throughout my life, but my files are fragmented across the different media I’ve used over the years—CDs and DVDs burned back in the day, several generations of external hard drives (in various states of decay), university servers, Dropbox, and other cloud services. For years I had an aspirational goal of unifying all of my backups into one central directory tree, akin to the concept of a monorepo in software development. I’ve recently been brainstorming ideas for how to design such a system and how to deal with the practical challenges of scaling and maintenance, so I figured that ChatGPT could help me brainstorm during one of my long drives. Again it did a good job of engaging me in this bespoke conversation, and the hour flew by before I had to take a rest stop. I won’t bore you with details of what we discussed, but it felt like talking with an expert in data management who was giving me advice about how to deal with my particular challenge.
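To give just a flavor of the kind of consolidation scheme we were circling (this is my own rough sketch, not ChatGPT’s verbatim advice, and the source paths are made up), the core idea is a script that sweeps several old backup sources into one unified tree, skipping exact duplicates by content hash:

```python
# Minimal sketch of a backup-consolidation pass (hypothetical paths).
# It walks several old backup sources and copies each file into one
# unified tree, skipping exact duplicates by content hash.
import hashlib
import shutil
from pathlib import Path

SOURCES = [Path("/Volumes/old_drive_2009"), Path("/Volumes/dvd_rips"),
           Path("~/Dropbox").expanduser()]          # hypothetical sources
DEST = Path("~/unified_backup").expanduser()        # the "monorepo" root

def file_hash(path: Path) -> str:
    """Hash file contents so identical files from different media collapse to one copy."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

seen: set[str] = set()
for source in SOURCES:
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        digest = file_hash(src)
        if digest in seen:
            continue                                # exact duplicate already copied
        seen.add(digest)
        dest = DEST / source.name / src.relative_to(source)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)                     # copy2 preserves timestamps
```

A real version would also need to handle near-duplicates, conflicting versions of the same file, and incremental re-runs, which is exactly the kind of scaling-and-maintenance detail I was brainstorming about.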

Intermission: Why It Feels Kind of Magical

Skeptical readers may be thinking at this point, “What’s the big deal, it’s just ChatGPT under the hood. I can already do all this from my computer by typing into the ChatGPT text box!” Although that’s technically true, there’s something magical about being able to do this all hands-free via voice. If you don’t believe me, just try it for an hour. My folk theory is that speaking and listening are hard-wired into our brain’s innate language circuitry, but writing and reading are learned skills (i.e., “software” rather than “hardware” in our brains). And that’s why it feels more magical to hold a verbal conversation with an AI versus having the exact same conversation in a text box on a screen. If the AI is good enough, then it almost feels like you’re talking to a real person … at certain times when I was getting deep into a back-and-forth conversation I nearly forgot I was talking to a machine. However, that illusion broke in several ways …

The Not-So-Good: Usability Limitations of the ChatGPT Voice Interface

Despite my positive experiences with ChatGPT’s voice mode, it still didn’t live up to the gold standard of feeling like I was talking with a fellow human being. That’s okay, though, since this is an incredibly high bar! Here are some of the ways it fell short.

Must speak entire request all at once: Most notably, it felt unnatural to have to speak my entire request all at once without pausing. Whenever I paused for too long, ChatGPT would interpret what I had said so far as my full request and start processing it. As an analogy, when typing a request in a text chat, you can hit the Enter or Send button when you’re ready … imagine how weird it would be if ChatGPT started answering you the very moment you stopped typing for one second! Note that in human conversations, especially face-to-face, we use visual cues to tell whether our conversation partner is done talking or whether they are pausing a bit to think about the next thing to say. Even over the phone, we can tell by vocal inflections whether they are temporarily paused and want to keep talking, or whether they are done with their turn and ready for us to respond. Since ChatGPT can’t do any of that (yet!), I often had to think hard about what I wanted to say and then say it all at once without pausing. This was fine for simple requests like “Tell me more about naturalistic enclosures in zoos,” but for more complex requests like describing some facet of my data backup setup, it was painful to have to blurt out as much as I could without pausing. Even more annoyingly, I would sometimes make mistakes when talking so much all at once. Ideally the app would do a better job of detecting pauses in human speech, taking both context and vocal intonation into account. An easier hack would be a voice command like “DONE” or “OVER” (like when people use walkie-talkies) to signal that I am done talking; however, this would also feel unnatural for casual users.

Unpredictable wait times: Wait times (latency) for ChatGPT’s responses are unpredictable, and there aren’t audio cues to help me form an expectation for how long I need to wait before it responds. There’s a click sound when it starts processing my request, but then I may need to wait a few seconds in silence before hearing a response … maybe it’s only one second or maybe it’s five. That said, if I ask it to browse the web, then it plays a continuous waiting sound; web browsing takes longer, maybe ten to twenty seconds, but at least I get to hear a “waiting” sound. (I don’t mind ChatGPT taking longer here, since a human would also take more time to look something up on the web. However, web browsing is annoying when I don’t explicitly ask for it: oftentimes I want a fast answer but something I say triggers a browse I never intended.) In contrast, when speaking with a human face-to-face, I can use visual cues to tell whether the other person is deep in thought or about to respond; and even over the phone, the other person may say “ummm” or “hold on one sec, lemme think” or “ok, let me look this up on the web, hang tight for a while …” if they need more time to think through their response. Since I don’t get any of these verbal cues from ChatGPT, the unpredictable wait times break the illusion of talking to a person.

Cannot interrupt while it is speaking: I always had to wait for ChatGPT to completely finish talking before it would listen to my next request. And since I never knew ahead of time how long it planned to talk during a particular turn (i.e., how many words its LLM-generated response would be), it was aggravating to have to wait when I wanted to say something midway. I later saw that I could actually interrupt it by tapping the app on my phone screen, but since I was driving hands-free, I couldn’t safely do that. Also, that seems like a cumbersome interaction; I should be able to just talk when I want to, even while it is talking. This limitation made the conversation feel like we were using a walkie-talkie where only one party can talk at a time. And it’s not just me—this concept of overlapping speech is widely-studied in linguistics and communication research. Humans naturally talk over one another for various reasons, so not being able to do this with ChatGPT made our conversation feel less fluid. Even a simple voice command for interruption would be great: maybe if I say “pause” or “wait,” it could stop and await my next request.

Speech recognition errors: ChatGPT’s speech recognition system (presumably based on OpenAI’s open-source Whisper model) is very good, but it does at times misinterpret what I’m saying. What’s stranger is that sometimes it thinks I said something when I didn’t, maybe because it picked up on background rumbles in my car. Several times I wasn’t saying anything and it suddenly responded out of the blue; when I checked the written transcript later, it thought I had said something like “Thank you for watching!” (which I never said). At other times it tried to prematurely end the conversation even though I wasn’t done, maybe because it mistakenly detected something along the lines of “Thanks …” without any follow-up. Misrecognizing words is forgivable, but I feel that it shouldn’t ever interpret background sounds as words. (I sketch one possible way to filter out such phantom transcriptions right after this list.) Of course, if there were other people in the car with me and either they talked or I talked to them, then I could understand how ChatGPT would mistakenly interpret that as a request directed at it; always-listening home assistants like Alexa have had this issue for years. A more advanced AI would learn to filter out other people’s voices and also infer when I was speaking with someone else rather than with it. For instance, if it detects that my sentence is way off topic, maybe that means I’m speaking with someone else in the car; it could at least ask me “Were you talking to me just now?” when it is uncertain. More generally, explicitly asking me for clarification when it is uncertain would go a long way toward making these interactions feel more human; that’s what I (a representative human!) would do if I were on a noisy phone connection and didn’t hear someone clearly.

Overly-agreeable artificial tone: It’s still ChatGPT under the hood, so all the regular limitations of ChatGPT apply here. Most notably, ChatGPT is tuned to be overly-friendly and overly-agreeable (sounding like a customer service agent), so it will simply go along with whatever you assert. Thus, by default it will not be good at pushing back on you or challenging your thinking in any meaningful way, just as you wouldn’t expect a customer service agent to challenge what you say. Moreover, the overly-friendly tone of its responses could come off as insincere and almost sarcastic at times, even though that wasn’t the designers’ intent. Relatedly, it had a tendency to ask me superficial questions after it responded, which sounded mildly condescending and broke the flow of our chat—like, “Sooo, what do YOU think about the San Diego Zoo? What’s YOUR favorite part of the zoo?!?” … when a normal human wouldn’t break the conversational flow so awkwardly.

Limited to what’s been discussed online: Finally, ChatGPT is trained on data from the public internet (and can also browse the web to get more up-to-date content), so it won’t do as well if you’re asking about things that haven’t been discussed much online.
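As promised above, here’s one way a developer could tamp down those phantom transcriptions. I have no idea how OpenAI’s app handles this internally, but the open-source Whisper library exposes a per-segment no-speech probability that can be used to drop segments that are probably just background noise (the 0.6 threshold and the audio filename below are made up for illustration):

```python
# Sketch: drop Whisper segments that are likely hallucinated from background noise.
# Assumes the open-source openai-whisper package; the threshold is arbitrary.
import whisper

model = whisper.load_model("base")
result = model.transcribe("car_audio_clip.wav")     # hypothetical recording

kept = []
for seg in result["segments"]:
    # no_speech_prob is Whisper's own estimate that the segment contains no speech;
    # phantom lines like "Thank you for watching!" tend to score high here.
    if seg["no_speech_prob"] > 0.6:
        continue
    kept.append(seg["text"].strip())

print(" ".join(kept))
```

Thresholding like this would occasionally drop real speech, of course, which is why the “just ask me for clarification when uncertain” behavior I described above still seems like the better long-term fix.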


To summarize the above limitations, chatting with ChatGPT on my phone felt like using a walkie-talkie over a noisy channel to talk to an overly-agreeable but socially-unaware customer service agent who has extensive knowledge about the contents of the public internet.

Parting Thoughts: Cautiously Optimistic About the Future

Despite these limitations, I’m excited to see what’s in store for future voice interfaces to LLM-based AI tools like ChatGPT. My early experiences of talking with ChatGPT while driving gave me a glimpse into what many of us have seen growing up in sci-fi shows such as Star Trek, where people can talk to an omnipresent computer to ask questions, hold conversations, or issue commands. Hands-free operation isn’t useful only while driving—it can make computing truly ubiquitous by letting us seamlessly interact with computation while we are in the midst of doing housework, cooking, or childcare; and it can make computing more accessible to broader groups of people, such as those with mobility impairments.

We still have a long way to go, though. Right now the ChatGPT iPhone app isn’t hooked up to external tools besides a basic web browser, but with the recently-announced GPT Store (and likely upcoming LLM app stores from other companies) it will soon be possible to hook up LLMs to a variety of tools that can manage our emails, shopping lists, personal finances, home automation, and more. Recent research has started exploring these ideas by connecting ChatGPT to home assistants such as Amazon Alexa (2023 arXiv paper PDF). Another promising line of work is better context awareness: for instance, Meta and Ray-Ban recently announced new Smart Glasses that allow users to chat with an AI assistant that can see what they are seeing (review from The Verge). In my driving scenario, you could imagine wearing these glasses and having the AI act more like a passenger sitting alongside you in the car, seeing what you see, rather than someone on the other end of a phone call. Critically, a passenger can pause the conversation and tell you to watch the road more carefully if they see a possible danger ahead; a future AI powered by such smart glasses may be able to do the same thing. Alternatively, cars are now starting to directly embed AI into their entertainment systems (e.g., Volkswagen announcement at CES 2024), so future iterations could integrate cameras and 3-D tracking to complement LLMs. One could also imagine smartglasses-based multimodal interactions where you point to objects in your physical environment and start conversations with the AI assistant about your surroundings (check out this MKBHD YouTube Short showing AI chat with smart glasses).
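To make the “hooking LLMs up to tools” idea concrete, here is a minimal sketch using the OpenAI API’s function-calling interface; the shopping-list tool, its name, and its parameters are invented for illustration, and this is not how the ChatGPT app itself is wired internally. The key point is that the model never executes anything itself—it only returns a structured request that your own code decides whether and how to act on:

```python
# Minimal sketch of wiring an LLM to an external "tool" via OpenAI function calling.
# The shopping-list tool is invented for illustration; requires OPENAI_API_KEY.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "add_to_shopping_list",       # hypothetical tool name
        "description": "Add an item to the user's shopping list",
        "parameters": {
            "type": "object",
            "properties": {"item": {"type": "string"}},
            "required": ["item"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Please add oat milk to my shopping list."}],
    tools=tools,
)

# The model doesn't run anything; it returns a structured request for our code to handle.
for call in response.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(f"Model wants to call {call.function.name} with {args}")
```

That last step is where the safety questions below come in: some piece of deterministic code, not the LLM, should remain the gatekeeper for what actually gets executed.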


Of course, these increasingly-intense levels of AI interaction and automation come with risks, such as user overreliance, unintended command execution, mental or physical health hazards, and security/privacy violations. Thus, it will be important to design ways to both manage those risks and educate users about how to safely operate these increasingly-powerful systems. Thank you very much for reading. Sooo, what do YOU think about ChatGPT’s voice mode?!? What are YOUR favorite and least favorite parts?
