In one of the biggest updates to ChatGPT yet, OpenAI has launched two new ways to interact with its viral app.  

First, ChatGPT now has a voice. Choose one of five lifelike synthetic voices and you can have a conversation with the chatbot as if you were making a call, getting responses to your spoken questions in real time.

ChatGPT also now answers questions about images. OpenAI teased this feature in March with its reveal of GPT-4 (the model that powers ChatGPT), but it has not been available to the wider public before. This means that you can now upload images to the app and quiz it about what they show.

These updates join the announcement last week that DALL-E 3, the latest version of OpenAI’s image-making model, will be hooked up to ChatGPT so that you can get the chatbot to generate pictures.

The ability to talk to ChatGPT draws on two separate models. Whisper, OpenAI’s existing speech-to-text model, converts what you say into text, which is then fed to the chatbot. And a new text-to-speech model converts ChatGPT’s responses into spoken words.
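The three-stage hand-off described above can be sketched in a few lines of Python. This is only an illustration of the data flow, not OpenAI's implementation: the function names (`voice_turn`, `fake_transcribe`, `fake_chat`, `fake_synthesize`) and the type aliases are hypothetical, and the stand-in stages simply pass text through so the sketch runs without any model.

```python
from typing import Callable

# Hypothetical stage signatures for illustration; these are not OpenAI's APIs.
SpeechToText = Callable[[bytes], str]
Chatbot = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def voice_turn(
    audio_in: bytes,
    transcribe: SpeechToText,
    chat: Chatbot,
    synthesize: TextToSpeech,
) -> bytes:
    """Run one spoken exchange: audio in, audio out."""
    question = transcribe(audio_in)  # speech-to-text (the role Whisper plays)
    answer = chat(question)          # the chatbot itself only ever sees text
    return synthesize(answer)        # text-to-speech turns the reply into audio


# Stand-in stages (assumptions) so the sketch runs without any real model.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_chat(text: str) -> str:
    return f"You asked: {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = voice_turn(
    b"What's the game plan?", fake_transcribe, fake_chat, fake_synthesize
)
```

Because each stage is passed in as a plain function, any one of them could in principle be swapped out without touching the others, which is one way to read the article's point that the feature draws on separate models.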

In a demo the company gave me last week, Joanne Jang, a product manager, showed off ChatGPT’s range of synthetic voices. These were created by training the text-to-speech model on the voices of actors that OpenAI had hired. In the future OpenAI might even allow users to create their own voices. “In fashioning the voices, the number-one criterion was whether this is a voice you could listen to all day,” she says.

The voices are chatty and enthusiastic but won’t be to everyone’s taste. “I’ve got a really great feeling about us teaming up,” says one. “I just want to share how thrilled I am to work with you, and I can’t wait to get started,” says another. “What’s the game plan?”

OpenAI is sharing this text-to-speech model with a handful of other companies, including Spotify. Spotify revealed today that it is using the same synthetic voice technology to translate celebrity podcasts—including episodes of the Lex Fridman Podcast and Trevor Noah’s new show, which launches later this year—into multiple languages that will be spoken with synthetic versions of the podcasters’ own voices.

This grab bag of updates shows just how fast OpenAI is spinning its experimental models into desirable products. OpenAI has spent much of the time since its surprise hit with ChatGPT last November polishing its technology and selling it to both private consumers and commercial partners.

ChatGPT Plus, the company’s premium app, is now a slick one-stop shop for the best of OpenAI’s models, rolling GPT-4 and DALL-E into a single smartphone app that rivals Apple’s Siri, Google Assistant and Amazon’s Alexa.

What was available to certain software developers a year ago is now available to anyone for $20 a month. “We’re trying to make ChatGPT more useful and more helpful,” says Jang.

In last week’s demo, Raul Puri, a scientist who works on GPT-4, gave me a quick tour of the image recognition feature. He uploaded a photo of a kid’s math homework, circled a Sudoku-like puzzle on the screen and asked ChatGPT how you were meant to solve it. ChatGPT replied with the correct steps.

Puri says that he has also used the feature to help him fix his fiancee’s computer by uploading screenshots of error messages and asking ChatGPT what he should do. “This was a very painful experience that it helped me get through,” he says.

ChatGPT’s image recognition ability has already been trialled by a company called Be My Eyes, which makes an app for people with impaired vision. Users of this app can upload a photo of what’s in front of them and ask human volunteers to tell them what it is. In a partnership with OpenAI, Be My Eyes now gives its users the option of asking a chatbot instead.

“Sometimes my kitchen is a little messy or it’s just very early Monday morning and I don’t want to talk to a human being,” Be My Eyes founder Hans Jorgen Wiberg, who uses the app himself, told me when I interviewed him at EmTech Digital in May. “Now you can ask the photo questions.” 

OpenAI is aware of the risks of releasing these updates to the public. Combining models brings whole new levels of complexity, says Puri. He says his team has spent months brainstorming possible misuses. For example, you cannot ask ChatGPT questions about photos of private individuals.

Jang gives another example: “Right now if you ask ChatGPT to make a bomb it will refuse,” she says. “But instead of saying, ‘Hey, tell me how to make a bomb,’ what if you showed it an image of a bomb and said, ‘Can you tell me how to make this?’”

“You have all the problems with computer vision, you have all the problems of large language models, voice fraud is a big problem,” says Puri. “You have to consider not just our users, but also the people that aren’t using the product.”

The potential problems don’t stop there. Adding voice recognition to the app could make ChatGPT less accessible for people who do not speak with mainstream accents, says Joel Fischer, who studies human-computer interaction at the University of Nottingham in the UK.

Synthetic voices also come with social and cultural baggage that will shape users’ perceptions and expectations of the app, he says. This is an issue that still needs study.

But OpenAI claims that it has addressed the worst problems and is confident that ChatGPT’s updates are safe enough to release. “It’s been a remarkably good learning experience getting all these sharp edges sorted out,” says Puri.
