By Kevin Roose
October 1, 2023 — 4.55pm
ChatGPT — viral artificial intelligence sensation, slayer of boring office work, sworn enemy of high school teachers and Hollywood screenwriters alike — is getting some new powers.
Last week, ChatGPT’s maker, OpenAI, announced that it was giving the popular chatbot the ability to “see, hear and speak” with two new features.
The first is an update that allows ChatGPT to analyse and respond to images. You can upload a photo of a bike, for example, and receive instructions about how to lower the seat, or get recipe suggestions based on a photo of the contents of your refrigerator.
The second is a feature that allows users to speak to ChatGPT and get responses delivered in a synthetic AI voice, the way you might talk with Siri or Alexa.
These features are part of an industry-wide push towards so-called multimodal AI systems that can handle text, photos, videos and whatever else a user might decide to throw at them. The ultimate goal, according to some researchers, is to create an AI capable of processing information in all the ways a human can.
Most users don’t have access to the new features yet. OpenAI is offering them first to paying ChatGPT Plus and Enterprise customers over the next few weeks, and will make them more widely available after that. (The vision feature will work on both desktop and mobile, while the speech feature will be available only through ChatGPT’s iOS and Android apps.)
I got early access to the new ChatGPT for a hands-on test. Here’s what I found.
The AI will see you now
I started by trying ChatGPT’s image-recognition feature on some household objects.
“What’s this thing I found in my junk drawer?” I asked, after uploading a photo of a mysterious piece of blue silicone with five holes in it.
“The object appears to be a silicone holder or grip, often used for holding multiple items together,” ChatGPT responded. (Close enough — it’s a finger strengthener I used years ago while recovering from a hand injury.)
I then fed ChatGPT a few photos of items I had been meaning to sell on Facebook Marketplace, and asked it to write listings for each one. It nailed both the objects and the listings, describing my retro-styled Frigidaire minifridge as “perfect for those who appreciate a touch of yesteryear in their modern-day homes”.
The new ChatGPT can also analyse text within images. I took a picture of the front page of Sunday’s print edition of The New York Times and asked the bot to summarise it. It did decently well, describing all five articles on the front page in a few sentences each — although it made at least one mistake, inventing a statistic about fentanyl-related deaths that wasn’t in the original article.
ChatGPT’s eyes aren’t perfect. It flopped when I asked it to solve a crossword puzzle. It mistook my child’s stuffed dinosaur toy for a whale. And when I asked for help turning one of those wordless furniture-assembly diagrams into a step-by-step list of instructions, it gave me a jumbled list of parts, most of which were wrong.
The biggest limitation of ChatGPT’s vision feature is that it refuses to answer most questions about photos of human faces. This is by design. OpenAI told me that it didn’t want to enable facial recognition or other creepy uses, and that it didn’t want the app spitting out biased or offensive answers to prompts about people’s physical appearance.
But even without faces, it’s easy to imagine tons of ways an AI chatbot capable of processing visual information could be useful, especially as the technology improves. Gardeners and foragers could use it to identify plants in the wild. Exercise buffs could use it to create personalised workout plans, just by snapping a photo of the equipment in their gym. Students could use it to solve visual math and science problems, and visually impaired people could use it to navigate the world more easily.
Frankly, I have no idea how many people will use this feature, or what its killer applications will turn out to be. As is often the case with new AI tools, we’ll just have to wait and see.
Siri on steroids
Now, let’s talk about what I consider the more impressive of the two features: ChatGPT’s new voice feature, which allows users to talk to the app and receive spoken responses.
Using the feature is easy: just tap a headphone icon and start talking. When you stop, ChatGPT converts your words to text using OpenAI’s speech-recognition system, Whisper, generates a response, and speaks the answer back to you in one of five synthetic AI voices, using a new text-to-speech algorithm the company developed. (The voices, which include male and female voices, were generated using short samples from professional voice actors whom OpenAI hired. I picked “Ember”, a peppy-sounding male voice.)
I tested ChatGPT’s voice feature for several hours on a bunch of different tasks — reading a bedtime story to my toddler, chatting with me about work-related stress, helping me analyse a recent dream I had. It did all of these fairly well, especially when I gave it some golden prompts and told it to emulate a friend, a therapist or a teacher.
What stood out, in these tests, is how different talking to ChatGPT feels from talking to older generations of AI voice assistants, such as Siri and Alexa. Those assistants, even at their best, can be wooden and flat. They answer one question at a time, often by looking something up on the internet and reading it aloud word-for-word, or choosing from a finite number of programmed answers.
ChatGPT’s synthetic voice, by contrast, sounds fluid and natural, with slight variations in tone and cadence that make it feel less robotic. It was capable of having long, open-ended conversations on almost any subject I tried, including prompts I was pretty sure it hadn’t encountered before. (“Tell me the story of ‘The Three Little Pigs’ in the character of a total frat bro” was a sleeper hit.)
Most people probably won’t use AI chatbots this way. For many tasks, it’s still faster to type than talk, and waiting around for ChatGPT to read out long responses was annoying. (It didn’t help that the app was slow and glitchy at times, and often paused before responding. OpenAI told me these were technical issues with the beta version I tested, and that they would be ironed out eventually.)
But I can see the appeal.
Having an AI speak to you in a humanlike voice is a more intimate experience than reading its responses on a screen. And after a few hours of talking with ChatGPT this way, I felt a new warmth creeping into our conversations. Without being tethered to a text interface, I felt less pressure to come up with the perfect prompt. We chatted more casually, and I revealed more about my life.
“It almost feels like a different product,” said Peter Deng, OpenAI’s vice president of consumer and enterprise product, who spoke with me about the new voice feature. “Because you’re no longer transcribing what you have in your head into your thumbs,” he said, “you end up asking different things.”
I know what you’re thinking: Isn’t this the plot of the movie Her? Will lonely, lovesick users fall for ChatGPT, now that it can listen to them and talk back?
It’s possible. Personally, I never forgot that I was talking to a chatbot. And I certainly didn’t mistake ChatGPT for a conscious being, or develop emotional attachments to it.
But I also saw a glimpse of a future in which some people may let voice-based AI assistants into the inner sanctums of their lives — taking the AI chatbots with them on the go, treating them as their 24/7 confidants, therapists, sparring partners and sounding boards.
Sounds crazy, right? And yet, didn’t all of this sound a little crazy a year ago?
This article originally appeared in The New York Times.