I spoke with an AI version of myself, thanks to Hume’s free tool – how to try it

[Image: AI voice concept. Credit: Chiken Brave/Getty Images]

If you’ve ever had the urge to converse with an AI version of yourself, now you can — kind of. 

On Thursday, AI start-up Hume announced the launch of a new “hyperrealistic voice cloning” feature for the latest iteration of its Empathic Voice Interface (EVI) model, EVI 3, which was unveiled last month. The idea is that by uploading a short audio recording of yourself speaking — ideally between 30 and 90 seconds — the model should be able to quickly churn out an AI-generated replica of your voice, which you can then interact with verbally, just as you would with another person standing in front of you. 

I uploaded a recording of my voice to EVI 3 and spent some time idly chatting with the model’s imitation of my voice. I was hoping (perhaps naively) for an uncanny valley experience — that rare, unsettling feeling of interacting with something that seems almost completely real, yet is just off-kilter enough to make you slightly uneasy — and was disappointed when the EVI 3 version of me turned out to be more like an audio cartoon of myself. 

Let me unpack that a bit.

Using EVI 3’s voice cloning feature

[Image: Voice cloning using Hume. Screenshot by Webb Wright/ZDNET]

The imitation of my voice was, in some ways, undeniably realistic. It paused mid-sentence in more or less the same way I tend to, and it even carried a touch of my familiar vocal fry. But the mirroring stopped there.

Hume claims in its blog post that EVI 3’s new voice cloning feature can capture “aspects of the speaker’s personality.” This is a vague promise (probably intentionally so), but in my own trials, the model seemed to fall short in this regard. Far from feeling like a convincing simulation of my behavioral quirks and sense of humor, the model spoke with a chipper, eager-to-please tone that would’ve been well-suited to a radio ad for antidepressants. I like to think of myself as friendly and generally upbeat, but the AI was obviously exaggerating those particular character traits.

Despite its generally puppy-like demeanor, the model was strangely staunch in its refusal to try speaking in an accent, which seemed like exactly the kind of playful vocal exercise it would excel at. When I asked it to give an Australian accent a whirl, it said “g’day” and “mate” once or twice in my normal voice, then immediately shied away from anything more daring. And no matter what I prompted it to talk about, it found some creative and roundabout way to circle back to the topic I had been discussing in the voice sample I recorded for it, reminiscent of an experiment from Anthropic last year in which Claude was tweaked to become obsessed with the Golden Gate Bridge.

In my second trial, for example, I had recorded myself speaking about Led Zeppelin, which I’d been listening to earlier that morning. When I then asked EVI 3’s voice clone of myself to elucidate its thoughts on the nature of dark matter, it quickly found a way to bring its response back to the subject of music, comparing the mysteriously invisible force pervading the cosmos with the intangible melody that imbues a song with meaning and power.

You can try EVI 3’s new voice cloning feature for yourself here.

According to Hume’s website, user data produced from interactions with the EVI API are collected and anonymized by default in order to train the company’s models. You can turn this off, however, through the “Zero data retention” feature in your profile. For non-API products, including the demo linked above, the company says it “may” collect and use data to improve its models—but again, you can toggle this off if you create a personal profile. 
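
For developers who want to go beyond the web demo and wire EVI into their own projects, the data-retention settings above apply to the API as well. Below is a minimal, purely illustrative sketch of what a single text turn over EVI’s streaming API might look like; the WebSocket endpoint, the api_key query parameter, and the “user_input”/“assistant_end” message types are assumptions based on Hume’s public documentation, not verified details.

```python
# Illustrative sketch only. The endpoint URL, auth query parameter, and
# message types ("user_input", "assistant_end") are assumptions about Hume's
# EVI streaming API; check the official docs before relying on them.
import asyncio
import json
import os

import websockets  # pip install websockets

EVI_URL = "wss://api.hume.ai/v0/evi/chat"  # assumed EVI streaming endpoint


async def chat_once(text: str) -> None:
    api_key = os.environ["HUME_API_KEY"]  # API key from your Hume profile
    async with websockets.connect(f"{EVI_URL}?api_key={api_key}") as ws:
        # Send one text turn; EVI streams back transcript and audio events.
        await ws.send(json.dumps({"type": "user_input", "text": text}))
        while True:
            event = json.loads(await ws.recv())
            print(event.get("type"))
            if event.get("type") == "assistant_end":  # assumed end-of-turn event
                break


if __name__ == "__main__":
    asyncio.run(chat_once("Hello, other me."))
```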

Whispering robots

AI voices have been around for quite a while, but they’ve historically been rather limited in their realism; it’s obvious you’re talking to a robot when you get a response from classic Siri or Alexa, for example. In contrast, a new wave of AI voice models, EVI 3 among them, has been engineered not only to speak in natural language but also, and more importantly, to mimic the subtle inflections, intonations, idiosyncrasies, and cadences of real, everyday human speech.

“A big part of human communication is emphasizing the right words, pausing at the right times, using the right tone of voice,” Hume CEO and chief scientist Alan Cowen told me.

As Hume wrote in a blog post on Thursday, EVI 3 “knows what words to emphasize, what makes people laugh, and how accents and other voice characteristics interact with vocabulary.” According to the company, this marks a major technical leap forward from earlier speech-generating models, “which lack a meaningful understanding of language.”

Many AI experts would take issue with the use of words like “understanding” in this context, since models like EVI 3 are trained merely to detect and recreate patterns gleaned from vast swathes of training data, a process that arguably leaves no room for anything we’d recognize as true semantic comprehension.

EVI 3 was trained “on trillions of tokens of text and then millions of hours of speech,” according to Hume’s blog post. Cowen said this approach alone has enabled the model to speak in voices far more realistic than one might intuitively expect. “With voice [models], what’s been most surprising is how human [they] can be just by training on a lot of data,” he said. 

But philosophical arguments aside, the new wave of AI voice models is uncontroversially impressive. When prompted, they can explore a much vaster range of vocal expression than their predecessors. Companies like Hume and ElevenLabs claim that these new models will have practical benefits for industries like entertainment and marketing, but some experts fear that they’ll open new doors for deception — as was illustrated just last week when an unknown person used AI to imitate the voice of US Secretary of State Marco Rubio and subsequently deployed the voice clone in an attempt to dupe government officials.

“I don’t see any reason that we would need a robot whispering,” Emily M. Bender, a linguist and coauthor of The AI Con, recently told me. “Like, what’s that for? Except maybe to disguise the fact that what you’re listening to is synthetic?”

Revolutionary becomes routine

Yes, EVI 3’s voice cloning feature, like all AI tools, has its shortcomings. But those are significantly overshadowed by its remarkable qualities.

For one thing, we should remember that the generative AI models hitting the market today are still in the technology’s infancy, and they’ll only continue to improve. In less than three years, we’ve gone from the public release of ChatGPT to AI models that can more or less convincingly simulate real human voices, and to tools like Google’s Veo 3, which can produce realistic video with synchronized audio. The breathtaking pace of generative AI advancement should give us pause, to say the least.

Today, EVI 3 can produce a rough approximation of your voice. It’s not unreasonable to expect, however, that its successor — or perhaps its grand-successor — will be able to capture your voice in a way that feels truly convincing. In such a world, one can imagine EVI or a similar voice-generating model being paired with an AI agent to, say, join Zoom meetings on your behalf. It could also, less optimistically, be a scam artist’s dream come true. 

Perhaps the most striking fact about my experience interacting with EVI 3’s voice cloning feature, though, is how mundane this technology already feels. 

As the pace of technological innovation accelerates, so too does our capacity for instantly normalizing things that would have stunned previous generations into awestruck silence. OpenAI CEO Sam Altman made this very point in a recent blog post: we’re approaching the Singularity, he argued, yet for the most part, it feels like business as usual.
