AI voice clones are easier to understand in noisy environments than real humans

Artificial intelligence voice clones tend to be easier to understand in noisy environments than the actual human voices they mimic. This finding provides evidence that synthetic speech technology could significantly improve assistive communication devices for individuals with speech impairments. The research was published in The Journal of the Acoustical Society of America.

Synthetic voices are increasingly a part of daily life, ranging from digital assistants like Siri and Alexa to automated telemarketers and answering machines. With the expansion of generative artificial intelligence, voice clones have emerged as a new type of synthetic speech. Traditional synthetic voices require a voice actor to spend hours in a recording booth. In contrast, artificial intelligence can generate a highly realistic voice clone based on just 10 seconds of recorded audio. This minimal requirement significantly expands the number of potential voices and applications.

People often worry about the societal risks of this technology, such as deepfake audio used for fraud or misinformation. However, the potential benefits of personalized voice synthesis for medical and communication purposes have received less attention. Individuals facing degenerative conditions like Parkinson’s disease or recovering from throat cancer often rely on computers to speak for them. Having a personalized artificial voice helps them retain their personal identity. These assistive devices are most useful when the people around the user can easily understand the generated speech.

Patti Adank, a researcher at University College London, and Han Wang, a researcher at the University of Roehampton, specialize in studying human perception of unclear speech. They were fascinated by the idea of machine-replicated speech and wanted to know how easy these clones are for the average person to understand. Natural voices vary widely in how easy they are to understand due to things like speaking speed, slight hoarseness, or heavy regional accents. The researchers suspected that voice clones would be poor representations of actual human voices and that people would struggle to understand them.

“I thought initially that voice clones would be less intelligible because they were unfamiliar,” Adank said. “I found they were up to 20% more intelligible, which was quite shocking. A small part of our paper is talking about that experiment, and then a large part is me and my collaborator frantically trying to find out what it is that makes those voice clones more intelligible.”

To test the initial intelligibility of these voices, the scientists set up an online experiment with 80 participants. The sample included 40 men and 40 women between the ages of 18 and 35. All participants were native speakers of British English living in the United Kingdom who wore wired headphones to ensure optimal sound quality during the testing.

The scientists started with an existing database of 10 human voices from different regions of England. They extracted about 348 seconds of speech for each person. They fed these short audio clips into ElevenLabs, a popular artificial intelligence voice generation program. This process created 10 fully artificial voice clones that matched the original human speakers.

The authors then generated 80 distinct test sentences designed to evaluate hearing and comprehension. Half of the sentences were spoken by the original humans, and the other half were generated by the artificial intelligence. The researchers mixed all the audio clips with a background sound called speech-shaped noise. This type of noise resembles continuous static and effectively masks the sound of the human voice.

The background static was presented at four distinct volume levels. These levels ranged from louder than the voice to quieter than the voice. Participants listened to the sentences and typed out the exact words they heard. The researchers scored the typed responses to measure how well the listeners comprehended the spoken words in the presence of the static.
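
The relationship between voice level and noise level is commonly expressed as a signal-to-noise ratio (SNR) in decibels. The sketch below shows one way to mix a voice with a masker at a chosen SNR. It is an illustration only, not the researchers' code: the function names, the toy tone standing in for speech, the white noise standing in for speech-shaped noise, and the four SNR values are all assumptions, since the article does not report the levels used.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech sits snr_db decibels above it, then add them."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must be the same length")
    # SNR(dB) = 20 * log10(rms_speech / rms_noise); solve for the target noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy stand-ins (assumptions): a 220 Hz tone as "speech", white noise as the masker.
# Real speech-shaped noise is filtered to match the long-term spectrum of speech.
rate = 16000
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / rate) for t in range(rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(rate)]

# Hypothetical levels: negative SNR means the noise is louder than the voice.
for snr_db in (-6, -3, 0, 3):
    mixed = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(rms(mixed), 3))
```

A negative SNR corresponds to the "louder than the voice" conditions described above, and a positive SNR to the "quieter than the voice" conditions.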

The cloned voices gave listeners a substantial advantage. Participants correctly identified words from the artificial voices 67.5 percent of the time. When listening to the actual human voices, their accuracy dropped to 54.1 percent. This advantage of 13.4 percentage points for the cloned speech remained consistent across all four levels of background noise.
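
Accuracy figures like these come from comparing each typed response with the sentence that was actually played. The article does not describe the exact scoring rules, so the sketch below assumes a simple scheme in which every word of the target sentence counts and each typed word can match only once. The function names and the example sentences are hypothetical.

```python
import string


def normalize(text):
    """Lowercase and strip punctuation so 'Cat,' and 'cat' are treated as the same word."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def words_correct(target, response):
    """Return (hits, total): target words found in the response, each response word used once."""
    remaining = normalize(response)
    target_words = normalize(target)
    hits = 0
    for word in target_words:
        if word in remaining:
            remaining.remove(word)
            hits += 1
    return hits, len(target_words)


def percent_correct(trials):
    """Pool hits over all (target, response) pairs and return a percentage."""
    hits = 0
    total = 0
    for target, response in trials:
        h, t = words_correct(target, response)
        hits += h
        total += t
    return 100.0 * hits / total


# Hypothetical trials, not data from the study.
trials = [
    ("The cat sat on the mat", "the cat sat on a mat"),
    ("She bought fresh bread", "she brought fresh bread"),
]
print(percent_correct(trials))
```

Real scoring schemes often count only selected keywords or tolerate minor spelling errors, so this is one possible approach rather than the method the study used.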

According to the press release accompanying the study, the duo repeated this experiment with different groups to see if the advantage held up. They tested elderly volunteers to determine if being hard of hearing alters the effect. They tested American volunteers to judge if the British accents played a role. They even used a filter designed to mimic cochlear implants. In every case, the voice clones remained easier to understand than the human originals.

Returning to the main study group, participants also completed two subjective rating tasks for the voices. They rated how distinct or sharp each voice sounded on a scale from one to seven. They also rated the strength of each speaker’s regional accent on a similar seven-point scale. Listeners judged the artificial clones to be significantly more distinct than the human originals, while also rating the clones as having a slightly stronger regional accent.

The researchers also wanted to know if people could tell the difference between the real and artificial audio. They asked participants to listen to pairs of identical sentences and pick out the actual human. Listeners identified the real human correctly 70.4 percent of the time. This suggests that while the artificial copies are highly intelligible, they still contain slightly unnatural qualities that give them away as computer-generated.

To figure out why the clones were easier to understand, the scientists analyzed 47 different acoustic properties of the audio files. They used computer software to measure traits like pitch, speech speed, and the harmonic richness of the sound. Pitch refers to the highness or lowness of a sound. Harmonics are the overlapping frequencies that give a voice its unique texture and resonance.

They looked at specific vocal instability markers known as jitter and shimmer. Jitter measures the tiny, involuntary changes in pitch that occur naturally when a human breathes and speaks. Shimmer measures the microscopic variations in loudness from moment to moment. The analysis revealed that the artificial voices lacked these natural micro-fluctuations, resulting in a much smoother, stabilized sound profile.
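
As a rough illustration of what these measures capture, the sketch below computes a widely used "local" form of jitter and shimmer: the average change between consecutive voice cycles, expressed as a percentage of the average value. It assumes the cycle durations and peak amplitudes have already been extracted from a recording. The function name and the example numbers are hypothetical, and the article does not specify which variants of jitter and shimmer the researchers measured.

```python
def local_perturbation(values):
    """Mean absolute change between consecutive cycles, as a percentage of the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    mean_diff = sum(diffs) / len(diffs)
    mean_value = sum(values) / len(values)
    return 100.0 * mean_diff / mean_value


# Hypothetical per-cycle measurements, not data from the study.
human_periods = [0.0100, 0.0102, 0.0099, 0.0101, 0.0098]  # seconds per cycle
clone_periods = [0.0100, 0.0100, 0.0100, 0.0100, 0.0100]
human_peaks = [0.80, 0.76, 0.82, 0.78, 0.81]  # peak amplitude per cycle

print("human jitter (%):", local_perturbation(human_periods))
print("clone jitter (%):", local_perturbation(clone_periods))
print("human shimmer (%):", local_perturbation(human_peaks))
```

Applied to cycle durations the function gives jitter, and applied to peak amplitudes it gives shimmer. A perfectly regular voice scores zero on both.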

The statistical models showed that different acoustic cues predicted comprehension for the two types of audio. For human audio, comprehension relied on formant measures. Formants are the concentrated bands of acoustic energy created by the physical shape of a person’s vocal tract. Listeners depended on these vocal-tract cues to decode the human words.

For the cloned voices, listener comprehension depended mostly on the overall pitch and the smooth harmonic structure. The artificial intelligence appears to boost intelligibility by amplifying the broad, structural elements of the sound. It prioritizes these smooth, stable sound waves rather than copying the exact mouth movements of the original speaker. This acoustic stabilization likely makes it easier for the human brain to separate the voice from the background static.

The study has a few limitations that warrant future exploration. The experiment used highly structured, pre-written sentences that do not mimic natural daily conversations. People tend to speak much more casually in real life, which might change how well the artificial intelligence captures their vocal patterns. The authors suggest that future studies should test conversational speech rather than sentences read aloud.

The main experiment only tested the voices against one specific type of steady static noise. Real-world environments contain many different types of auditory distractions. Future research should evaluate how well these artificial copies perform when mixed with the sound of a busy restaurant or multiple competing talkers. Scientists could also deliberately manipulate the artificial voice settings to see if adding or removing vocal roughness changes listener comprehension.

After examining over 100 acoustic measurements to understand the intelligibility gap, Adank noted that they plan to collaborate with text-to-speech experts to adapt an open-source cloning system for future testing.

“I am now going to try and recreate [the effect] by studying how synthesizers work and how they use digital signal processing to generate those voices, just to get a bit of a handle on this,” Adank said.

The findings highlight a fascinating psychological phenomenon often called the uncanny valley. The computer-generated voices were mathematically optimized for easy hearing, yet listeners still noticed something slightly artificial about them. As the technology improves, developers will need to balance making voices easy to hear with making them sound authentically human. A perfectly smooth voice might be easy to understand, but it might lack the emotional warmth of a real person.

These findings offer immense promise for medical and assistive technologies. Individuals suffering from diseases that steal their ability to speak could bank their voices using artificial intelligence. The resulting communication devices might actually make it easier for them to converse in noisy environments than their original physical voices ever did. Hearing aids could also incorporate this technology to process and enhance incoming speech for the wearer.

The study, “Voice clones are easier to understand in noise than their human originals: The voice cloning intelligibility benefit,” was authored by Patti Adank and Han Wang.
