Beyond ChatGPT: How are AI Voices Made?

AI Voices are here to stay… Well, so they say… But how are AI voices made anyway?

It’s a hot topic these days. Many see AI voices as an existential threat to voice actors. To understand the role AI may play and the impact it could have on the industry, it’s important to understand how they actually work.

AI voices are synthetic voices generated by deep learning algorithms and neural networks – the same family of technology behind ChatGPT. Unlike traditional text-to-speech systems, which produce robotic-sounding speech lacking natural intonation, stress, and emotion, AI-generated voices have become increasingly sophisticated in imitating human speech patterns and inflexions.

In other words, AI voices sound more human-like and less robotic than their predecessors. This has made them attractive to businesses and individuals looking for cheaper ways to create voiceovers. But how is this even possible?

This article will delve into what Neural Networks are, how Deep Learning is used to create AI, and how AI Voices are made.

Deus Ex Machina: Imitating the human brain

The human mind remains a great mystery. It’s the most complex biological machine in the known universe, enabling us to learn, adapt, identify, and respond to virtually any context.

Our intelligence is deeply rooted in the countless layers of interconnected neurons that constitute our brain – providing a natural model for developing artificial intelligence. Neural Networks are AI structures loosely modelled on the way those neurons connect.

The basics of Machine Learning and Algorithms

Machine learning and algorithms underpin much of the Internet. Computers rely on algorithms to make decisions, and modern algorithms are characterized by intricate logic and rules. Simply put, many of these algorithms are far too complex for humans to design by hand.

This is where machine learning comes in. Machines can automate the process of matching user inputs to outputs with incredible efficiency. For example, a Google search takes the user's query as its input, processes it to determine the results and their ranking, and then displays those results as the output.

The input and output are seen by the user, whereas the processing layer is hidden in the background – where the machine makes the magic happen.

Figure: Basic diagram of a neural network, showing the input layer, hidden processing layer, and output layer.
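To make the input, hidden, and output layers a little more concrete, here is a minimal sketch in Python. The layer sizes and every weight value are hypothetical and hand-picked purely for illustration; a real network would learn its weights from data and would be vastly larger.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: each neuron takes a weighted sum of its inputs, then squashes it."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Pass the inputs through the hidden layer, then through the output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical weights: 2 inputs -> 3 hidden neurons -> 1 output neuron.
HIDDEN_W = [[0.5, -0.6], [0.3, 0.8], [-0.7, 0.2]]
HIDDEN_B = [0.0, 0.1, -0.1]
OUTPUT_W = [[1.0, -1.0, 0.5]]
OUTPUT_B = [0.0]

output = forward([1.0, 0.0], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
print(output)  # a single number between 0 and 1
```

The user only ever sees the list passed in and the number that comes out; the `hidden` values in the middle are the hidden processing layer.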

As technology has advanced, machine learning has become increasingly complex. Deep learning models have become the standard, with even simple Google searches now relying heavily on neural networks.

These building blocks pave the way for AI voice generation.

Deep Learning

Deep Learning is a powerful method for training artificial Neural Networks to learn from data. By leveraging the interconnected nodes in the network, the algorithms can excel in complex areas like natural language processing and realistic human speech generation. Generally, the more data fed to the AI, the better it becomes at its job.

For example, if an AI is designed to identify dogs, it may be fed pictures of dogs and cats. As the AI predicts whether a picture is of a dog, its answers are checked against labelled data that provides the correct answers. The settings that produce more correct answers are kept and become the template of the AI.

With each iteration, the AI becomes better at telling a dog from a cat, incrementally refining its understanding of what a dog looks like – to the point where it can give the answer almost instantly.

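This improvement is typically driven by a process known as backpropagation: the network measures how wrong each answer was and works backwards through its layers, nudging each connection to reduce the error. This allows the neural network to learn and improve over time. Improving these neural networks requires immense amounts of data to hone their knowledge further, which is why companies are so focused on data collection.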
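The guess, check, and adjust loop described above can be sketched with a single artificial neuron. This is a simplified illustration, not how any production system is built: the two features (snout length and ear pointiness), the six data points, and the training settings are all made up for the example. Backpropagation applies the same kind of error-driven update, via the chain rule, to every layer of a much larger network.

```python
import math

def sigmoid(x):
    """Squash any number into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical labelled data: (snout_length, ear_pointiness) -> 1 = dog, 0 = cat.
DATA = [
    ((0.9, 0.2), 1),
    ((0.8, 0.3), 1),
    ((0.7, 0.1), 1),
    ((0.2, 0.9), 0),
    ((0.3, 0.8), 0),
    ((0.1, 0.7), 0),
]

def predict(weights, bias, features):
    """Return the neuron's confidence (0..1) that the features describe a dog."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(data, epochs=2000, learning_rate=0.5):
    """Repeatedly guess, compare with the correct answer, and nudge the weights."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in data:
            guess = predict(weights, bias, features)
            error = guess - label  # positive if the guess was too high, negative if too low
            # Move each weight a small step in the direction that reduces the error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias

weights, bias = train(DATA)
dog_score = predict(weights, bias, (0.85, 0.15))  # an unseen dog-like example
cat_score = predict(weights, bias, (0.15, 0.85))  # an unseen cat-like example
print(dog_score, cat_score)
```

Nobody writes the final weights by hand; they emerge from thousands of small corrections, which is why more (and better) labelled data tends to produce a better model.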

One way to picture this is as a competition: each iteration of the AI is held to a higher standard, forcing the neural network to advance incrementally against other versions – competing for the right to succeed the current bot (and to avoid destruction ☹).

Crucially, the AI is self-learning, meaning that the algorithmic tweaks are made by the AI itself based on its performance in tests. No one fully understands how the finished bots are wired internally. The AI can be influenced with new data, but the actual layers of relationships in the neural network are too complex to untangle – much like a human brain.

This is the same Deep Learning logic used in replicating voices.

YouTuber CGP Grey’s video on Deep Learning & Neural Networks captures this concept perfectly.

How AI Voices are Created

Neural Networks created through Deep Learning methodology are capable of artificially constructing voices, capturing the basic patterns of human speech. The AI sifts through immense amounts of data – countless hours of audio of human speech – to break down the vocal characteristics of how people speak. With enough training, the Neural Network can replicate the slight intonations of speech with startling accuracy.

From here, it’s just a matter of a user inputting the text they want to be spoken. The AI processes this text, draws on the speech patterns it learned during training, and provides audio as the output.
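The text-in, audio-out flow can be sketched as a short pipeline. This is a toy under heavy assumptions: the three stage names are hypothetical, and the middle stage simply assigns a pitch to each letter where a real system would use a trained neural network to predict detailed sound features. The output is a list of raw audio samples, not convincing speech.

```python
import math

SAMPLE_RATE = 8000       # audio samples per second (kept low for brevity)
SAMPLES_PER_TOKEN = 400  # 0.05 seconds of audio per token at 8 kHz

def text_to_tokens(text):
    """Stage 1: tidy the text into simple tokens (real systems often use phonemes)."""
    return [ch for ch in text.lower() if "a" <= ch <= "z" or ch == " "]

def tokens_to_pitches(tokens):
    """Stage 2 (stand-in): a trained model would predict rich sound features here.
    This toy just picks one pitch in Hz per letter, and silence for spaces."""
    return [0.0 if t == " " else 200.0 + 10.0 * (ord(t) - ord("a")) for t in tokens]

def pitches_to_waveform(pitches):
    """Stage 3 (stand-in): turn the features into audio samples between -1 and 1."""
    samples = []
    for hz in pitches:
        for i in range(SAMPLES_PER_TOKEN):
            samples.append(math.sin(2 * math.pi * hz * i / SAMPLE_RATE))
    return samples

def synthesize(text):
    """Run the whole pipeline: text -> tokens -> sound features -> audio samples."""
    return pitches_to_waveform(tokens_to_pitches(text_to_tokens(text)))

audio = synthesize("Hello world")
print(len(audio), "samples,", len(audio) / SAMPLE_RATE, "seconds")
```

In an AI voice, the second and third stages are the parts learned from hours of recorded speech, which is where the natural intonation comes from.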

The more data fed to the AI, the more capable the algorithm becomes at copying speech realistically.

Cloning Human Voices

AI voiceover’s threat to voice actors is about more than the popularity of these tools. Some AIs dedicated to constructing believable human voices also have the capability to deconstruct existing voices and incorporate them into the final product. That is to say, they can listen to a voice actor and use their voice as the AI’s voice – effectively making it incredibly easy to steal actors’ voices.

It goes without saying that this could be devastating to the voiceover industry. Voice actors use their voices as their unique selling point (USP) – their speech is their trade. Realistic mimicry of this could undermine countless actors’ abilities to work.

This is far from some futuristic possibility – it is happening now. Voice actors are having their voices taken and repackaged – to sell voice over at prices so low that no voice actor could realistically compete with them.

The very nature of a Deep Learning Neural Network means that this will get easier to do as the AI learns. Exposure to more data, in this case speech, enhances the AI’s capacity to deconstruct a voice into its core vocal traits, to the point where a person’s speech can be simulated from mere seconds of audio input. This is why Voquent is constantly evolving to protect your data on our platform.

Even if these AI copies aren’t perfect now, they very well could be one day.

Frightening for those of us who rely on human voices.

Where Human Voices Shine

It’s not all doom and gloom! Voices are far more than just a coat of paint; they are the lifeblood of a message.

Whilst recent developments have made AI voices impressive, that doesn’t mean they are perfect. In fact, they are limited by their own definition: artificiality. AI voices lack true human authenticity in all the facets that matter.

Emotion & Expression

At the end of the day, an AI lacks some crucial elements that make a human voice over unique.

Artificial voices have made strides in copying speech, but they are still heavily limited wherever emotion and expression are concerned. Emotion in vocal performances is incredibly subtle: sadness and sorrow can be woven into dialogue that is meant to inspire or frighten, and these emotions blend together in ways that an AI may struggle to replicate.

Audiences need to pick up on how a character feels without it being spelled out, and conveying that is something only human voice actors can do – sorry robots…

After all, Artificial Intelligence isn’t sentient … Yet… 👀

Natural Variability

Even if an AI voiceover becomes capable of applying human emotion to a performance, understanding how and when to apply different degrees of various emotions is quite another level.

An algorithm may be well-attuned to applying unique vocal sounds – but that doesn’t account for the singular variability that each person has. No two people are the same, nor is the way they talk. No AI yet developed can map out the exact way that each voice actor speaks with ultimate precision.

Creativity / Interpretation

There are countless examples of voice actors owning a role to the point of becoming synonymous with the character. Breathing life into a role and capturing how a character may sound in different scenarios is something a writer or director may not be able to anticipate.

Workshopping different phrasing and becoming one with the character is part of the acting process. This isn’t currently possible for AI – which is completely input-output based and has no personality of its own to bring to a role.

Contextual Understanding

The context of a scene is crucial to inform a performance. AI voices can’t capture the nuances of a voice in emotionally complicated circumstances. Tapping into the core essence of a person is unique to actual people – existing neural networks can’t currently achieve this level of refinement.

Conclusion

Tapping into the essence of humanity to adapt to new situations is simply outside of the realm of AI. The adaptability of a person outweighs even the most sophisticated Neural Network by a long shot. While these voices may seem realistic, they are not real. Full stop.

A voice is more than just the scientific production of utterances to match the thoughts in a biological neural network. It is the expression of the essence of your being – the passing of your thoughts, feelings, and perception all rolled up into unpredictable speech patterns, ever-changing to match evolving interactions.

Your voice is more than the sum of your parts – Voquent sees this and we will continue to champion human-centric voice over and its place in the industry.

By Michael Sum

Marketing Specialist and resident Content Monkey at Voquent. Michael has a lifelong passion for gaming media and bases his personality on whatever game he's currently playing.
