The Real Danger of Voice Cloning ‘Deepfakes’ and How to Detect Them

Ismail R.
4 min read · Jan 24, 2024


One of the most striking technologies depicted in the Mission: Impossible movies is the voice changer. Ethan Hunt (Tom Cruise) puts on a mask of his victim’s face, speaks naturally in the impersonated voice, and successfully passes as them. Thanks to artificial intelligence (AI), we are getting closer to this reality, at least when it comes to copying voices. There is one difference: voices are not always mimicked with the noble goal of fighting evil, as in the movies. Sometimes the goal is to deceive. These are voice cloning deepfakes.

Voice cloning for criminal purposes has been happening for a few years, and lately it has become more alarming: the technology is getting easier to use, and the truth is that no one is safe.

One of the earliest significant cases of a voice cloning deepfake occurred in 2020. The victim was the director of a Japanese company’s branch in Hong Kong. He received a phone call from someone claiming to be the company’s director. Everything seemed normal: the branch director recognized his superior’s voice, so he had no hesitation in following the instructions he was given. As a result, he transferred 35 million dollars to the scammers.

That happened three years ago; now it is more common. In the spring of 2023, Florida investor Clive Kabatznik was the target of a similar attempt. In Canada, grandparents received a call supposedly from their grandson: the caller alarmed them, claiming to be in jail and in need of cash to pay bail. Fortunately, in both cases the attempts were detected and the scams did not succeed. But the risk exists, because the technology is at our fingertips.

Three seconds are enough to clone a voice

Currently, voice cloning is not exactly like Mission: Impossible. It doesn’t transform one person’s voice into another’s as they speak; instead, it reads a text in a particular person’s voice. This is known as text-to-speech (TTS) synthesis, and it works by identifying voice patterns. We all speak in a unique way, which is why we can recognize each person’s voice. Voice cloning uses neural networks trained to recognize the identifying patterns of each voice and then reproduce them when reading any text.

A significant example of voice cloning AI is Microsoft’s VALL-E. Its neural network was trained on more than 60,000 hours of English audio from over 7,000 different speakers. Its power lies in the fact that it needs only three seconds of recording to clone a voice. There is also a VALL-E X version, which can clone a voice into a language different from the original recording. For now, Microsoft does not openly provide this technology, but all signs point to it becoming available soon.

However, many applications on the internet make voice cloning straightforward. All it takes is about 30 seconds of recording, or reading aloud a short text provided by the application.
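To see just how low the barrier has become, here is a minimal sketch using the open-source Coqui TTS library and its XTTS v2 zero-shot cloning model. This is an illustration, not an endorsement: the model name reflects the library’s published catalog, and the file reference.wav is a hypothetical placeholder for a short clip of the target voice.

```python
# Minimal voice cloning sketch with the open-source Coqui TTS library.
# Assumes `pip install TTS`; "reference.wav" is a hypothetical placeholder
# for a few seconds of recorded speech from the target voice.
from TTS.api import TTS

# Download and load a multilingual zero-shot voice cloning model.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Synthesize arbitrary text in the voice captured by the reference clip.
tts.tts_to_file(
    text="Hi, it's me. I need you to wire the funds today.",
    speaker_wav="reference.wav",  # short sample of the target's voice
    language="en",
    file_path="cloned.wav",
)
```

A handful of lines of code and a short audio sample are all it takes, which is precisely why the barrier to abuse is so low.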

These applications do not emerge for the purpose of ‘voice fraud.’ Their goal is to offer particular voices, including one’s own, for activities such as video animation, video game avatars, parodies, or marketing campaigns. There are also applications with social purposes, such as reading texts aloud for people with dyslexia. Cloning a voice is not inherently criminal; it depends on the purpose of each individual.

How to avoid deception

In all cases, these applications warn of possible fraudulent uses, although they delegate all responsibility for misuse to the user. Before using this type of application, it is advisable to read the legal terms. They are not usually written in friendly language, but they determine our responsibilities and the rights we concede as users. Be vigilant about what data the application collects beyond the recording of your voice, and for what purposes that data will be used. Also keep in mind that what you post may be accessible to third parties outside the application’s own privacy policy.

Another point of attention is the use of a voice without permission (it could be yours, don’t forget). Something similar has already happened with images: Clearview AI trained its facial recognition system on 30 billion images scraped from social networks without the owners’ consent. By the same logic, any voice uploaded to social networks can be used to train other AIs or be cloned.

Unfortunately, we are not good at identifying cloned voices by ear. One way to check whether a voice has been cloned with AI is to use AI itself: there are applications that can detect voice cloning. But we may not always have access to this technology.
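As a rough illustration of the idea behind such detectors, here is a toy sketch that trains a classifier on spectral (MFCC) features extracted from labeled audio clips. This is not a real detector: production systems, often benchmarked on datasets like ASVspoof, use far richer features and models, and the folder and file names below are hypothetical placeholders.

```python
# Toy sketch of spectral-feature-based cloned-voice detection.
# Assumes `pip install librosa scikit-learn` and folders of labeled
# example clips; the paths below are hypothetical placeholders.
import glob
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def mfcc_features(path):
    # Load the audio and summarize it as a mean MFCC vector.
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)

# Hypothetical folders of labeled training clips: 0 = real, 1 = cloned.
X, y = [], []
for path in glob.glob("clips/real/*.wav"):
    X.append(mfcc_features(path)); y.append(0)
for path in glob.glob("clips/cloned/*.wav"):
    X.append(mfcc_features(path)); y.append(1)

clf = LogisticRegression(max_iter=1000).fit(np.array(X), y)

# Score a suspicious recording: estimated probability it is cloned.
prob = clf.predict_proba([mfcc_features("suspicious.wav")])[0, 1]
print(f"Estimated probability of cloning: {prob:.2f}")
```

Even a toy model like this captures the core idea: synthetic speech leaves statistical traces in the audio spectrum that a trained model can pick up, even when our ears cannot.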

Other, more accessible options are based on the natural response to suspecting a scam: verifying with third parties whether a suspicious recording could really come from its supposed owner; contacting the supposedly impersonated person through another channel; or asking the suspicious caller something that only the real person would know. Remember: the voice can be cloned, but the person cannot (at least, not yet).

Written by Ismail R.

Early passion for computers led to a professional focus on aligning business with IT. Balancing academic and practical experience, especially in cybersecurity.
