How To Keep AI From Stealing the Sound of Your Voice

A new technology called AntiFake prevents the theft of the sound of your voice by making it more difficult for AI tools to analyze vocal recordings

[Illustration: a robotic hand holds a glass dome containing a sample of a voice.]

Advances in generative artificial intelligence have made synthesized speech sound so authentic that a person can no longer tell whether they are talking to another human or to a deepfake. If a person’s voice is “cloned” by a third party without their consent, malicious actors can use it to deliver any message they want.

This is the flip side of a technology that could be useful for creating digital personal assistants or avatars. The potential for misuse of voice-cloning software is obvious: synthetic voices can easily be abused to mislead others. And just a few seconds of recorded speech are enough to convincingly clone a person’s voice. Anyone who sends even occasional voice messages or speaks on answering machines has already provided the world with more than enough material to be cloned.

Computer scientist and engineer Ning Zhang of the McKelvey School of Engineering at Washington University in St. Louis has developed a new method to prevent unauthorized speech synthesis before it takes place: a tool called AntiFake. Zhang gave a presentation on it at the Association for Computing Machinery’s Conference on Computer and Communications Security in Copenhagen, Denmark, on November 27.

Conventional methods for detecting deepfakes only take effect once the damage has already been done. AntiFake, by contrast, prevents voice data from being synthesized into an audio deepfake in the first place. The tool is designed to beat digital counterfeiters at their own game: it uses techniques similar to those employed by cybercriminals for voice cloning to protect voices from piracy and counterfeiting. The AntiFake project’s source code is freely available.

The anti-deepfake software is designed to make it more difficult for cybercriminals to take voice data and extract the features of a recording that are important for voice synthesis. “The tool uses a technique of adversarial AI that was originally part of the cybercriminals’ toolbox, but now we’re using it to defend against them,” Zhang said at the conference. “We mess up the recorded audio signal just a little bit, distort or perturb it just enough that it still sounds right to human listeners,” while at the same time making it unusable for training a voice clone.
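To make the general idea concrete, the following Python sketch shows adversarial perturbation of audio in its simplest form. It is not AntiFake’s actual implementation: the `speaker_encoder` model, the step count and the distortion budget `epsilon` are illustrative assumptions, standing in for whatever voice-feature extractor a cloning tool might rely on.

```python
# Minimal sketch of the adversarial-perturbation idea (not AntiFake's code):
# nudge a waveform so a speaker-embedding model no longer recognizes the voice,
# while keeping the change small enough to sound natural to human listeners.
# `speaker_encoder` is a hypothetical pretrained model mapping audio to a
# speaker embedding; values such as epsilon and steps are placeholders.
import torch
import torch.nn.functional as F

def protect_voice(waveform: torch.Tensor,
                  speaker_encoder: torch.nn.Module,
                  epsilon: float = 0.002,   # max per-sample distortion
                  steps: int = 100,
                  lr: float = 1e-3) -> torch.Tensor:
    """Return a protected copy of `waveform` (shape [1, num_samples])."""
    speaker_encoder.eval()
    with torch.no_grad():
        target_embedding = speaker_encoder(waveform)  # the voice identity to hide

    delta = torch.zeros_like(waveform, requires_grad=True)  # the perturbation
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        protected = torch.clamp(waveform + delta, -1.0, 1.0)
        embedding = speaker_encoder(protected)
        # Push the perturbed audio's speaker embedding away from the original:
        # minimizing cosine similarity makes cloning tools extract the wrong voice.
        loss = F.cosine_similarity(embedding, target_embedding, dim=-1).mean()
        loss.backward()
        optimizer.step()
        # Keep the perturbation tiny so the recording still sounds right to humans.
        with torch.no_grad():
            delta.clamp_(-epsilon, epsilon)

    return torch.clamp(waveform + delta, -1.0, 1.0).detach()
```

The key constraint in such a sketch is the tiny `epsilon`: it caps how far each audio sample may be moved, which is what keeps the protected recording sounding natural to people while confusing the models that voice-cloning tools depend on.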

Similar approaches already exist to protect works posted online from being copied. For example, images that still look natural to the human eye can carry invisible perturbations in the image file that make them unreadable to machines.

Software called Glaze, for instance, is designed to make images unusable for training large AI models, and other tricks protect photographs against facial recognition. “AntiFake makes sure that when we put voice data out there, it’s hard for criminals to use that information to synthesize our voices and impersonate us,” Zhang said.

Attack methods are constantly improving and becoming more sophisticated, as seen in the current increase in automated cyberattacks on companies, infrastructure and governments worldwide. To ensure that AntiFake can keep up with the constantly changing deepfake landscape for as long as possible, Zhang and his doctoral student Zhiyuan Yu have built their tool so that it is trained against a broad range of possible threats.

Zhang’s lab tested the tool against five modern speech synthesizers. According to the researchers, AntiFake achieved a protection rate of 95 percent, even against unknown commercial synthesizers for which it was not specifically designed. Zhang and Yu also tested the tool’s usability with 24 human participants from different population groups, though further tests and a larger group would be necessary for a representative comparative study.

Ben Zhao, a professor of computer science at the University of Chicago, who was not involved in AntiFake’s development, says that the software, like all digital security systems, will never provide complete protection and will be menaced by the persistent ingenuity of fraudsters. But, he adds, it can “raise the bar and limit the attack to a smaller group of highly motivated individuals with significant resources.”

“The harder and more challenging the attack, the fewer instances we’ll hear about voice-mimicry scams or deepfake audio clips used as a bullying tactic in schools. And that is a great outcome of the research,” Zhao says.

AntiFake can already protect shorter voice recordings against impersonation, the most common means of cybercriminal forgery. The tool’s creators believe that it could be extended to protect longer audio recordings or music from misuse. Currently, users would have to do this themselves, which requires programming skills.

Zhang said at the conference that the intent is to be able to fully protect voice recordings. If this becomes a reality, defenders would be able to exploit a major weakness in the safety-critical use of AI in the fight against deepfakes. But the methods and tools developed will have to be continuously adapted, because cybercriminals will inevitably learn and evolve along with them.

This article originally appeared in Spektrum der Wissenschaft and was reproduced with permission.

Silke Hahn is a technology and information technology editor at Spektrum der Wissenschaft, the German-language sister publication of Scientific American.
Gary Stix, Scientific American's neuroscience and psychology editor, commissions, edits and reports on emerging advances and technologies that have propelled brain science to the forefront of the biological sciences. Developments chronicled in dozens of cover stories, feature articles and news stories, document groundbreaking neuroimaging techniques that reveal what happens in the brain while you are immersed in thought; the arrival of brain implants that alleviate mood disorders like depression; lab-made brains; psychological resilience; meditation; the intricacies of sleep; the new era for psychedelic drugs and artificial intelligence and growing insights leading to an understanding of our conscious selves. Before taking over the neuroscience beat, Stix, as Scientific American's special projects editor, oversaw the magazine's annual single-topic special issues, conceiving of and producing issues on Einstein, Darwin, climate change, nanotechnology and the nature of time. The issue he edited on time won a National Magazine Award. Besides mind and brain coverage, Stix has edited or written cover stories on Wall Street quants, building the world's tallest building, Olympic training methods, molecular electronics, what makes us human and the things you should and should not eat. Stix started a monthly column, Working Knowledge, that gave the reader a peek at the design and function of common technologies, from polygraph machines to Velcro. It eventually became the magazine's Graphic Science column. He also initiated a column on patents and intellectual property and another on the genesis of the ingenious ideas underlying new technologies in fields like electronics and biotechnology. Stix is the author with his wife, Miriam Lacob, of a technology primer called Who Gives a Gigabyte: A Survival Guide to the Technologically Perplexed (John Wiley & Sons, 1999).