FreeTalk: A plug-and-play and black-box defense against speech synthesis attacks
- URL: http://arxiv.org/abs/2509.00561v1
- Date: Sat, 30 Aug 2025 17:10:22 GMT
- Title: FreeTalk: A plug-and-play and black-box defense against speech synthesis attacks
- Authors: Yuwen Pu, Zhou Feng, Chunyi Zhou, Jiahao Chen, Chunqiang Hu, Haibo Hu, Shouling Ji
- Abstract summary: We propose a lightweight, robust, plug-and-play privacy preservation method against speech synthesis attacks. Our method generates and adds a frequency-domain perturbation to the original speech to achieve privacy protection and high speech quality.
- Score: 40.22853425929116
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, speech assistants and speech verification systems have been adopted in many fields, bringing great benefit and convenience. However, while we enjoy these speech applications, our speech may be collected by attackers for speech synthesis. For example, an attacker who obtains a piece of a victim's speech can generate inappropriate political statements with the characteristics of the victim's voice, which will greatly damage the victim's reputation. Moreover, with the emergence of zero-shot voice conversion methods, the cost of speech synthesis attacks has been further reduced, posing greater challenges to user voice security and privacy. Some researchers have proposed corresponding privacy-preserving methods, but the existing approaches have non-negligible drawbacks: low transferability and robustness, and high computational overhead. These deficiencies seriously limit the deployment of existing methods in practical scenarios. Therefore, in this paper, we propose a lightweight, robust, plug-and-play privacy preservation method against speech synthesis attacks in a black-box setting. Our method generates and adds a frequency-domain perturbation to the original speech to achieve privacy protection and high speech quality. We then present a data augmentation strategy and a noise smoothing mechanism to improve the robustness of the proposed method. Besides, to reduce the user's defense overhead, we also propose a novel identity-wise protection mechanism: it generates a universal perturbation for one speaker and supports privacy preservation for speech of any length. Finally, we conduct extensive experiments on 5 speech synthesis models, 5 speech verification models, 1 speech recognition model, and 2 datasets. The experimental results demonstrate that our method achieves satisfactory privacy-preserving performance, high speech quality, and utility.
Related papers
- Synthetic Voices, Real Threats: Evaluating Large Text-to-Speech Models in Generating Harmful Audio [63.18443674004945]
This work explores a content-centric threat: exploiting TTS systems to produce speech containing harmful content. We present HARMGEN, a suite of five attacks organized into two families that address these challenges.
arXiv Detail & Related papers (2025-11-14T03:00:04Z) - RoVo: Robust Voice Protection Against Unauthorized Speech Synthesis with Embedding-Level Perturbations [5.777711921986914]
We propose RoVo, a novel proactive defense technique that injects adversarial perturbations into high-dimensional embedding vectors of audio signals. RoVo effectively defends against speech synthesis attacks and also provides strong resistance to speech enhancement models. A user study confirmed that RoVo preserves both naturalness and usability of protected speech.
arXiv Detail & Related papers (2025-05-19T04:14:58Z) - SafeSpeech: Robust and Universal Voice Protection Against Malicious Speech Synthesis [8.590034271906289]
Speech synthesis technology has brought great convenience, while the widespread use of realistic deepfake audio has triggered hazards. Malicious adversaries may collect victims' speech without authorization and clone a similar voice for illegal exploitation. We propose a framework, SafeSpeech, which protects users' audio before uploading by embedding imperceptible perturbations in original speech.
arXiv Detail & Related papers (2025-04-14T03:21:23Z) - Mitigating Unauthorized Speech Synthesis for Voice Protection [7.1578783467799]
Malicious voice exploitation has brought huge hazards to our daily lives.
It is crucial to protect publicly accessible speech data that contains sensitive information, such as personal voiceprints.
We devise Pivotal Objective Perturbation (POP) that applies imperceptible error-minimizing noises on original speech samples.
arXiv Detail & Related papers (2024-10-28T05:16:37Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - Anonymizing Speech: Evaluating and Designing Speaker Anonymization Techniques [1.2691047660244337]
The growing use of voice user interfaces has led to a surge in the collection and storage of speech data.
This thesis proposes solutions for anonymizing speech and evaluating the degree of the anonymization.
arXiv Detail & Related papers (2023-08-05T16:14:17Z) - Anonymizing Speech with Generative Adversarial Networks to Preserve Speaker Privacy [22.84840887071428]
Speaker anonymization aims for hiding the identity of a speaker by changing the voice in speech recordings.
This typically comes with a privacy-utility trade-off between protection of individuals and usability of the data for downstream applications.
We propose to tackle this issue by generating speaker embeddings using a generative adversarial network with Wasserstein distance as cost function.
arXiv Detail & Related papers (2022-10-13T13:12:42Z) - Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z) - Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.