"Hello, It's Me": Deep Learning-based Speech Synthesis Attacks in the
Real World
- URL: http://arxiv.org/abs/2109.09598v1
- Date: Mon, 20 Sep 2021 14:53:22 GMT
- Title: "Hello, It's Me": Deep Learning-based Speech Synthesis Attacks in the
Real World
- Authors: Emily Wenger, Max Bronckers, Christian Cianfarani, Jenna Cryan, Angela
Sha, Haitao Zheng, Ben Y. Zhao
- Abstract summary: Advances in deep learning have introduced a new wave of voice synthesis tools, capable of producing audio that sounds as if spoken by a target speaker.
This paper documents efforts and findings from a comprehensive experimental study on the impact of deep-learning based speech synthesis attacks on both human listeners and machines.
We find that both humans and machines can be reliably fooled by synthetic speech and that existing defenses against synthesized speech fall short.
- Score: 14.295573703789493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in deep learning have introduced a new wave of voice synthesis
tools, capable of producing audio that sounds as if spoken by a target speaker.
If successful, such tools in the wrong hands will enable a range of powerful
attacks against both humans and software systems (aka machines). This paper
documents efforts and findings from a comprehensive experimental study on the
impact of deep-learning based speech synthesis attacks on both human listeners
and machines such as speaker recognition and voice-signin systems. We find that
both humans and machines can be reliably fooled by synthetic speech and that
existing defenses against synthesized speech fall short. These findings
highlight the need to raise awareness and develop new protections against
synthetic speech for both humans and machines.
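As a minimal illustration of the machine-side threat, the sketch below checks whether a synthetic utterance lands close enough to a target speaker's enrollment audio to pass an embedding-based verifier. It assumes the open-source Resemblyzer encoder; the file names and the 0.75 decision threshold are hypothetical, not values from the paper.

```python
# Minimal sketch: does a synthetic utterance match a target speaker's
# voiceprint closely enough to pass an embedding-based verifier?
# Assumes the open-source Resemblyzer encoder; the 0.75 threshold is a
# hypothetical operating point, not a value from the paper.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()
enroll = encoder.embed_utterance(preprocess_wav("target_enroll.wav"))
probe = encoder.embed_utterance(preprocess_wav("synthetic_probe.wav"))

# Resemblyzer embeddings are L2-normalized, so the dot product is the
# cosine similarity between the two voiceprints.
similarity = float(np.dot(enroll, probe))
print(f"cosine similarity: {similarity:.3f}")
print("verifier fooled" if similarity > 0.75 else "verifier not fooled")
```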
Related papers
- Syn-Att: Synthetic Speech Attribution via Semi-Supervised Unknown Multi-Class Ensemble of CNNs [1.262949092134022]
A novel strategy is proposed to attribute a synthetic speech track to the generator used to synthesize it.
The proposed detector transforms the audio into a log-mel spectrogram, extracts features using a CNN, and classifies the track among five known algorithms and an unknown class.
The method outperforms other top teams in accuracy by 12-13% on Eval 2 and 1-2% on Eval 1, in the IEEE SP Cup challenge at ICASSP 2022.
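A rough sketch of the pipeline described above: the audio becomes a log-mel spectrogram, which is fed to a small CNN with six outputs (five known generators plus the unknown class). The layer sizes are illustrative assumptions, not the actual Syn-Att model.

```python
# Sketch: audio -> log-mel spectrogram -> CNN -> 6-way classification
# (five known generators + one "unknown" class). Layer sizes are
# illustrative; they are not the actual Syn-Att architecture.
import librosa
import torch
import torch.nn as nn

y, sr = librosa.load("track.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
logmel = librosa.power_to_db(mel)                    # shape (64, frames)
x = torch.tensor(logmel[None, None], dtype=torch.float32)

cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                         # handles any length
    nn.Flatten(), nn.Linear(32, 6),                  # 5 known + unknown
)
print("predicted class:", cnn(x).argmax(dim=1).item())
```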
arXiv Detail & Related papers (2023-09-15T04:26:39Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed text-to-speech architecture is designed for multiple code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
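The core vector-quantization step can be illustrated as a nearest-neighbor lookup into a learned codebook, turning continuous acoustic frames into discrete code indices. The codebook size and feature dimension below are arbitrary assumptions.

```python
# Toy illustration of vector quantization: each continuous acoustic
# frame is replaced by the index of its nearest codebook entry.
# Codebook size and feature dimension are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 80))   # 256 learned codes, 80-dim frames
frames = rng.normal(size=(100, 80))     # 100 frames of acoustic features

# Squared Euclidean distance from every frame to every code.
d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = d.argmin(axis=1)                # discrete token sequence
quantized = codebook[codes]             # what a decoder would consume
print(codes[:10])
```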
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- Deepfake audio detection by speaker verification [79.99653758293277]
We propose a new detection approach that leverages only the biometric characteristics of the speaker, with no reference to specific manipulations.
The proposed approach can be implemented based on off-the-shelf speaker verification tools.
We test several such solutions on three popular test sets, obtaining good performance, high generalization ability, and high robustness to audio impairment.
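In that spirit, the sketch below repurposes an off-the-shelf speaker-verification model as a deepfake detector: a clip claiming to come from a known speaker is scored against that speaker's genuine reference audio, and a low verification score is treated as evidence of manipulation. It assumes SpeechBrain's pretrained ECAPA-TDNN verifier and hypothetical file names.

```python
# Sketch: repurpose an off-the-shelf speaker verifier as a deepfake
# detector. A clip that claims to be speaker X but scores poorly against
# X's genuine reference audio is flagged as likely synthetic.
# Assumes SpeechBrain's pretrained ECAPA-TDNN verification model.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_ecapa"
)
score, same_speaker = verifier.verify_files("reference_real.wav", "test_clip.wav")
print(f"verification score: {score.item():.3f}")
print("likely genuine" if bool(same_speaker) else "likely deepfake")
```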
arXiv Detail & Related papers (2022-09-28T13:46:29Z)
- Building African Voices [125.92214914982753]
This paper focuses on speech synthesis for low-resourced African languages.
We create a set of general-purpose instructions on building speech synthesis systems with minimal technological resources.
We release the speech data, code, and trained voices for 12 African languages to support researchers and developers.
arXiv Detail & Related papers (2022-07-01T23:28:16Z)
- Simple and Effective Unsupervised Speech Synthesis [97.56065543192699]
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.
Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus.
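The lexicon's role can be illustrated as a simple lookup that maps input text to a phoneme sequence, giving a shared representation that can be aligned with units discovered in unlabeled audio. The entries and OOV handling below are toy assumptions.

```python
# Toy illustration of the lexicon's role: text is mapped to phonemes,
# a representation that can be matched against units discovered in
# unlabeled audio. Entries and OOV handling are assumptions.
lexicon = {
    "hello": ["HH", "AH", "L", "OW"],
    "it's": ["IH", "T", "S"],
    "me": ["M", "IY"],
}

def phonemize(text):
    phones = []
    for word in text.lower().split():
        phones.extend(lexicon.get(word, ["<unk>"]))  # OOV -> unknown token
    return phones

print(phonemize("Hello it's me"))  # ['HH', 'AH', 'L', 'OW', 'IH', ...]
```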
arXiv Detail & Related papers (2022-04-06T00:19:13Z)
- LaughNet: synthesizing laughter utterances from waveform silhouettes and a single laughter example [55.10864476206503]
We propose a model called LaughNet for synthesizing laughter by using waveform silhouettes as inputs.
The results show that LaughNet can synthesize laughter utterances with moderate quality and retain the characteristics of the training example.
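A waveform silhouette can be approximated as a frame-wise peak-amplitude envelope, as in the sketch below. Treating the silhouette this way is an illustrative assumption, not LaughNet's exact input definition.

```python
# Sketch: approximate a "waveform silhouette" as the frame-wise peak
# amplitude envelope of a signal. This is an illustrative assumption,
# not LaughNet's exact input definition.
import numpy as np
import librosa

y, sr = librosa.load("laugh_example.wav", sr=16000)
frame = 256
n = len(y) // frame * frame                         # trim to whole frames
silhouette = np.abs(y[:n]).reshape(-1, frame).max(axis=1)  # one value/frame
print(silhouette.shape)  # coarse amplitude outline to condition synthesis on
```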
arXiv Detail & Related papers (2021-10-11T00:45:07Z)
- Using Deep Learning Techniques and Inferential Speech Statistics for AI Synthesised Speech Recognition [0.0]
We propose a model that can help discriminate synthesized speech from actual human speech and also identify the source of the synthesis.
The model outperforms the state-of-the-art approaches by classifying the AI synthesized audio from real human speech with an error rate of 1.9% and detecting the underlying architecture with an accuracy of 97%.
arXiv Detail & Related papers (2021-07-23T18:43:10Z)
- Audio Adversarial Examples: Attacks Using Vocal Masks [0.0]
We construct audio adversarial examples on automatic Speech-To-Text systems.
We produce adversarial examples by overlaying an audio vocal mask generated from the original audio.
We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx.
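The overlay step can be sketched as mixing the original waveform with a mask signal at a chosen level. How the mask is generated is paper-specific, so the precomputed mask file and the 0.1 mixing weight below are assumptions.

```python
# Sketch of the overlay step: mix the original utterance with a vocal
# mask at a chosen level. The mask file and the 0.1 mixing weight are
# assumptions; generating the mask itself is paper-specific.
import numpy as np
import soundfile as sf

audio, sr = sf.read("original.wav")
mask, _ = sf.read("vocal_mask.wav")
mask = np.resize(mask, audio.shape)            # tile/trim mask to match

adversarial = audio + 0.1 * mask               # small, hard-to-notice overlay
adversarial /= max(1.0, np.abs(adversarial).max())  # avoid clipping
sf.write("adversarial.wav", adversarial, sr)
```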
arXiv Detail & Related papers (2021-02-04T05:21:10Z)
- Speech Synthesis as Augmentation for Low-Resource ASR [7.2244067948447075]
Speech synthesis might hold the key to low-resource speech recognition.
Data augmentation techniques have become an essential part of modern speech recognition training.
Speech synthesis techniques have been rapidly getting closer to the goal of achieving human-like speech.
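The augmentation idea is simple to sketch: synthesize audio for transcripts the ASR training set lacks and append the pairs to the training manifest. `synthesize` below is a hypothetical stand-in for any trained TTS model.

```python
# Sketch of TTS-based augmentation: synthesize audio for extra
# transcripts and append (audio, text) pairs to an ASR training manifest.
import json
import numpy as np
import soundfile as sf

def synthesize(text):
    # Hypothetical stand-in for a trained TTS model; returns 1 s of
    # silence at 16 kHz so the sketch runs end to end.
    return np.zeros(16000, dtype="float32"), 16000

extra_texts = ["hello it's me", "please verify my account"]
with open("train_manifest.jsonl", "a") as manifest:
    for i, text in enumerate(extra_texts):
        wav, sr = synthesize(text)
        path = f"synth_{i}.wav"
        sf.write(path, wav, sr)
        manifest.write(json.dumps({"audio": path, "text": text}) + "\n")
```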
arXiv Detail & Related papers (2020-12-23T22:19:42Z)
- Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms that humans rely on when it comes to filtering speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
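One way to make such a framework concrete is a minimal action-value loop: the robot picks a head orientation, receives a self-supervised reward (e.g., the strength of a detected speech or face signal), and updates its estimates. Everything below, including the reward function, is a hypothetical simplification, not the iCub pipeline.

```python
# Minimal action-value sketch: choose a head orientation, get a
# self-supervised reward, update value estimates. The reward here is a
# hypothetical simulation with the speaker near +30 degrees.
import random

angles = [-60, -30, 0, 30, 60]             # candidate head orientations
value = {a: 0.0 for a in angles}
count = {a: 0 for a in angles}

def self_supervised_reward(angle):          # hypothetical stand-in
    return max(0.0, 1.0 - abs(angle - 30) / 90) + random.gauss(0, 0.05)

for step in range(500):
    if random.random() < 0.1:               # epsilon-greedy exploration
        a = random.choice(angles)
    else:
        a = max(angles, key=value.get)
    r = self_supervised_reward(a)
    count[a] += 1
    value[a] += (r - value[a]) / count[a]   # incremental mean update

print("learned best orientation:", max(angles, key=value.get))
```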
arXiv Detail & Related papers (2020-11-12T18:02:15Z)
- Detection of AI-Synthesized Speech Using Cepstral & Bispectral Statistics [0.0]
We propose an approach to distinguish human speech from AI synthesized speech.
Higher-order statistics show less correlation for human speech than for synthesized speech.
Cepstral analysis also reveals a durable power component in human speech that is missing from synthesized speech.
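The cepstral cue can be reproduced in a few lines: compute the real cepstrum (the inverse FFT of the log magnitude spectrum) and compare a simple statistic between a human and a synthesized clip. The file names and the low-quefrency energy statistic below are illustrative assumptions, not the paper's exact feature set.

```python
# Sketch of the cepstral cue: the real cepstrum is the inverse FFT of
# the log magnitude spectrum. The low-quefrency energy statistic is an
# illustrative choice, not the paper's exact feature set.
import numpy as np
import soundfile as sf

def real_cepstrum(x, eps=1e-10):
    return np.fft.irfft(np.log(np.abs(np.fft.rfft(x)) + eps))

for name in ("human.wav", "synthetic.wav"):
    x, _ = sf.read(name)
    if x.ndim > 1:
        x = x.mean(axis=1)              # downmix stereo to mono
    c = real_cepstrum(x)
    # Energy in low-quefrency coefficients, where the paper reports a
    # durable power component for human speech.
    print(name, float(np.sum(c[1:50] ** 2)))
```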
arXiv Detail & Related papers (2020-09-03T21:29:41Z)