Audio Adversarial Examples: Attacks Using Vocal Masks
- URL: http://arxiv.org/abs/2102.02417v2
- Date: Sat, 6 Feb 2021 03:31:23 GMT
- Title: Audio Adversarial Examples: Attacks Using Vocal Masks
- Authors: Kai Yuan Tay, Lynnette Ng, Wei Han Chua, Lucerne Loke, Danqi Ye,
Melissa Chua
- Abstract summary: We construct audio adversarial examples on automatic Speech-To-Text systems.
We produce an adversarial example by overlaying an audio vocal mask generated from the original audio.
We apply our audio adversarial attack to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and CMUSphinx.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We construct audio adversarial examples on automatic Speech-To-Text systems.
Given any audio waveform, we produce another by overlaying an audio vocal
mask generated from the original audio. We apply our audio adversarial attack
to five SOTA STT systems: DeepSpeech, Julius, Kaldi, wav2letter@anywhere and
CMUSphinx. In addition, we engaged human annotators to transcribe the
adversarial audio. Our experiments show that these adversarial examples fool
State-Of-The-Art Speech-To-Text systems, yet humans are able to consistently
pick out the speech. The feasibility of this attack introduces a new domain to
study machine and human perception of speech.
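A minimal sketch of the overlay step described in the abstract, assuming the vocal mask is realized as a voice-band filtered copy of the original waveform mixed back in at a small amplitude; the file names, filter band, and mixing weight are illustrative and do not reproduce the paper's actual mask-generation procedure:
```python
import numpy as np
import soundfile as sf
from scipy.signal import butter, sosfilt

# Load the original utterance (hypothetical mono file).
audio, sr = sf.read("original.wav")

# Crude stand-in for the "vocal mask": band-pass the original
# signal to the typical voice band.  The paper's exact mask
# generation is not reproduced here.
sos = butter(4, [300, 3400], btype="bandpass", fs=sr, output="sos")
vocal_mask = sosfilt(sos, audio)

# Overlay the mask on the original waveform to obtain the
# adversarial example, keeping the result in valid range.
epsilon = 0.1  # overlay strength (illustrative)
adversarial = np.clip(audio + epsilon * vocal_mask, -1.0, 1.0)
sf.write("adversarial.wav", adversarial, sr)
```
The adversarial file would then be transcribed by each of the five STT systems and by the human annotators, allowing machine and human perception to be compared.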
Related papers
- Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models [5.942307521138583]
We show that 'special tokens' can be exploited by adversarial attacks to manipulate the model's behavior.
We propose a simple yet effective method to learn a universal acoustic realization of Whisper's $\texttt{<|endoftext|>}$ token.
Experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper ASR model for over 97% of speech samples.
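A hedged sketch of how such a universal segment could be applied at inference time, assuming the 0.64-second adversarial audio has already been learned and saved to disk; the optimization that produces the segment is not shown and the file names are placeholders:
```python
import numpy as np
import soundfile as sf

# Pre-learned universal adversarial segment: 0.64 s at 16 kHz
# is 10,240 samples (assumed to already exist as a WAV file).
prefix, sr = sf.read("universal_mute_prefix.wav")
speech, sr2 = sf.read("victim_utterance.wav")
assert sr == sr2 == 16000, "Whisper-style ASR expects 16 kHz audio"

# The attack is universal: the same fixed segment is prepended to
# any utterance, steering the model toward its end-of-text token
# so that it transcribes (almost) nothing.
attacked = np.concatenate([prefix, speech])
sf.write("attacked_utterance.wav", attacked, sr)
```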
arXiv Detail & Related papers (2024-05-09T22:59:23Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
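A toy fusion block illustrating the kind of conditioning the summary describes, with phoneme encoder states and listener visual features concatenated and projected to a common dimension; the dimensions and architecture are assumptions, not the paper's model:
```python
import torch
import torch.nn as nn

class PhonemeVisualFusion(nn.Module):
    """Concatenate phoneme encoder states with time-aligned listener
    visual features and project back to the decoder dimension.
    All sizes are illustrative."""
    def __init__(self, phoneme_dim=256, visual_dim=128, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(phoneme_dim + visual_dim, out_dim)

    def forward(self, phoneme_states, visual_feats):
        # phoneme_states: (batch, frames, phoneme_dim)
        # visual_feats:   (batch, frames, visual_dim)
        fused = torch.cat([phoneme_states, visual_feats], dim=-1)
        return torch.relu(self.proj(fused))

fusion = PhonemeVisualFusion()
out = fusion(torch.randn(2, 50, 256), torch.randn(2, 50, 128))
print(out.shape)  # torch.Size([2, 50, 256])
```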
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- FOOCTTS: Generating Arabic Speech with Acoustic Environment for Football Commentator [8.89134799076718]
The application gets the text from the user, applies text pre-processing such as vowelization, followed by the commentator's speech synthesizer.
Our pipeline included Arabic automatic speech recognition for data labeling, CTC segmentation, transcription vowelization to match speech, and fine-tuning the TTS.
arXiv Detail & Related papers (2023-06-07T12:33:02Z)
- Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection [15.884911752869437]
We present a novel approach for synthetic speech detection, exploiting the combination of two high-level semantic properties of the human voice.
On one side, we focus on speaker identity cues and represent them as speaker embeddings extracted using a state-of-the-art method for the automatic speaker verification task.
On the other side, voice prosody, intended as variations in rhythm, pitch or accent in speech, is extracted through a specialized encoder.
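A small, self-contained sketch of combining the two cues for detection, using random vectors in place of real speaker and prosody embeddings and a logistic-regression detector; the embedding dimensions and classifier are assumptions, not the paper's models:
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for pre-extracted utterance-level features:
# a speaker-verification embedding and a prosody embedding.
speaker_emb = rng.normal(size=(200, 192))
prosody_emb = rng.normal(size=(200, 64))
labels = rng.integers(0, 2, size=200)  # 0 = bona fide, 1 = synthetic

# Fuse the two high-level cues and train a simple detector.
features = np.concatenate([speaker_emb, prosody_emb], axis=1)
clf = LogisticRegression(max_iter=1000).fit(features, labels)
print("training accuracy:", clf.score(features, labels))
```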
arXiv Detail & Related papers (2022-10-31T11:03:03Z)
- Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos [54.08224321456871]
The system combines multiple component models to produce a video of the original speaker speaking in the target language.
The pipeline starts with automatic speech recognition including emphasis detection, followed by a translation model.
The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model.
arXiv Detail & Related papers (2022-06-09T14:15:37Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition [60.84668086976436]
An unsupervised text-to-speech synthesis (TTS) system learns to generate the speech waveform corresponding to any written sentence in a language.
This paper proposes an unsupervised TTS system by leveraging recent advances in unsupervised automatic speech recognition (ASR).
Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each.
arXiv Detail & Related papers (2022-03-29T17:57:53Z)
- "Hello, It's Me": Deep Learning-based Speech Synthesis Attacks in the Real World [14.295573703789493]
Advances in deep learning have introduced a new wave of voice synthesis tools, capable of producing audio that sounds as if spoken by a target speaker.
This paper documents efforts and findings from a comprehensive experimental study on the impact of deep-learning based speech synthesis attacks on both human listeners and machines.
We find that both humans and machines can be reliably fooled by synthetic speech and that existing defenses against synthesized speech fall short.
arXiv Detail & Related papers (2021-09-20T14:53:22Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.