Defending Your Voice: Adversarial Attack on Voice Conversion
- URL: http://arxiv.org/abs/2005.08781v3
- Date: Tue, 4 May 2021 15:02:26 GMT
- Title: Defending Your Voice: Adversarial Attack on Voice Conversion
- Authors: Chien-yu Huang, Yist Y. Lin, Hung-yi Lee, Lin-shan Lee
- Abstract summary: We report the first known attempt to perform adversarial attack on voice conversion.
We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended.
It was shown that the speaker characteristics of the converted utterances became clearly different from those of the defended speaker.
- Score: 70.19396655909455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Substantial improvements have been achieved in recent years in voice
conversion, which converts the speaker characteristics of an utterance into
those of another speaker without changing the linguistic content of the
utterance. Nonetheless, the improved conversion technologies also led to
concerns about privacy and authentication. It is thus highly desirable to be
able to prevent one's voice from being improperly utilized with such voice
conversion technologies. We therefore report in this paper the first known
attempt to perform an adversarial attack on voice conversion. We introduce
human-imperceptible noise into the utterances of a speaker whose voice is to
be defended. Given these adversarial examples, voice conversion models cannot
convert other utterances so that they sound as if produced by the defended
speaker. Preliminary experiments were conducted on two currently
state-of-the-art zero-shot voice conversion models. Objective and subjective
evaluation results in both white-box and black-box scenarios are reported. It
was shown that the speaker characteristics of the converted utterances were
clearly different from those of the defended speaker, while the adversarial
examples of the defended speaker remained indistinguishable from the authentic
utterances.
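The defense can be pictured as a small, bounded adversarial perturbation of the defended utterances, optimized against a speaker encoder. Below is a minimal PGD-style sketch of that idea in PyTorch; `spk_enc`, `defended_wav`, and `target_emb` are hypothetical names for illustration, and the paper's actual attack objectives, models, and constraints may differ.
```python
# A minimal sketch, assuming a differentiable speaker encoder `spk_enc`
# that maps a waveform tensor to a speaker embedding. `defended_wav` is
# the utterance to protect and `target_emb` is an embedding of some other
# speaker. This is an illustrative assumption, not the paper's exact method.
import torch
import torch.nn.functional as F

def defend_utterance(spk_enc, defended_wav, target_emb,
                     epsilon=0.005, alpha=5e-4, steps=500):
    """Add a small, bounded perturbation that drags the utterance's speaker
    embedding toward `target_emb`, so voice conversion models extract
    misleading speaker characteristics while the audio sounds unchanged."""
    delta = torch.zeros_like(defended_wav, requires_grad=True)
    for _ in range(steps):
        emb = spk_enc(defended_wav + delta)
        # Pull the embedding toward the (wrong) target speaker.
        loss = 1.0 - F.cosine_similarity(emb, target_emb, dim=-1).mean()
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()   # signed gradient step
            delta.clamp_(-epsilon, epsilon)      # keep the noise imperceptibly small
            delta.grad.zero_()
    return (defended_wav + delta).detach()
```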
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Who is Authentic Speaker [4.822108779108675]
Voice conversion can pose potential social issues when manipulated voices are employed for deceptive purposes.
Identifying the real speakers behind converted voices is a major challenge, since the acoustic characteristics of the source speakers are greatly changed.
This study is conducted with the assumption that certain information from the source speakers persists, even when their voices undergo conversion into different target voices.
arXiv Detail & Related papers (2024-04-30T23:41:00Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Toward Degradation-Robust Voice Conversion [94.60503904292916]
Any-to-any voice conversion technologies convert the vocal timbre of an utterance to that of any speaker, even one unseen during training.
It is difficult to collect clean utterances of a speaker, and they are usually degraded by noise or reverberation.
We report in this paper the first comprehensive study of the robustness of any-to-any voice conversion to such degraded utterances.
arXiv Detail & Related papers (2021-10-14T17:00:34Z)
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.