HLTCOE JHU Submission to the Voice Privacy Challenge 2024
- URL: http://arxiv.org/abs/2409.08913v2
- Date: Tue, 17 Sep 2024 14:39:44 GMT
- Title: HLTCOE JHU Submission to the Voice Privacy Challenge 2024
- Authors: Henry Li Xinyuan, Zexin Cai, Ashi Garg, Kevin Duh, Leibny Paola García-Perera, Sanjeev Khudanpur, Nicholas Andrews, Matthew Wiesner
- Abstract summary: We present a number of systems for the Voice Privacy Challenge.
We find that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios.
We propose a random admixture system which seeks to balance out the strengths and weaknesses of the two categories of systems.
- Score: 31.94758615908198
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a number of systems for the Voice Privacy Challenge, including voice conversion based systems such as the kNN-VC method and the WavLM Voice Conversion method, and text-to-speech (TTS) based systems including Whisper-VITS. We found that while voice conversion systems better preserve emotional content, they struggle to conceal speaker identity in semi-white-box attack scenarios; conversely, TTS methods perform better at anonymization and worse at emotion preservation. Finally, we propose a random admixture system which seeks to balance out the strengths and weaknesses of the two categories of systems, achieving a strong EER of over 40% while maintaining UAR at a respectable 47%.
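The kNN-VC system named in the abstract converts a voice by replacing every frame of the source utterance's self-supervised (WavLM) features with an average of its nearest neighbours in a pool of target-speaker features, and the proposed admixture system then mixes VC and TTS outputs per utterance. A minimal sketch of both ideas follows; the feature extractor and vocoder are abstracted away, and k=4, the 0.5 mixing probability, and all function names are illustrative assumptions, not the authors' exact configuration.

```python
# Hedged sketch of the frame-matching step at the core of kNN-VC and of
# the random admixture idea; not the authors' implementation.
import numpy as np

def knn_vc_match(src_feats: np.ndarray, tgt_pool: np.ndarray, k: int = 4) -> np.ndarray:
    """Replace each source frame with the mean of its k nearest
    target-speaker frames under cosine similarity.

    src_feats: (T, D) self-supervised features of the source utterance.
    tgt_pool:  (N, D) feature pool gathered from the target speaker.
    """
    # L2-normalise so a plain dot product equals cosine similarity.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_pool / np.linalg.norm(tgt_pool, axis=1, keepdims=True)
    sims = src @ tgt.T                           # (T, N) similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]      # k best matches per frame
    return tgt_pool[topk].mean(axis=1)           # average the raw target frames

def random_admixture(utt, vc_system, tts_system, p_tts: float = 0.5, rng=None):
    """Anonymise each utterance with either the VC or the TTS branch,
    chosen at random, trading anonymity (EER) against emotion (UAR)."""
    rng = rng or np.random.default_rng()
    return tts_system(utt) if rng.random() < p_tts else vc_system(utt)
```

The matched features would then be fed to a vocoder trained on the same feature space to produce the anonymized waveform.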
Related papers
- VocalCrypt: Novel Active Defense Against Deepfake Voice Based on Masking Effect [2.417762825674103]
Rapid advances in AI voice cloning, fueled by machine learning, have significantly impacted the text-to-speech (TTS) and voice conversion (VC) fields.
We propose a novel active defense method, VocalCrypt, which embeds pseudo-timbre (jamming information) based on SFS into audio segments that are imperceptible to the human ear.
In comparison to existing methods, such as adversarial noise incorporation, VocalCrypt significantly enhances robustness and real-time performance.
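VocalCrypt's actual SFS-based pseudo-timbre design is not spelled out in this summary, so the following is only a toy illustration of the general masking idea it relies on: a weak jamming signal is kept well below the local speech energy so that auditory masking hides it from listeners while it still perturbs downstream cloning models. The frame size, margin, and noise type are all assumptions.

```python
# Toy stand-in for masking-based jamming, not VocalCrypt itself: embed
# wideband noise held ~30 dB under the local speech energy per frame.
import numpy as np

def embed_jamming(speech: np.ndarray, frame: int = 400,
                  margin_db: float = 30.0, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    out = np.asarray(speech, dtype=float).copy()
    for start in range(0, len(out) - frame, frame):
        seg = out[start:start + frame]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-8          # local masker level
        jam = rng.standard_normal(frame)
        jam *= rms * 10 ** (-margin_db / 20) / (np.sqrt(np.mean(jam ** 2)) + 1e-8)
        out[start:start + frame] = seg + jam             # jamming stays masked
    return out
```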
arXiv Detail & Related papers (2025-02-14T17:43:01Z)
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
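A minimal PyTorch sketch of one cross-attention fusion block of the kind this approach stacks at several encoder layers is shown below; the dimensions, head count, and residual wiring are assumptions rather than the paper's exact architecture.

```python
# Sketch of a cross-modal fusion block: each stream queries the other
# modality, then keeps a residual path. Hyperparameters are assumed.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        a, _ = self.a2v(audio, video, video)   # audio attends to video
        v, _ = self.v2a(video, audio, audio)   # video attends to audio
        return self.norm_a(audio + a), self.norm_v(video + v)

# audio: (B, Ta, D), video: (B, Tv, D); apply after matching encoder layers.
```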
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- V-Cloak: Intelligibility-, Naturalness- & Timbre-Preserving Real-Time Voice Anonymization [0.0]
We develop a voice anonymization system, named V-Cloak, which attains real-time voice anonymization.
Our designed anonymizer features a one-shot generative model that modulates the features of the original audio at different frequency levels.
Experiment results confirm that V-Cloak outperforms five baselines in terms of anonymity performance.
arXiv Detail & Related papers (2022-10-27T02:58:57Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute for the content encoder of VC, forming a multi-speaker VTS system that converts silent video into acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Voice Filter: Few-shot text-to-speech speaker adaptation using voice conversion as a post-processing module [16.369219400819134]
State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations.
We propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker.
arXiv Detail & Related papers (2022-02-16T16:12:21Z)
- SIG-VC: A Speaker Information Guided Zero-shot Voice Conversion System for Both Human Beings and Machines [15.087294549955304]
We aim to obtain intermediate representations for speaker-content disentanglement of speech.
Speaker information control is added to our system to maintain the voice cloning performance.
Results show that our proposed system significantly reduces the trade-off problem in zero-shot voice conversion.
arXiv Detail & Related papers (2021-11-06T06:22:45Z)
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
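That four-encoder/one-decoder layout reads naturally as the skeleton below; every module here is a placeholder GRU standing in for the paper's actual networks, and feeding the same noisy features to all four encoders is a simplification (the speaker encoder, for instance, would normally see a reference utterance).

```python
# Structural sketch of a four-encoder/one-decoder layout; an architecture
# outline under stated assumptions, not Voicy's real networks.
import torch
import torch.nn as nn

class VoicySketch(nn.Module):
    def __init__(self, feat_dim: int = 80, hid: int = 256):
        super().__init__()
        make_enc = lambda: nn.GRU(feat_dim, hid, batch_first=True)
        self.speaker_enc = make_enc()
        self.content_enc = make_enc()
        self.phonetic_enc = make_enc()
        self.acoustic_asr_enc = make_enc()
        self.decoder = nn.GRU(4 * hid, feat_dim, batch_first=True)

    def forward(self, noisy_feats: torch.Tensor) -> torch.Tensor:
        encoders = (self.speaker_enc, self.content_enc,
                    self.phonetic_enc, self.acoustic_asr_enc)
        fused = torch.cat([e(noisy_feats)[0] for e in encoders], dim=-1)
        return self.decoder(fused)[0]          # reconstruct clean features
```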
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
- USTC-NELSLIP System Description for DIHARD-III Challenge [78.40959509760488]
The innovation of our system lies in the combination of various front-end techniques to solve the diarization problem.
Our best system achieved DERs of 11.30% in track 1 and 16.78% in track 2 on the evaluation set.
arXiv Detail & Related papers (2021-03-19T07:00:51Z)
- The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS [66.06385966689965]
This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020.
We consider a naive approach for voice conversion (VC): first transcribe the input speech with an automatic speech recognition (ASR) model, then synthesize the transcription with a text-to-speech (TTS) model.
We revisit this method under a sequence-to-sequence (seq2seq) framework by utilizing ESPnet, an open-source end-to-end speech processing toolkit.
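Since the baseline cascades ASR into TTS, the pipeline can be sketched with ESPnet2's inference wrappers as below; the model tags are placeholders and the exact API surface varies across ESPnet versions, so treat this as an assumption-laden outline rather than the challenge baseline recipe.

```python
# Outline of the ASR-then-TTS cascade using ESPnet2 inference wrappers.
# Model tags are hypothetical; APIs may differ between ESPnet versions.
import soundfile as sf
from espnet2.bin.asr_inference import Speech2Text
from espnet2.bin.tts_inference import Text2Speech

asr = Speech2Text.from_pretrained("espnet/placeholder_asr_tag")  # hypothetical tag
tts = Text2Speech.from_pretrained("espnet/placeholder_tts_tag")  # hypothetical tag

speech, rate = sf.read("source_utterance.wav")
text = asr(speech)[0][0]               # take the best ASR hypothesis
wav = tts(text)["wav"]                 # re-synthesise in the TTS model's voice
sf.write("converted.wav", wav.numpy(), tts.fs)
```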
arXiv Detail & Related papers (2020-10-06T02:27:38Z)
- Enhancing Speech Intelligibility in Text-To-Speech Synthesis using Speaking Style Conversion [17.520533341887642]
We propose a novel transfer learning approach using Tacotron- and WaveRNN-based TTS synthesis.
The proposed speech system exploits two modification strategies: (a) Lombard speaking style data and (b) Spectral Shaping and Dynamic Range Compression (SSDRC).
Intelligibility enhancement, as quantified by the Intelligibility in Bits measure, shows that the proposed Lombard-SSDRC TTS system yields significant relative improvements of 110% to 130% in speech-shaped noise (SSN) and 47% to 140% in competing-speaker noise (CSN).
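Of the two strategies, dynamic range compression is the easier to make concrete; the toy sketch below compresses the signal envelope so quieter speech segments are boosted relative to louder ones (spectral shaping is omitted, and the window length and compression exponent are illustrative choices, not the paper's parameters).

```python
# Toy dynamic-range compression in the spirit of SSDRC's DRC stage.
import numpy as np

def simple_drc(x: np.ndarray, sr: int = 16000, win_ms: float = 20.0,
               exponent: float = 0.5) -> np.ndarray:
    win = max(1, int(sr * win_ms / 1000))
    power = np.convolve(x ** 2, np.ones(win) / win, mode="same")
    env = np.sqrt(power) + 1e-8                  # short-term RMS envelope
    y = x * env ** (exponent - 1.0)              # boost quieter regions
    return y / (np.max(np.abs(y)) + 1e-8)        # renormalise the peak
```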
arXiv Detail & Related papers (2020-08-13T10:51:56Z)
- NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices from untranscribed speech of target speakers using the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)