Are disentangled representations all you need to build speaker
anonymization systems?
- URL: http://arxiv.org/abs/2208.10497v2
- Date: Wed, 24 Aug 2022 08:25:54 GMT
- Title: Are disentangled representations all you need to build speaker
anonymization systems?
- Authors: Pierre Champion (MULTISPEECH, LIUM), Denis Jouvet (MULTISPEECH),
Anthony Larcher (LIUM)
- Abstract summary: Speech signals contain a lot of sensitive information, such as the speaker's identity.
Speaker anonymization aims to transform a speech signal to remove the source speaker's identity while leaving the spoken content unchanged.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech signals contain a lot of sensitive information, such as the speaker's
identity, which raises privacy concerns when speech data are collected. Speaker
anonymization aims to transform a speech signal to remove the source speaker's
identity while leaving the spoken content unchanged. Current methods perform
the transformation by relying on content/speaker disentanglement and voice
conversion. Usually, an acoustic model from an automatic speech recognition
system extracts the content representation while an x-vector system extracts
the speaker representation. Prior work has shown that the extracted features
are not perfectly disentangled. This paper tackles how to improve feature
disentanglement, and thus the quality of the converted anonymized speech. We propose enhancing
the disentanglement by removing speaker information from the acoustic model
using vector quantization. Evaluation with the VoicePrivacy 2022 toolkit
showed that vector quantization helps conceal the original speaker identity
while maintaining utility for speech recognition.
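The proposed mechanism, removing residual speaker information from the acoustic model's content features with a vector-quantization bottleneck, can be pictured with a short sketch. The code below is a minimal, hypothetical illustration rather than the authors' implementation; the class and parameter names (VQBottleneck, num_codes, dim, beta) are assumptions.

```python
# Hedged sketch of a VQ bottleneck on frame-level content features.
# Not the paper's code; names and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQBottleneck(nn.Module):
    def __init__(self, num_codes: int = 256, dim: int = 256, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, content: torch.Tensor):
        # content: (batch, frames, dim) continuous features from the acoustic model
        flat = content.reshape(-1, content.size(-1))
        # Nearest codebook entry per frame (Euclidean distance)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        idx = idx.view(content.shape[:-1])          # (batch, frames)
        quantized = self.codebook(idx)              # (batch, frames, dim)
        # Standard VQ-VAE codebook + commitment losses
        vq_loss = F.mse_loss(quantized, content.detach()) \
                + self.beta * F.mse_loss(content, quantized.detach())
        # Straight-through estimator so gradients reach the encoder
        quantized = content + (quantized - content).detach()
        return quantized, idx, vq_loss


# Toy usage: quantize a batch of 100-frame content features
feats = torch.randn(4, 100, 256)
vq = VQBottleneck()
q, codes, loss = vq(feats)
print(q.shape, codes.shape, loss.item())
```

The intuition is that forcing frame-level content features through a small discrete codebook limits the capacity available for residual speaker cues, which is how vector quantization is expected to improve disentanglement here.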
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Privacy-Preserving Speech Representation Learning using Vector Quantization [0.0]
Speech signals contain a lot of sensitive information, such as the speaker's identity, which raises privacy concerns.
This paper aims to produce an anonymous representation while preserving speech recognition performance.
arXiv Detail & Related papers (2022-03-15T14:01:11Z)
- Differentially Private Speaker Anonymization [44.90119821614047]
Sharing real-world speech utterances is key to the training and deployment of voice-based services.
Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact.
We show that disentanglement is indeed not perfect: linguistic and prosodic attributes still contain speaker information.
arXiv Detail & Related papers (2022-02-23T23:20:30Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
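The VQ-plus-mutual-information recipe summarized above can be illustrated with a small sketch. This uses a CLUB-style upper bound on the mutual information between a content code and a speaker embedding as the correlation penalty; the estimator choice and all names (MIUpperBound, content_dim, speaker_dim, hidden) are assumptions and may differ from the cited paper's exact setup.

```python
# Hedged sketch: penalize content/speaker correlation with a CLUB-style
# mutual-information upper bound (illustrative, not the paper's exact recipe).
import torch
import torch.nn as nn


class MIUpperBound(nn.Module):
    """Variational net q(speaker | content) used to estimate an MI upper bound."""

    def __init__(self, content_dim: int = 256, speaker_dim: int = 192, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(content_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, speaker_dim)
        self.logvar = nn.Linear(hidden, speaker_dim)

    def log_likelihood(self, content, speaker):
        h = self.net(content)
        mu, logvar = self.mu(h), self.logvar(h)
        return (-0.5 * (speaker - mu) ** 2 / logvar.exp() - 0.5 * logvar).sum(-1)

    def mi_estimate(self, content, speaker):
        # Positive pairs vs. shuffled (marginal) pairs -> CLUB upper bound on MI
        pos = self.log_likelihood(content, speaker)
        neg = self.log_likelihood(content, speaker[torch.randperm(speaker.size(0))])
        return (pos - neg).mean()


# Toy usage: add the MI estimate to the conversion loss so the content code is
# discouraged from carrying speaker information. In practice the estimator
# itself is trained on positive pairs in a separate optimization step.
content_code = torch.randn(8, 256)  # e.g. utterance-level VQ content features
speaker_emb = torch.randn(8, 192)   # e.g. x-vector-style speaker embedding
mi = MIUpperBound()
print(mi.mi_estimate(content_code, speaker_emb).item())
```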
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z)
- Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation [17.344080729609026]
We introduce the concept of attribute-driven privacy preservation in speaker voice representation.
It allows a person to hide one or more personal aspects from a potential malicious interceptor and from the application provider.
We propose an adversarial autoencoding method that disentangles a given speaker attribute within the voice representation, thus allowing its concealment.
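One common way to realize such an adversarial disentanglement objective is a gradient-reversal layer placed in front of an attribute classifier; the sketch below is an illustration under that assumption, not necessarily the cited paper's exact method, and all names are hypothetical.

```python
# Hedged sketch: gradient reversal for adversarial attribute concealment.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the gradient so the upstream encoder learns to hide the attribute
        return -ctx.lam * grad_output, None


def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)


# Toy usage: the classifier learns to predict an attribute (e.g. gender) from a
# speaker representation, while the reversed gradient pushes the (hypothetical)
# upstream encoder to remove that attribute.
embedding = torch.randn(8, 192, requires_grad=True)  # stand-in for an encoder output
classifier = nn.Linear(192, 2)
logits = classifier(grad_reverse(embedding))
loss = F.cross_entropy(logits, torch.randint(0, 2, (8,)))
loss.backward()
print(embedding.grad.abs().mean().item())
```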
arXiv Detail & Related papers (2020-12-08T14:47:23Z)
- Speaker De-identification System using Autoencoders and Adversarial Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increases the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint [11.982748481062542]
This paper presents a system involving feedback constraint for multispeaker speech synthesis.
We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.