Disentangling Voice and Content with Self-Supervision for Speaker Recognition
- URL: http://arxiv.org/abs/2310.01128v3
- Date: Wed, 1 Nov 2023 16:27:54 GMT
- Title: Disentangling Voice and Content with Self-Supervision for Speaker Recognition
- Authors: Tianchi Liu, Kong Aik Lee, Qiongqiong Wang, Haizhou Li
- Abstract summary: This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments conducted on the VoxCeleb and SITW datasets, yielding 9.56% and 8.24% average reductions in EER and minDCF, respectively.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For speaker recognition, it is difficult to extract an accurate speaker
representation from speech because of its mixture of speaker traits and
content. This paper proposes a disentanglement framework that simultaneously
models speaker traits and content variability in speech. It is realized with
the use of three Gaussian inference layers, each consisting of a learnable
transition model that extracts distinct speech components. Notably, a
strengthened transition model is specifically designed to model complex speech
dynamics. We also propose a self-supervision method to dynamically disentangle
content without the use of labels other than speaker identities. The efficacy
of the proposed framework is validated via experiments conducted on the
VoxCeleb and SITW datasets with 9.56% and 8.24% average reductions in EER and
minDCF, respectively. Since it requires neither additional model training nor
extra data, the framework is readily applicable in practice.
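The abstract describes each inference layer as a Gaussian estimator driven by a learnable transition model. As a rough, hypothetical sketch of that idea (a diagonal Kalman-style recurrence in PyTorch; the names, shapes, and fixed unit observation noise are assumptions, not the authors' implementation):

```python
# Hypothetical sketch of one Gaussian inference layer with a learnable
# linear transition model; NOT the authors' released code.
import torch
import torch.nn as nn

class GaussianInferenceLayer(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.transition = nn.Linear(dim, dim)            # learnable transition model
        self.log_noise = nn.Parameter(torch.zeros(dim))  # process noise (log scale)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) frame-level features
        batch, time, dim = frames.shape
        mean = torch.zeros(batch, dim, device=frames.device)
        var = torch.ones(batch, dim, device=frames.device)
        outputs = []
        for t in range(time):
            # Predict: propagate the previous state through the transition model.
            pred_mean = self.transition(mean)
            pred_var = var + self.log_noise.exp()
            # Update: blend the prediction with the observed frame using a
            # diagonal Kalman-style gain (observation noise fixed to 1).
            gain = pred_var / (pred_var + 1.0)
            mean = pred_mean + gain * (frames[:, t] - pred_mean)
            var = (1.0 - gain) * pred_var
            outputs.append(mean)
        return torch.stack(outputs, dim=1)  # smoothed component, (batch, time, dim)
```

Three such layers, each with its own transition model (one strengthened to handle complex speech dynamics), would then extract the distinct speech components the abstract describes.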
Related papers
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
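As a hypothetical illustration of the speaker-invariant clustering objective summarized above: cluster assignments for an utterance and for a speaker-perturbed copy of it are pushed to agree, so the learned clusters capture content rather than voice (function and variable names are illustrative, not the Spin code):

```python
# Illustrative speaker-invariant clustering loss (not the official Spin code):
# content features from an utterance and from a speaker-perturbed version of it
# should receive the same cluster assignments.
import torch
import torch.nn.functional as F

def speaker_invariant_loss(feats: torch.Tensor,
                           perturbed_feats: torch.Tensor,
                           codebook: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    # feats, perturbed_feats: (time, dim); codebook: (num_clusters, dim)
    logits = feats @ codebook.t() / temperature            # (time, num_clusters)
    logits_pert = perturbed_feats @ codebook.t() / temperature
    # Targets: soft assignments from one view; prediction: the other view.
    targets = logits.softmax(dim=-1).detach()
    log_probs = logits_pert.log_softmax(dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()
```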
- ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
arXiv Detail & Related papers (2023-02-16T08:10:41Z)
- A Single Self-Supervised Model for Many Speech Modalities Enables Zero-Shot Modality Transfer [31.028408352051684]
We present u-HuBERT, a self-supervised pre-training framework that can leverage both multimodal and unimodal speech.
Our single model yields 1.2%/1.4%/27.2% speech recognition word error rate on LRS3 with audio-visual/audio/visual input.
arXiv Detail & Related papers (2022-07-14T16:21:33Z)
- Self supervised learning for robust voice cloning [3.7989740031754806]
We use features learned in a self-supervised framework to produce high quality speech representations.
The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture.
This method enables us to train the model on an unlabeled multispeaker dataset and to use unseen speaker embeddings to copy a speaker's voice.
arXiv Detail & Related papers (2022-04-07T13:05:24Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Cross-speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis [8.603535906880937]
Cross-speaker style transfer is crucial to the applications of multi-style and expressive speech synthesis at scale.
Existing style transfer methods still fall short of real application needs.
We propose a cross-speaker style transfer text-to-speech model with explicit prosody bottleneck.
arXiv Detail & Related papers (2021-07-27T02:43:57Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
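A minimal sketch of the recipe summarized above: vector quantization of the content stream with a straight-through gradient, plus a cross-covariance penalty that is only a crude stand-in for the mutual-information estimator used in the paper:

```python
# Sketch of VQ content encoding with a straight-through estimator. The
# correlation penalty below is a simple stand-in for the paper's MI term,
# not a faithful reproduction of it.
import torch
import torch.nn.functional as F

def vector_quantize(z: torch.Tensor, codebook: torch.Tensor):
    # z: (time, dim); codebook: (num_codes, dim)
    dists = torch.cdist(z, codebook)            # (time, num_codes)
    codes = dists.argmin(dim=-1)
    z_q = codebook[codes]
    commit_loss = F.mse_loss(z, z_q.detach()) + F.mse_loss(z.detach(), z_q)
    z_q = z + (z_q - z).detach()                # straight-through gradient
    return z_q, commit_loss

def correlation_penalty(content: torch.Tensor, speaker: torch.Tensor) -> torch.Tensor:
    # content: (batch, dim) time-averaged content vectors; speaker: (batch, dim).
    # Penalize the cross-covariance between the two representations across the
    # batch -- a crude proxy for minimizing their mutual information.
    c = content - content.mean(dim=0, keepdim=True)
    s = speaker - speaker.mean(dim=0, keepdim=True)
    cov = c.t() @ s / content.shape[0]          # (dim, dim) cross-covariance
    return cov.pow(2).mean()
```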
- End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 backbone and a Convolution-augmented Transformer (Conformer).
In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms.
We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
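The hybrid CTC/Attention objective mentioned above is conventionally a weighted sum of a CTC loss on encoder outputs and a cross-entropy loss on decoder outputs; a minimal sketch (the 0.3 weight and tensor shapes are illustrative assumptions):

```python
# Standard hybrid CTC/Attention objective: weighted sum of a CTC loss over
# encoder outputs and a cross-entropy loss over attention-decoder outputs.
import torch
import torch.nn.functional as F

def hybrid_ctc_attention_loss(encoder_log_probs, decoder_logits, targets,
                              input_lengths, target_lengths, ctc_weight=0.3):
    # encoder_log_probs: (time, batch, vocab) log-probabilities for CTC
    # decoder_logits:    (batch, target_len, vocab) decoder outputs
    # targets:           (batch, target_len) token indices (padding ignored here)
    ctc = F.ctc_loss(encoder_log_probs, targets, input_lengths, target_lengths)
    att = F.cross_entropy(decoder_logits.transpose(1, 2), targets)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att
```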
- A Hierarchical Transformer with Speaker Modeling for Emotion Recognition in Conversation [12.065178204539693]
Emotion Recognition in Conversation (ERC) is a personalized and interactive emotion recognition task.
Current methods model speakers' interactions by building a relation between every pair of speakers.
We simplify the complicated modeling to a binary version: Intra-Speaker and Inter-Speaker dependencies.
arXiv Detail & Related papers (2020-12-29T14:47:35Z)
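The Intra-Speaker/Inter-Speaker simplification described above can be expressed as two boolean attention masks derived from utterance speaker IDs; a toy sketch (names and shapes are assumptions):

```python
# Toy construction of Intra-Speaker / Inter-Speaker dependency masks:
# attention restricted to utterances from the same speaker, or to
# utterances from different speakers.
import torch

def speaker_masks(speaker_ids: torch.Tensor):
    # speaker_ids: (num_utterances,) integer speaker ID per utterance
    same = speaker_ids.unsqueeze(0) == speaker_ids.unsqueeze(1)
    intra_mask = same    # utterance i may attend to j iff same speaker
    inter_mask = ~same   # utterance i may attend to j iff different speaker
    return intra_mask, inter_mask

# Example: speakers [A, B, A] -> intra allows (0,0), (0,2), (1,1), (2,0), (2,2)
intra, inter = speaker_masks(torch.tensor([0, 1, 0]))
```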
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
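A hypothetical skeleton of the two-stream design summarized above: a shared low-level trunk feeding separate identity and content heads (layer sizes and names are assumptions, not the paper's architecture):

```python
# Illustrative two-stream network: shared low-level features, then explicit
# identity and content branches for disentanglement.
import torch.nn as nn

class TwoStream(nn.Module):
    def __init__(self, in_dim: int = 80, hidden: int = 256, emb: int = 128):
        super().__init__()
        # (1) Shared trunk: low-level features common to both representations.
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        # (2) Separate heads: a natural point to disentangle the two factors.
        self.identity_head = nn.Linear(hidden, emb)  # speaker factor
        self.content_head = nn.Linear(hidden, emb)   # linguistic-content factor

    def forward(self, x):
        h = self.trunk(x)
        return self.identity_head(h), self.content_head(h)
```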
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.