Disentangling Prosody Representations with Unsupervised Speech
Reconstruction
- URL: http://arxiv.org/abs/2212.06972v2
- Date: Tue, 26 Sep 2023 02:42:11 GMT
- Title: Disentangling Prosody Representations with Unsupervised Speech
Reconstruction
- Authors: Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, Fuji Ren
and Stefan Wermter
- Abstract summary: The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction.
Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec.
We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks.
- Score: 22.873286925385543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human speech can be characterized by different components, including semantic
content, speaker identity and prosodic information. Significant progress has
been made in disentangling representations for semantic content and speaker
identity in Automatic Speech Recognition (ASR) and speaker verification tasks
respectively. However, it remains an open and challenging research question to
extract prosodic information because of the intrinsic association of different
attributes, such as timbre and rhythm, and because of the need for supervised
training schemes to achieve robust large-scale and speaker-independent ASR. The
aim of this paper is to address the disentanglement of emotional prosody from
speech based on unsupervised reconstruction. Specifically, we identify, design,
implement and integrate three crucial components in our proposed speech
reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech
signals into discrete units for semantic content, (2) a pretrained speaker
verification model to generate speaker identity embeddings, and (3) a trainable
prosody encoder to learn prosody representations. We first pretrain the
Prosody2Vec representations on unlabelled emotional speech corpora, then
fine-tune the model on specific datasets to perform Speech Emotion Recognition
(SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and
unweighted accuracies) and subjective (mean opinion score) evaluations on the
EVC task suggest that Prosody2Vec effectively captures general prosodic
features that can be smoothly transferred to other emotional speech. In
addition, our SER experiments on the IEMOCAP dataset reveal that the prosody
features learned by Prosody2Vec are complementary and beneficial for the
performance of widely used speech pretraining models and surpass the
state-of-the-art methods when combining Prosody2Vec with HuBERT
representations.
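
To make the three components above concrete, the following is a minimal PyTorch sketch of how a Prosody2Vec-style reconstruction model could be wired together. It is an illustration under stated assumptions, not the published architecture: the discrete units and the speaker embedding are assumed to come from external pretrained models, all module names and dimensions are invented, and the units are assumed to be upsampled to the mel frame rate.

```python
# Hypothetical wiring of a Prosody2Vec-style reconstruction model; module names and
# sizes are illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn

class ProsodyReconstructor(nn.Module):
    def __init__(self, n_units=100, unit_dim=256, spk_dim=192,
                 prosody_dim=128, n_mels=80):
        super().__init__()
        # (1) semantic content: embeddings of discrete units from a pretrained unit encoder
        self.unit_emb = nn.Embedding(n_units, unit_dim)
        # (3) trainable prosody encoder; the speaker embedding (2) is supplied externally
        self.prosody_enc = nn.GRU(n_mels, prosody_dim, batch_first=True)
        # decoder reconstructs the mel spectrogram from the concatenated factors
        self.decoder = nn.GRU(unit_dim + spk_dim + prosody_dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, units, spk_emb, mel):
        # units: (B, T) discrete content units (assumed aligned to the mel frames),
        # spk_emb: (B, spk_dim), mel: (B, T, n_mels)
        content = self.unit_emb(units)                            # (B, T, unit_dim)
        prosody, _ = self.prosody_enc(mel)                        # (B, T, prosody_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, units.size(1), -1)  # broadcast over time
        dec, _ = self.decoder(torch.cat([content, spk, prosody], dim=-1))
        return self.out(dec)                                      # reconstructed mel

# Unsupervised pretraining objective: reconstruct the input mel spectrogram.
model = ProsodyReconstructor()
units = torch.randint(0, 100, (4, 200))
spk_emb = torch.randn(4, 192)            # from a pretrained speaker verification model
mel = torch.randn(4, 200, 80)
recon_loss = nn.functional.l1_loss(model(units, spk_emb, mel), mel)
```

In the actual model, additional design choices presumably keep the prosody encoder from simply copying content or speaker information; those details are omitted here.

The SER result reported above (Prosody2Vec complementing HuBERT) can likewise be pictured as a simple late fusion of the two representations; the classifier below is a hypothetical sketch, with the feature dimensions and the four-class emotion set assumed for illustration.

```python
# Hypothetical SER head combining an utterance-level prosody embedding with pooled
# HuBERT features; dimensions and names are assumptions, not the authors' setup.
import torch
import torch.nn as nn

class FusionSER(nn.Module):
    def __init__(self, prosody_dim=128, hubert_dim=768, n_emotions=4):
        super().__init__()
        self.classifier = nn.Linear(prosody_dim + hubert_dim, n_emotions)

    def forward(self, prosody_emb, hubert_feats):
        # prosody_emb: (B, prosody_dim); hubert_feats: (B, T, hubert_dim), mean-pooled here
        fused = torch.cat([prosody_emb, hubert_feats.mean(dim=1)], dim=-1)
        return self.classifier(fused)

logits = FusionSER()(torch.randn(8, 128), torch.randn(8, 200, 768))
```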
Related papers
- Exploring VQ-VAE with Prosody Parameters for Speaker Anonymization [0.5497663232622965]
This article investigates a novel speaker anonymization approach using an end-to-end network based on a Vector-Quantized Variational Auto-Encoder (VQ-VAE).
The network is designed to disentangle speech components in order to target and modify speaker identity while preserving the linguistic and emotional content.
Findings indicate that this method outperforms most baseline techniques in preserving emotional information.
arXiv Detail & Related papers (2024-09-24T08:55:10Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Self-supervised speech unit discovery from articulatory and acoustic features using VQ-VAE [2.771610203951056]
This study examines how articulatory information can be used for discovering speech units in a self-supervised setting.
We used vector-quantized variational autoencoders (VQ-VAE) to learn discrete representations from articulatory and acoustic speech data.
Experiments were conducted on three different corpora in English and French.
arXiv Detail & Related papers (2022-06-17T14:04:24Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training (a generic VQ sketch appears after this list).
Experimental results show that the proposed method learns effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- An Attribute-Aligned Strategy for Learning Speech Representation [57.891727280493015]
We propose an attribute-aligned learning strategy to derive speech representation that can flexibly address these issues by attribute-selection mechanism.
Specifically, we propose a layered-representation variational autoencoder (LR-VAE), which factorizes speech representation into attribute-sensitive nodes.
Our proposed method achieves competitive performance on identity-free SER and better performance on emotionless SV.
arXiv Detail & Related papers (2021-06-05T06:19:14Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
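
Several of the related papers above (the VQ-VAE anonymization work, the articulatory unit-discovery study, and VQMIVC) rely on a vector-quantization bottleneck to obtain discrete, content-oriented speech representations. The following is a generic straight-through VQ layer, included purely as background; it follows the standard VQ-VAE formulation and is not code from any of the cited papers.

```python
# Generic VQ-VAE-style quantizer with a straight-through estimator; standard
# formulation shown for background only, not taken from the papers above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment cost

    def forward(self, z):                       # z: (batch, frames, code_dim)
        flat = z.reshape(-1, z.size(-1))
        # squared distances from each frame vector to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        codes = dist.argmin(dim=1)
        z_q = self.codebook(codes).view_as(z)
        # codebook and commitment losses (the two VQ-VAE objective terms)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        # straight-through estimator: gradients bypass the non-differentiable argmin
        z_q = z + (z_q - z).detach()
        return z_q, codes.view(z.shape[:-1]), loss
```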
This list is automatically generated from the titles and abstracts of the papers on this site.