Disentangling segmental and prosodic factors to non-native speech comprehensibility
- URL: http://arxiv.org/abs/2408.10997v1
- Date: Tue, 20 Aug 2024 16:43:55 GMT
- Title: Disentangling segmental and prosodic factors to non-native speech comprehensibility
- Authors: Waris Quamer, Ricardo Gutierrez-Osuna,
- Abstract summary: Current accent conversion systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics.
We present an AC system that decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics.
We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody on the perceived comprehensibility of non-native speech.
- Score: 11.098498920630782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current accent conversion (AC) systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics. Being able to manipulate a non-native speaker's segmental and/or prosodic channels independently is critical to quantify how these two channels contribute to speech comprehensibility and social attitudes. We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics. The system is able to generate accent conversions that combine (1) the segmental characteristics from a source utterance, (2) the voice characteristics from a target utterance, and (3) the prosody of a reference utterance. We show that vector quantization of acoustic embeddings and removal of consecutive duplicated codewords allows the system to transfer prosody and improve voice similarity. We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody on the perceived comprehensibility of non-native speech. Our results indicate that, contrary to prior research in non-native speech, segmental features have a larger impact on comprehensibility than prosody. The proposed AC system may also be used to study how segmental and prosody cues affect social attitudes towards non-native speech.
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z) - Analyzing Speech Unit Selection for Textless Speech-to-Speech Translation [23.757896930482342]
This work explores the selection process through a study of downstream tasks.
Units that perform well in resynthesis performance do not necessarily correlate with those that enhance translation efficacy.
arXiv Detail & Related papers (2024-07-08T08:53:26Z) - A Novel Labeled Human Voice Signal Dataset for Misbehavior Detection [0.7223352886780369]
This research highlights the significance of voice tone and delivery in automated machine-learning systems for voice analysis and recognition.
It contributes to the broader field of voice signal analysis by elucidating the impact of human behaviour on the perception and categorization of voice signals.
arXiv Detail & Related papers (2024-06-28T18:55:07Z) - Transfer the linguistic representations from TTS to accent conversion
with non-parallel data [7.376032484438044]
Accent conversion aims to convert the accent of a source speech to a target accent, preserving the speaker's identity.
This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech.
arXiv Detail & Related papers (2024-01-07T16:39:34Z) - Disentangling Prosody Representations with Unsupervised Speech
Reconstruction [22.873286925385543]
The aim of this paper is to address the disentanglement of emotional prosody from speech based on unsupervised reconstruction.
Specifically, we identify, design, implement and integrate three crucial components in our proposed speech reconstruction model Prosody2Vec.
We first pretrain the Prosody2Vec representations on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks.
arXiv Detail & Related papers (2022-12-14T01:37:35Z) - A unified one-shot prosody and speaker conversion system with
self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Preliminary study on using vector quantization latent spaces for TTS/VC
systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
arXiv Detail & Related papers (2021-06-25T07:51:35Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - On the Mutual Information between Source and Filter Contributions for
Voice Pathology Detection [11.481208551940998]
This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal.
Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody.
arXiv Detail & Related papers (2020-01-02T10:04:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.