Voice-preserving Zero-shot Multiple Accent Conversion
- URL: http://arxiv.org/abs/2211.13282v2
- Date: Sat, 14 Oct 2023 06:27:13 GMT
- Title: Voice-preserving Zero-shot Multiple Accent Conversion
- Authors: Mumin Jin, Prashant Serai, Jilong Wu, Andros Tjandra, Vimal Manohar,
Qing He
- Abstract summary: An accent conversion system changes a speaker's accent but preserves that speaker's voice identity.
We use adversarial learning to disentangle accent dependent features while retaining other acoustic characteristics.
Our model generates audio that sounds closer to the target accent and like the original speaker.
- Score: 14.218374374305421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most people who have tried to learn a foreign language would have experienced
difficulties understanding or speaking with a native speaker's accent. For
native speakers, understanding or speaking a new accent is likewise a difficult
task. An accent conversion system that changes a speaker's accent but preserves
that speaker's voice identity, such as timbre and pitch, has the potential for
a range of applications, such as communication, language learning, and
entertainment. Existing accent conversion models tend to change the speaker
identity and accent at the same time. Here, we use adversarial learning to
disentangle accent dependent features while retaining other acoustic
characteristics. What sets our work apart from existing accent conversion
models is the capability to convert an unseen speaker's utterance to multiple
accents while preserving its original voice identity. Subjective evaluations
show that our model generates audio that sounds closer to the target accent and
like the original speaker.
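The adversarial disentanglement described in the abstract is commonly realized with a gradient-reversal layer: an accent classifier is trained to predict the accent from intermediate features, while the feature encoder receives the negated classifier gradient and so learns accent-independent representations. The following NumPy toy is a minimal sketch of that mechanism only; all shapes, data, and hyperparameters are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: encoder W_enc maps 8-dim "acoustic" frames to 4-dim features;
# an accent classifier W_cls tries to predict one of 3 accents from them.
W_enc = rng.normal(scale=0.1, size=(8, 4))
W_cls = rng.normal(scale=0.1, size=(4, 3))
lam = 1.0   # gradient-reversal strength
lr = 0.02

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

x = rng.normal(size=(16, 8))        # a batch of toy acoustic frames
y = rng.integers(0, 3, size=16)     # toy accent labels

for step in range(100):
    h = x @ W_enc                   # encoder forward
    logits = h @ W_cls              # accent classifier forward
    p = softmax(logits)

    # Cross-entropy gradient with respect to the logits.
    d_logits = p.copy()
    d_logits[np.arange(16), y] -= 1.0
    d_logits /= 16

    # Classifier update: it *minimizes* accent loss (tries to detect accent).
    g_cls = h.T @ d_logits
    d_h = d_logits @ W_cls.T

    # Gradient reversal: the encoder receives the *negated* gradient, so it
    # is pushed toward features the accent classifier cannot exploit.
    g_enc = x.T @ (-lam * d_h)

    W_cls -= lr * g_cls
    W_enc -= lr * g_enc
```

In a real system the encoder and classifier would be deep networks and the retained features would feed a synthesis decoder, but the min-max signal flow is the same.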
Related papers
- Improving Self-supervised Pre-training using Accent-Specific Codebooks [48.409296549372414]
We propose an accent-aware adaptation technique for self-supervised learning.
On the Mozilla Common Voice dataset, our proposed approach outperforms all other accent-adaptation approaches.
arXiv Detail & Related papers (2024-07-04T08:33:52Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
arXiv Detail & Related papers (2023-10-24T16:10:58Z)
- Modelling low-resource accents without accent-specific TTS frontend [4.185844990558149]
This work focuses on modelling a speaker's accent that does not have a dedicated text-to-speech (TTS) frontend.
We propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion.
We then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the target accent.
arXiv Detail & Related papers (2023-01-11T18:00:29Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Explicit Intensity Control for Accented Text-to-speech [65.35831577398174]
Controlling the intensity of accent during TTS is an interesting research direction.
Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, then adjusts the loss weight to control accent intensity.
This paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS.
arXiv Detail & Related papers (2022-10-27T12:23:41Z)
- Analysis of French Phonetic Idiosyncrasies for Accent Recognition [0.8602553195689513]
Differences in pronunciation, accent, and intonation create some of the most common problems in speech recognition.
We use traditional machine learning techniques and convolutional neural networks, and show that the classical techniques are not sufficiently efficient to solve this problem.
In this paper, we focus our attention on the French accent. We also identify its limitations by analysing the impact of French idiosyncrasies on its spectrograms.
arXiv Detail & Related papers (2021-10-18T10:50:50Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network.
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Defending Your Voice: Adversarial Attack on Voice Conversion [70.19396655909455]
We report the first known attempt to perform adversarial attack on voice conversion.
We introduce human-imperceptible noise into the utterances of a speaker whose voice is to be defended.
It was shown that the speaker characteristics of the converted utterances were made obviously different from those of the defended speaker.
arXiv Detail & Related papers (2020-05-18T14:51:54Z)
- AccentDB: A Database of Non-Native English Accents to Assist Neural Speech Recognition [3.028098724882708]
We first spell out the key requirements for creating a well-curated database of speech samples in non-native accents for training and testing robust ASR systems.
We then introduce AccentDB, one such database that contains samples of 4 Indian-English accents collected by us.
We present several accent classification models and evaluate them thoroughly against human-labelled accent classes.
arXiv Detail & Related papers (2020-05-16T12:38:30Z)
- Generating Multilingual Voices Using Speaker Space Translation Based on Bilingual Speaker Data [15.114637085644057]
We show that a simple transform in speaker space can be used to control the degree of accent of a synthetic voice in a language.
The same transform can be applied even to monolingual speakers.
arXiv Detail & Related papers (2020-04-10T10:01:53Z)
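The speaker-space translation idea in the last entry can be illustrated with a small NumPy sketch. Here the accent direction is estimated as the mean displacement between bilingual speakers' paired embeddings and then applied, scaled, to any speaker's embedding; the embedding dimension, the toy data, and the `shift_accent` helper are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 16-dim speaker embeddings for 5 bilingual speakers; each speaker has
# one embedding per language. Real embeddings would come from a trained
# multilingual TTS model.
emb_lang_a = rng.normal(size=(5, 16))
emb_lang_b = emb_lang_a + 0.5 + 0.1 * rng.normal(size=(5, 16))

# Estimate the accent direction as the mean displacement between the two
# languages across the bilingual speakers.
accent_direction = (emb_lang_b - emb_lang_a).mean(axis=0)

def shift_accent(speaker_emb, alpha):
    """Move a speaker embedding along the estimated accent direction.

    alpha = 0 keeps the original voice; alpha = 1 applies the full shift;
    intermediate values interpolate the degree of accent.
    """
    return speaker_emb + alpha * accent_direction

mono_speaker = rng.normal(size=16)       # a monolingual speaker's embedding
half_accented = shift_accent(mono_speaker, 0.5)
```

Because the transform is a simple vector offset in speaker space, it can be applied to monolingual speakers who were never paired across languages, which is the point the abstract makes.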
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.