AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
- URL: http://arxiv.org/abs/2409.09098v1
- Date: Fri, 13 Sep 2024 06:05:10 GMT
- Title: AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
- Authors: Jinzuomu Zhong, Korin Richmond, Zhiba Su, Siqi Sun
- Abstract summary: We propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS.
In the first stage, we achieve state-of-the-art (SOTA) performance on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers.
In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model.
- Score: 20.40688498862892
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While recent Zero-Shot Text-to-Speech (ZS-TTS) models have achieved high naturalness and speaker similarity, they fall short in accent fidelity and control. To address this issue, we propose zero-shot accent generation that unifies Foreign Accent Conversion (FAC), accented TTS, and ZS-TTS, with a novel two-stage pipeline. In the first stage, we achieve state-of-the-art (SOTA) performance on Accent Identification (AID), with an F1 score of 0.56 on unseen speakers. In the second stage, we condition a ZS-TTS system on the pretrained speaker-agnostic accent embeddings extracted by the AID model. The proposed system achieves higher accent fidelity on inherent/cross accent generation, and enables unseen accent generation.
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that can convert many accents into native to overcome these issues.
(2024-09-30T19:52:10Z)
- Accented Speech Recognition With Accent-specific Codebooks [53.288874858671576]
Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems.
Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR.
We propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks.
(2023-10-24T16:10:58Z)
- DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
(2023-06-25T06:46:36Z)
- Modelling low-resource accents without accent-specific TTS frontend [4.185844990558149]
This work focuses on modelling a speaker's accent that does not have a dedicated text-to-speech (TTS) frontend.
We propose an approach whereby we first augment the target accent data to sound like the donor voice via voice conversion.
We then train a multi-speaker multi-accent TTS model on the combination of recordings and synthetic data, to generate the target accent.
(2023-01-11T18:00:29Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
(2022-11-17T07:17:24Z)
- Explicit Intensity Control for Accented Text-to-speech [65.35831577398174]
How to control the intensity of accent in the process of TTS is a very interesting research direction.
Recent work designs a speaker-adversarial loss to disentangle the speaker and accent information, and then adjusts the loss weight to control the accent intensity.
This paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS.
(2022-10-27T12:23:41Z)
- Controllable Accented Text-to-Speech Synthesis [76.80549143755242]
We propose a neural TTS architecture that allows us to control the accent and its intensity during inference.
This is the first study of accented TTS synthesis with explicit intensity control.
(2022-09-22T06:13:07Z)
- Black-box Adaptation of ASR for Accented Speech [52.63060669715216]
We introduce the problem of adapting a black-box, cloud-based ASR system to speech from a target accent.
We propose a novel coupling of an open-source accent-tuned local model with the black-box service.
Our fine-grained merging algorithm is better at fixing accent errors than existing word-level combination strategies.
(2020-06-24T07:07:49Z)