Related papers: Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

URL: http://arxiv.org/abs/2408.00284v1
Date: Thu, 1 Aug 2024 04:57:31 GMT
Title: Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
Authors: Xinhan Di, Zihao Chen, Yunming Liang, Junjie Zheng, Yihua Wang, Chaofan Ding,
Abstract summary: Bailing-TTS is a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. The Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation.
Score: 3.9166923630129604
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large-scale text-to-speech (TTS) models have made significant progress recently.However, they still fall short in the generation of Chinese dialectal speech. Toaddress this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, the Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With the proposed design of novel network architecture and corresponding strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards human-like spontaneous representation. Readers are encouraged to listen to demos at \url{https://c9412600.github.io/bltts_tech_report/index.html}.

Related papers

What Makes a Good Speech Tokenizer for LLM-Centric Speech Generation? A Systematic Study [58.55905182336196]
Speech-language models (SLMs) offer a promising path toward unifying speech and text understanding and generation.<n>We investigate the role of speech tokenizer designs in LLM-centric SLMs, augmented by speech heads and speaker modeling.<n>We introduce multi-token prediction (MTP) into SLMs, enabling each hidden state to decode multiple speech tokens.
arXiv Detail & Related papers (2025-06-14T15:26:31Z)
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion. We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process. Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST)
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model. It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone. We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks. We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes. Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
A Novel Chinese Dialect TTS Frontend with Non-Autoregressive Neural Machine Translation [6.090922774386845]
We propose a novel Chinese dialect TTS with a translation module. It helps to convert Mandarin text into idiomatic expressions with correct orthography and grammar. It is the first known work to incorporate translation with TTS.
arXiv Detail & Related papers (2022-06-10T07:46:34Z)
Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way. Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously. We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
Towards Natural Bilingual and Code-Switched Speech Synthesis Based on Mix of Monolingual Recordings and Cross-Lingual Voice Conversion [28.830575877307176]
It is not easy to obtain a bilingual corpus from a speaker who achieves native-level fluency in both languages. A Tacotron2-based cross-lingual voice conversion system is employed to generate the Mandarin speaker's English speech and the English speaker's Mandarin speech. The obtained bilingual data are then augmented with code-switched utterances synthesized using a Transformer model.
arXiv Detail & Related papers (2020-10-16T03:51:00Z)
Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS [74.11899135025503]
We extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks. We show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
arXiv Detail & Related papers (2020-08-11T07:57:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.