DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
- URL: http://arxiv.org/abs/2306.14145v1
- Date: Sun, 25 Jun 2023 06:46:36 GMT
- Title: DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech
- Authors: Sen Liu, Yiwei Guo, Chenpeng Du, Xie Chen, Kai Yu
- Abstract summary: Cross-lingual text-to-speech (CTTS) is still far from satisfactory, as it is difficult to accurately retain speaker timbre.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
- Score: 30.110058338155675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although high-fidelity speech can be obtained for intralingual
speech synthesis, cross-lingual text-to-speech (CTTS) is still far from
satisfactory, as it is difficult to accurately retain speaker timbre (i.e.,
speaker similarity) and eliminate accents from the speaker's first language
(i.e., nativeness). In this paper, we demonstrate that vector-quantized (VQ)
acoustic features contain less speaker information than mel-spectrograms.
Based on this finding, we propose a novel dual speaker embedding TTS
(DSE-TTS) framework for CTTS with an authentic speaking style. Here, one
embedding is fed to the acoustic model to learn the linguistic speaking
style, while the other is integrated into the vocoder to mimic the target
speaker's timbre. Experiments show that by combining both embeddings,
DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in
cross-lingual synthesis, especially in terms of nativeness.
Related papers
- Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT [29.167336994990542]
Cross-dialect text-to-speech (CD-TTS) is the task of synthesizing learned speakers' voices in non-native dialects.
We present a novel TTS model, built from three sub-modules, that performs competitively on this task.
arXiv Detail & Related papers (2024-09-11T13:40:27Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
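The joint multimodal space described above is typically learned with a symmetric contrastive objective over paired embeddings. Below is a generic CLIP-style sketch under that assumption; the actual CTAP objective may differ in detail.

```python
# Generic CLIP-style symmetric contrastive loss between paired phoneme and
# speech embeddings; the actual CTAP objective may differ in detail.
import torch
import torch.nn.functional as F

def contrastive_loss(phone_emb, speech_emb, temperature=0.07):
    p = F.normalize(phone_emb, dim=-1)               # (B, D)
    s = F.normalize(speech_emb, dim=-1)              # (B, D)
    logits = p @ s.t() / temperature                 # (B, B); diagonal = positives
    targets = torch.arange(p.size(0))
    # Phoneme-to-speech and speech-to-phoneme directions, averaged.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```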
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- CrossSpeech: Speaker-independent Acoustic Representation for Cross-lingual Speech Synthesis [7.6883773606941075]
CrossSpeech improves the quality of cross-lingual speech by effectively disentangling speaker and language information.
Experiments verify that CrossSpeech achieves significant improvements in cross-lingual TTS.
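The summary does not say how the disentanglement is done. As one generic illustration (not necessarily CrossSpeech's mechanism), a gradient reversal layer is a common way to strip speaker information from an intermediate representation:

```python
# Gradient reversal: a common generic trick for removing speaker information
# from an intermediate representation (shown for illustration only; not
# confirmed as CrossSpeech's mechanism).
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale=1.0):
        ctx.scale = scale
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        # Flipping the gradient teaches the encoder to *confuse* a speaker
        # classifier attached downstream, stripping speaker cues.
        return -ctx.scale * grad_out, None

feats = torch.randn(4, 256, requires_grad=True)
rev = GradReverse.apply(feats)   # feed this to an auxiliary speaker classifier
```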
arXiv Detail & Related papers (2023-02-28T07:51:10Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
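As a rough illustration of how a diffusion-based TTS decoder produces output conditioned on a reference style, here is a toy ancestral-sampling loop. The noise schedule, network, and dimensions are placeholders, not Grad-StyleSpeech's actual design.

```python
# Toy conditional ancestral sampling for a diffusion decoder; the schedule,
# network, and dimensions are placeholders, not Grad-StyleSpeech's design.
import torch
import torch.nn as nn

T = 50                                        # toy schedule; real ones are longer
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
score_net = nn.Linear(80 + 128, 80)           # stand-in noise predictor

@torch.no_grad()
def reverse_step(x_t, t, style):
    # Predict the noise from (x_t, style), then step toward x_{t-1}.
    eps = score_net(torch.cat([x_t, style], dim=-1))
    mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise

x = torch.randn(1, 80)                        # start from pure noise
style = torch.randn(1, 128)                   # style from a few seconds of reference
for t in reversed(range(T)):
    x = reverse_step(x, t, style)             # x ends as the generated frame
```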
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
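A minimal sketch of the random masking the summary describes, where a subset of spectrogram frames and phonemes are hidden and later reconstructed; the mask ratios, tensor shapes, and [MASK] id are assumptions.

```python
# Sketch of the random masking described above: hide some spectrogram frames
# and some phonemes, then train the model to reconstruct them. The mask
# ratios, shapes, and [MASK] id are assumptions.
import torch

def mask_inputs(spectrogram, phonemes, spec_ratio=0.3, phone_ratio=0.15,
                mask_phone_id=0):
    spec = spectrogram.clone()                          # (B, frames, n_mels)
    spec_mask = torch.rand(spec.size(1)) < spec_ratio
    spec[:, spec_mask, :] = 0.0                         # zero out masked frames
    phones = phonemes.clone()                           # (B, n_phones)
    phone_mask = torch.rand(phones.size(1)) < phone_ratio
    phones[:, phone_mask] = mask_phone_id               # replace with [MASK] id
    # The masks select which positions the reconstruction loss is applied to.
    return spec, phones, spec_mask, phone_mask

spec, phones, sm, pm = mask_inputs(torch.randn(1, 200, 80),
                                   torch.randint(1, 100, (1, 40)))
```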
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We systematically model speaker characteristics to improve generalization to new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech data for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
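The key idea is that phonologically derived features, unlike phoneme identities, are shared across languages. A toy mapping under that assumption might look like this (the feature inventory and table entries are illustrative, not the paper's):

```python
# Toy phoneme-to-phonological-feature mapping: each phoneme becomes a binary
# feature vector shared across languages, so the model never sees
# language-specific phoneme ids. Inventory and entries are illustrative.
FEATURES = ["voiced", "nasal", "plosive", "fricative", "front", "high", "rounded"]

PHONE_TABLE = {  # tiny toy subset, not a full inventory
    "p": {"plosive"},
    "b": {"voiced", "plosive"},
    "m": {"voiced", "nasal"},
    "i": {"voiced", "front", "high"},
    "u": {"voiced", "high", "rounded"},
}

def phone_to_features(phone: str) -> list[int]:
    active = PHONE_TABLE[phone]
    return [1 if f in active else 0 for f in FEATURES]

print(phone_to_features("b"))  # [1, 0, 1, 0, 0, 0, 0]
```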
arXiv Detail & Related papers (2021-11-17T12:33:42Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With style-adaptive layer normalization (SALN), our model effectively synthesizes speech in the style of the target speaker even from a single reference audio clip.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
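The SALN mechanism mentioned above conditions layer normalization on the style vector: the gain and bias are predicted from the style embedding rather than learned as constants. A minimal sketch, with illustrative dimensions:

```python
# Sketch of style-adaptive layer normalization (SALN): the gain and bias of
# layer norm are predicted from the style vector instead of being fixed
# learned constants. Dimensions are illustrative.
import torch
import torch.nn as nn

class SALN(nn.Module):
    def __init__(self, d_hidden: int, d_style: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_hidden, elementwise_affine=False)
        self.affine = nn.Linear(d_style, 2 * d_hidden)   # predicts (gain, bias)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        gain, bias = self.affine(style).chunk(2, dim=-1)
        return gain.unsqueeze(1) * self.norm(x) + bias.unsqueeze(1)

saln = SALN(d_hidden=256, d_style=128)
out = saln(torch.randn(2, 50, 256), torch.randn(2, 128))  # style-conditioned
```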
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
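A discrete speech representation of this kind is commonly obtained with a vector-quantization bottleneck, where each encoder frame is snapped to its nearest codebook entry. A minimal sketch under that assumption (codebook size and dimensions are made up):

```python
# Minimal vector-quantization bottleneck of the kind commonly used for
# discrete speech representations; codebook size and dims are assumptions.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes: int = 512, d_code: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, d_code)

    def forward(self, z: torch.Tensor):
        # z: (B, T, D). Snap every frame to its nearest codebook entry.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        ids = dists.argmin(dim=-1)                   # discrete token ids (B, T)
        quantized = self.codebook(ids)               # (B, T, D)
        # Straight-through estimator so gradients still reach the encoder.
        return z + (quantized - z).detach(), ids

vq = VectorQuantizer()
z_q, ids = vq(torch.randn(2, 100, 64))               # untranscribed audio features
```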
arXiv Detail & Related papers (2020-05-16T15:47:11Z)