PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and
Pause-based Prosody Modeling
- URL: http://arxiv.org/abs/2306.07489v1
- Date: Tue, 13 Jun 2023 01:36:55 GMT
- Title: PauseSpeech: Natural Speech Synthesis via Pre-trained Language Model and
Pause-based Prosody Modeling
- Authors: Ji-Sang Hwang, Sang-Hoon Lee, and Seong-Whan Lee
- Abstract summary: We propose PuaseSpeech, a speech synthesis system with a pre-trained language model and pause-based prosody modeling.
Experimental results show PauseSpeech outperforms previous models in terms of naturalness.
- Score: 25.966328901566815
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Although text-to-speech (TTS) systems have significantly improved, most TTS
systems still have limitations in synthesizing speech with appropriate
phrasing. For natural speech synthesis, it is important to synthesize the
speech with a phrasing structure that groups words into phrases based on
semantic information. In this paper, we propose PuaseSpeech, a speech synthesis
system with a pre-trained language model and pause-based prosody modeling.
First, we introduce a phrasing structure encoder that utilizes a context
representation from the pre-trained language model. In the phrasing structure
encoder, we extract a speaker-dependent syntactic representation from the
context representation and then predict a pause sequence that separates the
input text into phrases. Furthermore, we introduce a pause-based word encoder
to model word-level prosody based on pause sequence. Experimental results show
PauseSpeech outperforms previous models in terms of naturalness. Furthermore,
in terms of objective evaluations, we can observe that our proposed methods
help the model decrease the distance between ground-truth and synthesized
speech. Audio samples are available at
https://jisang93.github.io/pausespeech-demo/.
Related papers
- Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines independent voice activity detection and text-to-speech.
We show how Moshi Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is first real-time full spoken large language model modality.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - HierSpeech++: Bridging the Gap between Semantic and Acoustic
Representation of Speech by Hierarchical Variational Inference for Zero-shot
Speech Synthesis [39.892633589217326]
Large language models (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC)
arXiv Detail & Related papers (2023-11-21T09:07:11Z) - NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot
Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio predictor with residual vectorizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, synthesis, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z) - Duration-aware pause insertion using pre-trained language model for
multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
arXiv Detail & Related papers (2023-02-27T10:40:41Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder
Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z) - JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to
Speech [7.476901945542385]
We present end-to-end text-to-speech (E2E-TTS) model which has a simplified training pipeline and outperforms a cascade of separately learned models.
Our proposed model is jointly trained FastSpeech2 and HiFi-GAN with an alignment module.
Experiments on LJSpeech corpus shows that the proposed model outperforms publicly available, state-of-the-art implementations of ESPNet2-TTS.
arXiv Detail & Related papers (2022-03-31T07:25:11Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language via an end-to-end way.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.