Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network
- URL: http://arxiv.org/abs/2109.10724v1
- Date: Wed, 22 Sep 2021 13:29:10 GMT
- Title: Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network
- Authors: Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari
- Abstract summary: We propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model.
Experimental results show that the proposed method requires about ten times less inference time to achieve comparable synthetic speech quality.
- Score: 41.4599368523939
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Incremental text-to-speech (TTS) synthesis generates utterances in small
linguistic units for the sake of real-time and low-latency applications. We
previously proposed an incremental TTS method that leverages a large
pre-trained language model to take unobserved future context into account
without waiting for the subsequent segment. Although this method achieves
comparable speech quality to that of a method that waits for the future
context, it entails a huge amount of processing for sampling from the language
model at each time step. In this paper, we propose an incremental TTS method
that directly predicts the unobserved future context with a lightweight model,
instead of sampling words from the large-scale language model. We perform
knowledge distillation from a GPT2-based context prediction network into a
simple recurrent model by minimizing a teacher-student loss defined between the
context embedding vectors of those models. Experimental results show that the
proposed method requires about ten times less inference time to achieve
comparable synthetic speech quality to that of our previous method, and it can
perform incremental synthesis much faster than the average speaking speed of
human English speakers, demonstrating the applicability of our method to
real-time applications.
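As a rough illustration of the distillation described above, the sketch below (PyTorch, assuming the Hugging Face transformers library) pairs a frozen GPT2-based teacher with a lightweight GRU student and minimizes a teacher-student loss between their context embedding vectors. The module names, dimensions, and the choice of L1 distance are illustrative assumptions rather than the paper's exact configuration; in particular, the paper's teacher predicts unobserved future context, which is simplified away here.

```python
# Minimal sketch of the teacher-student distillation described above.
# Module names, dimensions, and the L1 loss are illustrative assumptions,
# not the paper's exact configuration.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class StudentContextPredictor(nn.Module):
    """Lightweight recurrent model that predicts a context embedding."""
    def __init__(self, vocab_size, emb_dim=256, hidden_dim=256, ctx_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, ctx_dim)  # match teacher embedding size

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))
        return self.proj(h[:, -1])  # context embedding from the last state

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
teacher = GPT2Model.from_pretrained("gpt2").eval()   # frozen teacher
student = StudentContextPredictor(vocab_size=tokenizer.vocab_size)
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(observed_text):
    ids = tokenizer(observed_text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Teacher's context embedding, taken here as its final hidden state.
        teacher_ctx = teacher(ids).last_hidden_state[:, -1]
    student_ctx = student(ids)
    loss = nn.functional.l1_loss(student_ctx, teacher_ctx)  # teacher-student loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time only the student would be run, which is what removes the per-step cost of sampling from the large language model.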
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
- The Interpreter Understands Your Meaning: End-to-end Spoken Language Understanding Aided by Speech Translation [13.352795145385645]
Speech translation (ST) is a good means of pretraining speech models for end-to-end spoken language understanding.
We show that our models outperform baselines on monolingual and multilingual intent classification.
We also create new benchmark datasets for speech summarization and low-resource/zero-shot transfer from English to French or Spanish.
arXiv Detail & Related papers (2023-05-16T17:53:03Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones (a sketch of this idea appears after this list).
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM).
Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- End-to-End Text-to-Speech using Latent Duration based on VQ-VAE [48.151894340550385]
Explicit duration modeling is key to achieving robust and efficient alignment in text-to-speech synthesis (TTS).
We propose a new TTS framework with explicit duration modeling that incorporates duration as a discrete latent variable into TTS.
arXiv Detail & Related papers (2020-10-19T15:34:49Z)
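The sketch promised above for the warp-and-reconstruct pre-training idea: randomly time-warp a mel-spectrogram and train a model to recover the original. The warping function, toy model, and L1 loss below are illustrative assumptions, not the cited paper's recipe.

```python
# Illustrative sketch of warp-and-reconstruct pre-training: time-warp a
# mel-spectrogram and train a model to recover the original. All choices
# here are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_time_warp(mel, max_scale=0.2):
    """Resample the time axis of a (batch, time, n_mels) mel-spectrogram."""
    scale = 1.0 + (torch.rand(1).item() * 2 - 1) * max_scale
    warped = F.interpolate(mel.transpose(1, 2), scale_factor=scale,
                           mode="linear", align_corners=False)
    return warped.transpose(1, 2)

class DewarpModel(nn.Module):
    """Toy sequence model that maps warped frames back toward the original."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_mels)

    def forward(self, warped):
        h, _ = self.rnn(warped)
        return self.out(h)

model = DewarpModel()
mel = torch.randn(4, 200, 80)        # stand-in for untranscribed speech features
warped = random_time_warp(mel)
pred = model(warped)
# Align lengths before the reconstruction loss (warping changes the time axis).
pred = F.interpolate(pred.transpose(1, 2), size=mel.shape[1],
                     mode="linear", align_corners=False).transpose(1, 2)
loss = F.l1_loss(pred, mel)
```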