Incremental FastPitch: Chunk-based High Quality Text to Speech
- URL: http://arxiv.org/abs/2401.01755v1
- Date: Wed, 3 Jan 2024 14:17:35 GMT
- Title: Incremental FastPitch: Chunk-based High Quality Text to Speech
- Authors: Muyang Du, Chuan Liu, Junjie Lai
- Abstract summary: We propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks.
Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch.
- Score: 0.7366405857677227
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Parallel text-to-speech models have been widely applied for real-time speech
synthesis, and they offer more controllability and a much faster synthesis
process compared with conventional auto-regressive models. Although parallel
models have benefits in many aspects, they are naturally unfit for
incremental synthesis due to their fully parallel architectures, such as the
Transformer. In this work, we propose Incremental FastPitch, a novel FastPitch
variant capable of incrementally producing high-quality Mel chunks by improving
the architecture with chunk-based FFT blocks, training with receptive-field
constrained chunk attention masks, and inference with fixed size past model
states. Experimental results show that our proposal can produce speech quality
comparable to that of parallel FastPitch, with significantly lower latency that
allows even faster response times for real-time speech applications.
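The abstract describes training with receptive-field constrained chunk attention masks, which bound how far back each position may attend so that inference can keep a fixed-size cache of past states. As a rough illustration of that idea (not the authors' exact implementation; the function name and the `chunk_size`/`past_chunks` parameters are assumptions), such a mask might be built as follows:

```python
import numpy as np

def chunk_attention_mask(seq_len: int, chunk_size: int, past_chunks: int) -> np.ndarray:
    """Boolean mask where entry (i, j) is True if query position i may attend to key j.

    Each position sees every position in its own chunk plus at most
    `past_chunks` preceding chunks, so the receptive field (and hence the
    amount of past state kept at inference time) stays fixed.
    """
    chunk_ids = np.arange(seq_len) // chunk_size       # chunk index of each position
    diff = chunk_ids[:, None] - chunk_ids[None, :]     # query chunk minus key chunk
    return (diff >= 0) & (diff <= past_chunks)         # no future chunks, bounded past
```

With `chunk_size=2` and `past_chunks=1`, a position in chunk 2 attends to chunks 1 and 2 but not chunk 0, which is what makes a fixed-size past-state cache sufficient.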
Related papers
- Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding [11.128340782271305]
We introduce VADUSA, one of the first approaches to accelerate auto-regressive TTS through speculative decoding.
Our results show that VADUSA not only significantly improves inference speed but also enhances performance by incorporating draft heads to predict future speech content auto-regressively.
arXiv Detail & Related papers (2024-10-29T11:12:01Z)
- DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech [43.45691362372739]
We propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS)
DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties.
Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
arXiv Detail & Related papers (2024-09-18T09:36:55Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X)
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic sequences between input and output.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, our analysis shows that SRU++ can surpass Conformer by a large margin on long-form speech input.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech [2.622482339911829]
STYLER is a novel expressive text-to-speech model with parallelized architecture.
Our novel noise modeling approach, which extracts noise from audio via domain adversarial training and Residual Decoding, enables style transfer without transferring noise.
arXiv Detail & Related papers (2021-03-17T07:11:09Z)
- FastPitch: Parallel Text-to-speech with Pitch Prediction [9.213700601337388]
We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech.
The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive.
arXiv Detail & Related papers (2020-06-11T23:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.