FastPitch: Parallel Text-to-speech with Pitch Prediction
- URL: http://arxiv.org/abs/2006.06873v2
- Date: Tue, 16 Feb 2021 14:23:15 GMT
- Title: FastPitch: Parallel Text-to-speech with Pitch Prediction
- Authors: Adrian Łańcucki
- Abstract summary: We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech.
The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive.
- Score: 9.213700601337388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present FastPitch, a fully-parallel text-to-speech model based on
FastSpeech, conditioned on fundamental frequency contours. The model predicts
pitch contours during inference. By altering these predictions, the generated
speech can be more expressive, better match the semantics of the utterance, and
in the end be more engaging to the listener. Uniformly increasing or decreasing
pitch with FastPitch generates speech that resembles the voluntary modulation
of voice. Conditioning on frequency contours improves the overall quality of
synthesized speech, making it comparable to the state of the art. It introduces
no overhead, and FastPitch retains the favorable, fully-parallel
Transformer architecture, with over 900x real-time factor for mel-spectrogram
synthesis of a typical utterance.
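As an illustration of the pitch-conditioning idea described in the abstract, the sketch below adds a FastPitch-style pitch predictor and pitch embedding on top of a generic parallel encoder output. It is a minimal sketch only: module names, layer sizes, and the normalized pitch units are assumptions for illustration, not the official NVIDIA FastPitch implementation. Shifting the predicted contour by a constant before embedding mirrors the uniform pitch increase or decrease mentioned above.

```python
# Minimal sketch of FastPitch-style pitch conditioning (illustrative only).
# One average pitch value per input symbol is predicted from the encoder
# output, embedded with a 1-D convolution, and added back to the encoder
# output before the decoder. At inference the predicted contour can be
# shifted uniformly to raise or lower the voice.
import torch
import torch.nn as nn


class PitchPredictor(nn.Module):
    """Predicts one pitch value per input symbol from encoder hidden states."""

    def __init__(self, d_model: int, hidden: int = 256, kernel: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, symbols, d_model) -> pitch: (batch, symbols)
        h = self.net(enc_out.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).squeeze(-1)


class PitchConditioner(nn.Module):
    """Embeds a pitch contour and adds it to the encoder output."""

    def __init__(self, d_model: int, kernel: int = 3):
        super().__init__()
        self.embed = nn.Conv1d(1, d_model, kernel, padding=kernel // 2)

    def forward(self, enc_out: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        pitch_emb = self.embed(pitch.unsqueeze(1)).transpose(1, 2)
        return enc_out + pitch_emb


if __name__ == "__main__":
    batch, symbols, d_model = 2, 17, 384
    enc_out = torch.randn(batch, symbols, d_model)  # stand-in for the FFT encoder output

    predictor = PitchPredictor(d_model)
    conditioner = PitchConditioner(d_model)

    pitch = predictor(enc_out)          # predicted contour, one value per symbol
    shifted = pitch + 0.5               # uniform shift in normalized pitch units ("raise the voice")
    dec_in = conditioner(enc_out, shifted)  # pitch-conditioned input for the decoder
    print(dec_in.shape)                 # torch.Size([2, 17, 384])
```

In the full model the conditioned encoder output would then be length-regulated by predicted durations and passed to the spectrogram decoder; that stage is omitted here for brevity.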
Related papers
- DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech [43.45691362372739]
We propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS).
DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties.
Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
arXiv Detail & Related papers (2024-09-18T09:36:55Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Incremental FastPitch: Chunk-based High Quality Text to Speech [0.7366405857677227]
We propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks.
Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch.
arXiv Detail & Related papers (2024-01-03T14:17:35Z)
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech [2.622482339911829]
STYLER is a novel expressive text-to-speech model with a parallelized architecture.
Our novel approach models noise from the audio using domain adversarial training and Residual Decoding, enabling style transfer without transferring noise.
arXiv Detail & Related papers (2021-03-17T07:11:09Z)
- Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides when enough of the input has been read to begin synthesising audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)