FastPitch: Parallel Text-to-speech with Pitch Prediction
- URL: http://arxiv.org/abs/2006.06873v2
- Date: Tue, 16 Feb 2021 14:23:15 GMT
- Title: FastPitch: Parallel Text-to-speech with Pitch Prediction
- Authors: Adrian Łańcucki
- Abstract summary: We present FastPitch, a fully-parallel text-to-speech model based on FastSpeech.
The model predicts pitch contours during inference. By altering these predictions, the generated speech can be more expressive.
- Score: 9.213700601337388
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present FastPitch, a fully-parallel text-to-speech model based on
FastSpeech, conditioned on fundamental frequency contours. The model predicts
pitch contours during inference. By altering these predictions, the generated
speech can be more expressive, better match the semantics of the utterance, and
in the end be more engaging to the listener. Uniformly increasing or decreasing
pitch with FastPitch generates speech that resembles the voluntary modulation
of voice. Conditioning on frequency contours improves the overall quality of
synthesized speech, making it comparable to the state of the art. It introduces
no overhead, and FastPitch retains the favorable, fully-parallel
Transformer architecture, with over 900x real-time factor for mel-spectrogram
synthesis of a typical utterance.
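As an illustration of the pitch-conditioning idea described in the abstract, the sketch below adds a FastPitch-style pitch predictor and pitch embedding on top of a generic parallel encoder output. It is a minimal sketch only: module names, layer sizes, and the normalized pitch units are assumptions for illustration, not the official NVIDIA FastPitch implementation. Shifting the predicted contour by a constant before embedding mirrors the uniform pitch increase or decrease mentioned above.

```python
# Minimal sketch of FastPitch-style pitch conditioning (illustrative only).
# One average pitch value per input symbol is predicted from the encoder
# output, embedded with a 1-D convolution, and added back to the encoder
# output before the decoder. At inference the predicted contour can be
# shifted uniformly to raise or lower the voice.
import torch
import torch.nn as nn


class PitchPredictor(nn.Module):
    """Predicts one pitch value per input symbol from encoder hidden states."""

    def __init__(self, d_model: int, hidden: int = 256, kernel: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
        )
        self.proj = nn.Linear(hidden, 1)

    def forward(self, enc_out: torch.Tensor) -> torch.Tensor:
        # enc_out: (batch, symbols, d_model) -> pitch: (batch, symbols)
        h = self.net(enc_out.transpose(1, 2)).transpose(1, 2)
        return self.proj(h).squeeze(-1)


class PitchConditioner(nn.Module):
    """Embeds a pitch contour and adds it to the encoder output."""

    def __init__(self, d_model: int, kernel: int = 3):
        super().__init__()
        self.embed = nn.Conv1d(1, d_model, kernel, padding=kernel // 2)

    def forward(self, enc_out: torch.Tensor, pitch: torch.Tensor) -> torch.Tensor:
        pitch_emb = self.embed(pitch.unsqueeze(1)).transpose(1, 2)
        return enc_out + pitch_emb


if __name__ == "__main__":
    batch, symbols, d_model = 2, 17, 384
    enc_out = torch.randn(batch, symbols, d_model)  # stand-in for the FFT encoder output

    predictor = PitchPredictor(d_model)
    conditioner = PitchConditioner(d_model)

    pitch = predictor(enc_out)          # predicted contour, one value per symbol
    shifted = pitch + 0.5               # uniform shift in normalized pitch units ("raise the voice")
    dec_in = conditioner(enc_out, shifted)  # pitch-conditioned input for the decoder
    print(dec_in.shape)                 # torch.Size([2, 17, 384])
```

In the full model the conditioned encoder output would then be length-regulated by predicted durations and passed to the spectrogram decoder; that stage is omitted here for brevity.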
Related papers
- DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech [43.45691362372739]
We propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS).
DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties.
Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
arXiv Detail & Related papers (2024-09-18T09:36:55Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- Incremental FastPitch: Chunk-based High Quality Text to Speech [0.7366405857677227]
We propose Incremental FastPitch, a novel FastPitch variant capable of incrementally producing high-quality Mel chunks.
Experimental results show that our proposal can produce speech quality comparable to the parallel FastPitch.
arXiv Detail & Related papers (2024-01-03T14:17:35Z)
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- STYLER: Style Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech [2.622482339911829]
STYLER is a novel expressive text-to-speech model with a parallelized architecture.
Our novel approach models noise from the audio using domain adversarial training and Residual Decoding, enabling style transfer without transferring noise.
arXiv Detail & Related papers (2021-03-17T07:11:09Z)
- Incremental Text to Speech for Neural Sequence-to-Sequence Models using Reinforcement Learning [60.20205278845412]
Modern approaches to text to speech require the entire input character sequence to be processed before any audio is synthesised.
This latency limits the suitability of such models for time-sensitive tasks like simultaneous interpretation.
We propose a reinforcement learning based framework to train an agent that decides when enough of the input has been read to begin synthesising audio.
arXiv Detail & Related papers (2020-08-07T11:48:05Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)