Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
- URL: http://arxiv.org/abs/2011.03568v2
- Date: Fri, 5 Feb 2021 19:07:32 GMT
- Title: Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis
- Authors: Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad,
Diederik P. Kingma
- Abstract summary: We describe a sequence-to-sequence neural network which directly generates speech waveforms from text inputs.
The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop.
Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system.
- Score: 25.234945748885348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We describe a sequence-to-sequence neural network which directly generates
speech waveforms from text inputs. The architecture extends the Tacotron model
by incorporating a normalizing flow into the autoregressive decoder loop.
Output waveforms are modeled as a sequence of non-overlapping fixed-length
blocks, each one containing hundreds of samples. The interdependencies of
waveform samples within each block are modeled using the normalizing flow,
enabling parallel training and synthesis. Longer-term dependencies are handled
autoregressively by conditioning each flow on preceding blocks. This model can
be optimized directly with maximum likelihood, without using intermediate,
hand-designed features or additional loss terms. Contemporary state-of-the-art
text-to-speech (TTS) systems use a cascade of separately learned models: one
(such as Tacotron) which generates intermediate features (such as spectrograms)
from text, followed by a vocoder (such as WaveRNN) which generates waveform
samples from the intermediate features. The proposed system, in contrast, does
not use a fixed intermediate representation, and learns all parameters
end-to-end. Experiments show that the proposed model generates speech with
quality approaching a state-of-the-art neural TTS system, with significantly
improved generation speed.
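As a concrete illustration of the block-autoregressive decoding described above, the sketch below runs the sampling loop: for each fixed-length block, latent noise is pushed through an invertible transform whose parameters depend on the blocks generated so far. The block size, the recurrent conditioning, and the elementwise affine "flow" are illustrative placeholders under assumed shapes, not the actual Wave-Tacotron architecture.

```python
# Minimal sketch (not the paper's architecture): block-autoregressive waveform
# sampling, with an invertible per-block affine transform standing in for the
# normalizing flow. All sizes and networks below are illustrative assumptions.
import numpy as np

BLOCK = 256        # samples per non-overlapping block (the paper uses "hundreds")
N_BLOCKS = 8       # number of blocks to generate
STATE_DIM = 64     # size of the autoregressive conditioning state

rng = np.random.default_rng(0)

# Toy conditioning network: summarizes previously generated blocks into a state.
W_in = rng.normal(scale=0.01, size=(BLOCK, STATE_DIM))
W_rec = rng.normal(scale=0.01, size=(STATE_DIM, STATE_DIM))

# Toy "flow": an elementwise affine map whose scale/shift depend on the state.
W_scale = rng.normal(scale=0.01, size=(STATE_DIM, BLOCK))
W_shift = rng.normal(scale=0.01, size=(STATE_DIM, BLOCK))

def sample_waveform():
    state = np.zeros(STATE_DIM)
    blocks = []
    for _ in range(N_BLOCKS):
        z = rng.normal(size=BLOCK)            # latent noise for this block
        scale = np.exp(state @ W_scale)       # invertible: x = z * scale + shift
        shift = state @ W_shift
        x = z * scale + shift                 # all samples in the block in parallel
        blocks.append(x)
        # Autoregressive conditioning: the next block sees the one just generated.
        state = np.tanh(x @ W_in + state @ W_rec)
    return np.concatenate(blocks)

audio = sample_waveform()
print(audio.shape)  # (BLOCK * N_BLOCKS,) = (2048,)
```

During training the same transform would be run in the inverse direction, mapping observed waveform blocks back to latents so the likelihood can be evaluated exactly, with the samples inside each block handled in parallel.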
Related papers
- Utilizing Neural Transducers for Two-Stage Text-to-Speech via Semantic Token Prediction [15.72317249204736]
We propose a novel text-to-speech (TTS) framework centered around a neural transducer.
Our approach divides the whole TTS pipeline into semantic-level sequence-to-sequence (seq2seq) modeling and fine-grained acoustic modeling stages.
Our experimental results on zero-shot adaptive TTS demonstrate that our model surpasses the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-01-03T02:03:36Z)
- DiffAR: Denoising Diffusion Autoregressive Model for Raw Speech Waveform Generation [25.968115316199246]
This work proposes a diffusion probabilistic end-to-end model for generating a raw speech waveform.
Our model is autoregressive, generating overlapping frames sequentially, where each frame is conditioned on a portion of the previously generated one.
Experiments show that the proposed model generates speech with superior quality compared with other state-of-the-art neural speech generation systems.
arXiv Detail & Related papers (2023-10-02T17:42:22Z)
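As a rough illustration of the overlapping-frame autoregression summarized above for DiffAR, the sketch below generates frames sequentially and conditions each one on the tail of the previous frame; `denoise()` is only a placeholder for the diffusion sampler, and the frame and overlap lengths are assumptions.

```python
# Illustrative sketch (not DiffAR itself): sequential generation of overlapping
# frames, each conditioned on the tail of the previously generated one.
import numpy as np

FRAME = 400      # frame length in samples (assumed)
OVERLAP = 100    # samples shared with the previous frame (assumed)
rng = np.random.default_rng(0)

def denoise(noise, context):
    # Placeholder for the reverse diffusion process conditioned on `context`.
    return 0.5 * noise + 0.5 * np.resize(context, noise.shape)

def generate(n_frames):
    audio = np.zeros(0)
    prev_tail = np.zeros(OVERLAP)
    for _ in range(n_frames):
        frame = denoise(rng.normal(size=FRAME), prev_tail)
        if audio.size:
            # Cross-fade the shared region with the end of the previous frame.
            fade = np.linspace(0.0, 1.0, OVERLAP)
            audio[-OVERLAP:] = (1 - fade) * audio[-OVERLAP:] + fade * frame[:OVERLAP]
            audio = np.concatenate([audio, frame[OVERLAP:]])
        else:
            audio = frame
        prev_tail = audio[-OVERLAP:]
    return audio

print(generate(5).shape)  # 400 + 4 * 300 = (1600,)
```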
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
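The patch-based processing mentioned in the LinDiff summary above can be pictured as follows: the waveform is partitioned into small fixed-size patches, processed patch-wise, and reassembled. The patch size and the per-patch operation are placeholders, not LinDiff's network.

```python
# Minimal sketch of patch-based processing of a waveform; the patch-wise
# transform is a placeholder and all sizes are assumptions.
import numpy as np

def to_patches(x, patch):
    pad = (-len(x)) % patch                    # right-pad so the length divides evenly
    x = np.pad(x, (0, pad))
    return x.reshape(-1, patch), pad

def from_patches(patches, pad):
    x = patches.reshape(-1)
    return x[:len(x) - pad] if pad else x

signal = np.random.default_rng(0).normal(size=22050)   # 1 s at 22.05 kHz (assumed)
patches, pad = to_patches(signal, patch=64)
processed = patches * 0.9                               # placeholder patch-wise op
restored = from_patches(processed, pad)
print(patches.shape, restored.shape)                    # (345, 64) (22050,)
```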
- Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis [19.422230767803246]
We propose Period VITS, a novel end-to-end text-to-speech model that incorporates an explicit periodicity generator.
In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text.
From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch.
arXiv Detail & Related papers (2022-10-28T07:52:30Z)
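A rough sketch of the idea behind the periodicity generator described above for Period VITS: frame-level pitch and voicing flags are upsampled to the sample rate and turned into a sinusoidal source, with low-level noise in unvoiced regions. The frame hop, sample rate, and noise level are assumptions, and the real model learns this conditioning rather than hard-coding it.

```python
# Illustrative sample-level sinusoidal source from frame-level pitch and voicing.
import numpy as np

SR = 22050    # sample rate (assumed)
HOP = 256     # samples per frame (assumed)

def sinusoidal_source(f0_frames, voiced_frames):
    # Upsample frame-level F0 and voicing to sample level by repetition.
    f0 = np.repeat(f0_frames, HOP)
    voiced = np.repeat(voiced_frames, HOP).astype(float)
    # Integrate instantaneous frequency to get phase, then take the sine.
    phase = 2.0 * np.pi * np.cumsum(f0 / SR)
    source = np.sin(phase) * voiced
    # Unvoiced regions get low-level noise instead of a sinusoid.
    noise = 0.03 * np.random.default_rng(0).normal(size=source.shape)
    return source + noise * (1.0 - voiced)

f0 = np.array([220.0, 220.0, 230.0, 0.0, 0.0, 210.0])   # Hz per frame (toy input)
voiced = np.array([1, 1, 1, 0, 0, 1])
print(sinusoidal_source(f0, voiced).shape)               # (1536,)
```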
- Adaptive re-calibration of channel-wise features for Adversarial Audio Classification [0.0]
We propose a recalibration of features using attention feature fusion for synthetic speech detection.
We compare its performance against different detection methods including End2End models and Resnet-based models.
We also demonstrate that combining linear frequency cepstral coefficients (LFCC) and mel-frequency cepstral coefficients (MFCC) with the attentional feature fusion technique creates better input feature representations.
arXiv Detail & Related papers (2022-10-21T04:21:56Z)
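As a toy sketch of fusing two cepstral feature streams (stand-ins for LFCC and MFCC) in the spirit of the attentional feature fusion mentioned above, the example below computes a per-frame, per-dimension gate from both streams and mixes them; the gating weights are random and purely illustrative, not the paper's fusion module.

```python
# Toy gated fusion of two feature streams; shapes and weights are assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, D = 200, 40                       # frames x coefficients (assumed)
lfcc = rng.normal(size=(T, D))       # stand-in for LFCC features
mfcc = rng.normal(size=(T, D))       # stand-in for MFCC features

# Sigmoid gate computed from the concatenation of both streams.
W = rng.normal(scale=0.1, size=(2 * D, D))
gate = 1.0 / (1.0 + np.exp(-np.concatenate([lfcc, mfcc], axis=1) @ W))

fused = gate * lfcc + (1.0 - gate) * mfcc   # convex combination per frame and dim
print(fused.shape)                           # (200, 40)
```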
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
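One common way to make duration-based alignment differentiable is to turn predicted per-token durations into soft attention weights over the input tokens, as sketched below. This illustrates the general idea under assumed shapes and a Gaussian kernel; it is not the specific method of the paper above.

```python
# Soft, differentiable upsampling of token features by predicted durations.
import numpy as np

def soft_upsample(token_feats, durations, sigma=1.0):
    # Token centers on the output frame axis, from cumulative durations.
    ends = np.cumsum(durations)
    centers = ends - durations / 2.0
    n_frames = int(round(ends[-1]))
    t = np.arange(n_frames)[:, None]                  # output frame indices
    logits = -((t - centers[None, :]) ** 2) / (2.0 * sigma ** 2)
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)                 # soft alignment weights
    return w @ token_feats                            # (n_frames, feature_dim)

tokens = np.random.default_rng(0).normal(size=(5, 8))   # 5 tokens, 8-dim features
durations = np.array([3.0, 2.5, 4.0, 1.5, 3.0])         # predicted frames per token
print(soft_upsample(tokens, durations).shape)            # (14, 8)
```

Because every operation above is smooth in the durations, gradients can flow back into the duration predictor, which is the property such methods exploit.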
- Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging [48.7576911714538]
This paper experiments with transfer learning and adaptation of a Tacotron2 text-to-speech model to improve articulatory-to-acoustic mapping.
We use a multi-speaker pre-trained Tacotron2 TTS model and a pre-trained WaveGlow neural vocoder.
arXiv Detail & Related papers (2021-07-26T09:19:20Z)
- WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z)
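A generic sketch of the iterative-refinement family that WaveGrad 2 belongs to: start from noise and repeatedly refine the whole waveform in parallel, conditioned on text-derived features. `refine_step()` is a placeholder, not WaveGrad 2's network or noise schedule, and all sizes are assumptions.

```python
# Non-autoregressive iterative refinement: every sample is updated in parallel
# at each step. The refinement function below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_SAMPLES = 16000        # one second at 16 kHz (assumed)
N_ITERS = 6              # number of refinement iterations (assumed)

cond = rng.normal(size=128)   # stand-in for text/phoneme conditioning features

def refine_step(x, cond, step):
    # Placeholder refinement: pull the estimate toward a conditioning-dependent signal.
    target = np.sin(2 * np.pi * 220 * np.arange(x.size) / 16000) * np.tanh(cond[0])
    alpha = (step + 1) / N_ITERS
    return (1 - alpha) * x + alpha * target

x = rng.normal(size=N_SAMPLES)       # start from pure noise
for i in range(N_ITERS):
    x = refine_step(x, cond, i)
print(x.shape)                        # (16000,)
```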
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.