Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis
Including Unsupervised Duration Modeling
- URL: http://arxiv.org/abs/2010.04301v4
- Date: Tue, 11 May 2021 04:12:14 GMT
- Title: Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis
Including Unsupervised Duration Modeling
- Authors: Jonathan Shen, Ye Jia, Mike Chrzanowski, Yu Zhang, Isaac Elias, Heiga
Zen, Yonghui Wu
- Abstract summary: Non-Attentive Tacotron achieves a 5-scale mean opinion score for naturalness of 4.41, slightly outperforming Tacotron 2.
The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time.
- Score: 29.24636059952458
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: This paper presents Non-Attentive Tacotron based on the Tacotron 2
text-to-speech model, replacing the attention mechanism with an explicit
duration predictor. This improves robustness significantly as measured by
unaligned duration ratio and word deletion rate, two metrics introduced in this
paper for large-scale robustness evaluation using a pre-trained speech
recognition model. With the use of Gaussian upsampling, Non-Attentive Tacotron
achieves a 5-scale mean opinion score for naturalness of 4.41, slightly
outperforming Tacotron 2. The duration predictor enables both utterance-wide
and per-phoneme control of duration at inference time. When accurate target
durations are scarce or unavailable in the training data, we propose a method
using a fine-grained variational auto-encoder to train the duration predictor
in a semi-supervised or unsupervised manner, with results almost as good as
supervised training.
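Gaussian upsampling is concrete enough to sketch: instead of repeating each encoder state a hard integer number of times, every output frame receives a soft mixture of all phoneme states, weighted by a Gaussian centered at each phoneme's midpoint (c_i = sum_{j<=i} d_j - d_i/2). The NumPy sketch below is a minimal reading of that formula, and also illustrates the two control knobs from the abstract: a global pace factor for utterance-wide control, and direct edits to the duration vector for per-phoneme control. The sigma values, frame grid, and variable names are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

def gaussian_upsample(H, durations, sigmas, pace=1.0):
    """Soft upsampling of encoder states H (N x D) to output frames.

    Each phoneme i is centered at c_i = sum_{j<=i} d_j - d_i / 2 and
    contributes to frame t with a Gaussian weight, normalized over
    phonemes. `pace` scales every duration for utterance-wide control
    (pace > 1.0 slows the utterance down).
    """
    d = durations * pace
    ends = np.cumsum(d)
    centers = ends - d / 2.0
    T = int(round(ends[-1]))                    # total output frames
    t = np.arange(T, dtype=float)[:, None]      # (T, 1) frame grid
    W = np.exp(-0.5 * ((t - centers[None, :]) / sigmas[None, :]) ** 2)
    W /= W.sum(axis=1, keepdims=True)           # normalize over phonemes
    return W @ H                                # (T, D) upsampled states

# Per-phoneme control: double only the third phoneme's duration.
H = np.random.randn(5, 8)                       # 5 phonemes, 8-dim states
d = np.array([4.0, 6.0, 5.0, 7.0, 4.0])         # predicted durations (frames)
d[2] *= 2.0
sig = np.full(5, 2.0)                           # illustrative range parameters
U = gaussian_upsample(H, d, sig, pace=1.0)
print(U.shape)                                  # (31, 8)
```

In the model itself, the durations and range parameters come from learned predictors; at inference time they can be scaled exactly as above.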
Related papers
- UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation [93.38604803625294]
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG).
We use Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks.
UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-10-03T17:39:38Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Improving Adaptive Conformal Prediction Using Self-Supervised Learning [72.2614468437919]
We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate nonconformity scores.
We empirically demonstrate the benefit of the additional information using both synthetic and real data on the efficiency (width), deficit, and excess of conformal prediction intervals.
arXiv Detail & Related papers (2023-02-23T18:57:14Z)
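A natural way to cash out the summary above is normalized split conformal prediction: the auxiliary model's self-supervised error acts as a difficulty estimate that rescales the nonconformity score, widening intervals where the pretext task is hard. The sketch below is that generic recipe under this assumption; `diff_cal` and `diff_test` stand in for the paper's exact feature construction, which is not reproduced here.

```python
import numpy as np

def normalized_conformal(pred_test, resid_cal, diff_cal, diff_test, alpha=0.1):
    """Split conformal intervals with difficulty-normalized scores.

    resid_cal: |y - f(x)| on a held-out calibration set
    diff_cal, diff_test: difficulty estimates, e.g. the error of an
        auxiliary self-supervised model (a stand-in here)
    """
    eps = 1e-8
    scores = resid_cal / (diff_cal + eps)       # normalized nonconformity
    n = len(scores)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level)              # calibrated quantile
    half = q * (diff_test + eps)                # wider when difficulty is high
    return pred_test - half, pred_test + half
```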
- Robust Time Series Dissimilarity Measure for Outlier Detection and Periodicity Detection [16.223509730658513]
We propose a novel time series dissimilarity measure named RobustDTW to reduce the effects of noise and outliers.
Specifically, RobustDTW estimates the trend and optimizes the time warp in an alternating manner, using our designed temporal graph trend filtering.
Experiments on real-world datasets demonstrate the superior performance of RobustDTW compared to DTW variants in both outlier time series detection and periodicity detection.
arXiv Detail & Related papers (2022-06-07T00:49:16Z)
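As a loose illustration of the RobustDTW idea above, the sketch below denoises each series with a trend estimate before evaluating a classic DTW distance. A moving average is a stand-in for the paper's temporal graph trend filtering, and the alternation between trend estimation and warp optimization is only approximated, so treat this as the shape of the algorithm rather than the algorithm itself.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(nm) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def trend(x, k=5):
    """Moving-average trend (stand-in for temporal graph trend filtering)."""
    pad = k // 2
    return np.convolve(np.pad(x, pad, mode="edge"), np.ones(k) / k, mode="valid")

def robust_dtw(a, b, iters=2):
    """Re-estimate trends, then warp the denoised trends."""
    ta, tb = np.asarray(a, float), np.asarray(b, float)
    for _ in range(iters):
        ta, tb = trend(ta), trend(tb)   # suppress noise and outliers
    return dtw_distance(ta, tb)
```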
- Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection [3.884530687475798]
A streaming BERT-based sequence tagging model is capable of detecting disfluencies in real-time.
The model attains state-of-the-art latency and stability scores when compared with recent work on incremental disfluency detection.
arXiv Detail & Related papers (2022-05-02T02:13:24Z)
- Regotron: Regularizing the Tacotron2 architecture via monotonic alignment loss [71.30589161727967]
We introduce Regotron, a regularized version of Tacotron2, which aims to alleviate Tacotron2's training issues while producing monotonic alignments.
Our method augments the vanilla Tacotron2 objective function with an additional term, which penalizes non-monotonic alignments in the location-sensitive attention mechanism.
arXiv Detail & Related papers (2022-04-28T12:08:53Z)
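The Regotron penalty above lends itself to a compact sketch: take the expected encoder position under each decoder step's attention distribution and penalize any backward movement. The exact weighting and schedule in Regotron may differ; this is a minimal PyTorch rendering of "penalize non-monotonic alignments".

```python
import torch

def monotonic_alignment_loss(attn):
    """Penalty on non-monotonic attention, in the spirit of Regotron.

    attn: (batch, dec_steps, enc_steps) weights from the
    location-sensitive attention. The expected input position should
    not move backwards as decoding proceeds; decreases are penalized.
    """
    pos = torch.arange(attn.size(-1), dtype=attn.dtype, device=attn.device)
    centers = (attn * pos).sum(dim=-1)          # expected position per step
    backward = (centers[:, :-1] - centers[:, 1:]).clamp(min=0.0)
    return backward.mean()

# total_loss = tacotron2_loss + lambda_mono * monotonic_alignment_loss(attn)
```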
- Consistency Regularization for Certified Robustness of Smoothed Classifiers [89.72878906950208]
A recent technique of randomized smoothing has shown that the worst-case $\ell_2$-robustness can be transformed into the average-case robustness.
We found that the trade-off between accuracy and certified robustness of smoothed classifiers can be greatly controlled by simply regularizing the prediction consistency over noise.
arXiv Detail & Related papers (2020-06-07T06:57:43Z)
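The consistency regularizer described above can be sketched as follows: draw a few Gaussian-noise copies of each input and penalize disagreement among the classifier's predictions on them. The paper's precise divergence and weighting are not reproduced; this minimal PyTorch version uses the KL divergence from the mean prediction.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, noise_sd=0.25, m=2):
    """Regularize prediction consistency across Gaussian-noise copies.

    A smoothed classifier's certified robustness improves when the
    base model's predictions on noisy copies of the same input agree;
    here disagreement is measured against the mean prediction.
    """
    logps = [F.log_softmax(model(x + noise_sd * torch.randn_like(x)), dim=1)
             for _ in range(m)]
    mean_p = torch.stack([lp.exp() for lp in logps]).mean(dim=0)
    # KL(mean || p_i) for each copy, averaged over copies and batch
    return sum(F.kl_div(lp, mean_p, reduction="batchmean") for lp in logps) / m

# total_loss = cross_entropy_on_noisy_input + lambda_con * consistency_loss(model, x)
```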
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)
- Spike-Triggered Non-Autoregressive Transformer for End-to-End Speech Recognition [66.47000813920617]
We propose a spike-triggered non-autoregressive transformer model for end-to-end speech recognition.
The proposed model can accurately predict the length of the target sequence and achieve a competitive performance.
The model even achieves a real-time factor of 0.0056, faster than all mainstream speech recognition models.
arXiv Detail & Related papers (2020-05-16T08:27:20Z)
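One plausible reading of "spike-triggered" is via CTC posteriors: spikes of non-blank probability mark token boundaries, so the spike count predicts the target-sequence length and the spiked frames can trigger the non-autoregressive decoder. The sketch below implements that reading; the threshold and edge detection are assumptions, not the paper's exact recipe.

```python
import numpy as np

def spike_positions(ctc_probs, blank=0, threshold=0.5):
    """Locate CTC spikes; the spike count predicts the target length.

    ctc_probs: (frames, vocab) CTC posteriors. A frame whose non-blank
    mass crosses `threshold` fires a spike; counting rising edges keeps
    a wide spike from being counted twice.
    """
    fired = (1.0 - ctc_probs[:, blank]) > threshold
    rising = fired & ~np.concatenate(([False], fired[:-1]))
    return np.flatnonzero(rising)

# Toy check: three synthetic spikes among blank-dominated frames.
probs = np.zeros((12, 5))
probs[:, 0] = 1.0                               # all blank by default
for t in (2, 6, 9):
    probs[t] = [0.1, 0.9, 0.0, 0.0, 0.0]
print(len(spike_positions(probs)))              # -> 3
```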
- Long-Short Term Spatiotemporal Tensor Prediction for Passenger Flow Profile [15.875569404476495]
We focus on a tensor-based prediction and propose several practical techniques to improve prediction.
For long-term prediction specifically, we propose the "Tensor Decomposition + 2-Dimensional Auto-Regressive Moving Average (2D-ARMA)" model.
For short-term prediction, we propose to conduct tensor completion based on tensor clustering to avoid oversimplifying and ensure accuracy.
arXiv Detail & Related papers (2020-04-23T08:30:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.