Regotron: Regularizing the Tacotron2 architecture via monotonic
alignment loss
- URL: http://arxiv.org/abs/2204.13437v1
- Date: Thu, 28 Apr 2022 12:08:53 GMT
- Title: Regotron: Regularizing the Tacotron2 architecture via monotonic
alignment loss
- Authors: Efthymios Georgiou, Kosmas Kritsis, Georgios Paraskevopoulos,
Athanasios Katsamanis, Vassilis Katsouros, Alexandros Potamianos
- Abstract summary: We introduce Regotron, a regularized version of Tacotron2, which aims to alleviate the training issues and at the same time produce monotonic alignments.
Our method augments the vanilla Tacotron2 objective function with an additional term, which penalizes non-monotonic alignments in the location-sensitive attention mechanism.
- Score: 71.30589161727967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent deep learning Text-to-Speech (TTS) systems have achieved impressive
performance by generating speech close to human parity. However, they suffer
from training stability issues as well as incorrect alignment of the
intermediate acoustic representation with the input text sequence. In this
work, we introduce Regotron, a regularized version of Tacotron2 which aims to
alleviate the training issues and at the same time produce monotonic
alignments. Our method augments the vanilla Tacotron2 objective function with
an additional term, which penalizes non-monotonic alignments in the
location-sensitive attention mechanism. By properly adjusting this
regularization term we show that the loss curves become smoother, and at the
same time Regotron consistently produces monotonic alignments in unseen
examples even at an early stage (13% of the total number of epochs) of its
training process, whereas the fully converged Tacotron2 fails to do so.
Moreover, our proposed regularization method has no additional computational
overhead, while reducing common TTS mistakes and achieving slightly improved
speech naturalness according to subjective mean opinion scores (MOS) collected
from 50 evaluators.
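The paper's exact objective is not reproduced here, but the core idea of penalizing non-monotonic alignments can be sketched as follows: for each decoder step, compute the expected (centroid) encoder position under the attention distribution, and penalize any backwards movement of that centroid between consecutive steps. The function name, the optional margin, and the hinge form of the penalty below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def monotonic_alignment_penalty(attention, margin=0.0):
    """Penalize non-monotonic attention alignments.

    attention: (T_dec, T_enc) array; each row is a distribution
        over encoder positions for one decoder step.
    Returns a scalar that is zero when the attention centroid
    moves monotonically forward, and grows with backward jumps.
    """
    positions = np.arange(attention.shape[1])
    # Expected encoder position attended to at each decoder step.
    centroids = attention @ positions
    # Positive exactly where the centroid moves backwards.
    backward_moves = np.maximum(0.0, centroids[:-1] - centroids[1:] + margin)
    return float(backward_moves.sum())

# A diagonal (perfectly monotonic) alignment incurs no penalty.
monotone = np.eye(4)
assert monotonic_alignment_penalty(monotone) == 0.0

# Swapping two decoder steps creates one backward jump.
jumpy = np.eye(4)[[0, 2, 1, 3]]
assert monotonic_alignment_penalty(jumpy) > 0.0
```

In a setup like the one the abstract describes, such a term would be added to the usual Tacotron2 reconstruction loss with a weighting coefficient, so tuning that coefficient trades off alignment monotonicity against spectrogram fidelity.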
Related papers
- Very Attentive Tacotron: Robust and Unbounded Length Generalization in Autoregressive Transformer-Based Text-to-Speech [9.982121768809854]
We introduce enhancements aimed at AR Transformer-based encoder-decoder text-to-speech systems.
Our approach uses an alignment mechanism to provide cross-attention operations with relative location information.
A system incorporating these improvements, which we call Very Attentive Tacotron, matches the naturalness and expressiveness of a baseline T5-based TTS system.
arXiv Detail & Related papers (2024-10-29T16:17:01Z)
- Efficient local linearity regularization to overcome catastrophic overfitting [59.463867084204566]
Catastrophic overfitting (CO) in single-step adversarial training results in abrupt drops in the adversarial test accuracy (even down to 0%).
We introduce a regularization term, called ELLE, to mitigate CO effectively and efficiently in classical AT evaluations.
arXiv Detail & Related papers (2024-01-21T22:55:26Z)
- PTP: Boosting Stability and Performance of Prompt Tuning with Perturbation-Based Regularizer [94.23904400441957]
We introduce perturbation-based regularizers, which can smooth the loss landscape, into prompt tuning.
We design two kinds of perturbation-based regularizers, including random-noise-based and adversarial-based.
Our new algorithms improve the state-of-the-art prompt tuning methods by 1.94% and 2.34% on SuperGLUE and FewGLUE benchmarks, respectively.
arXiv Detail & Related papers (2023-05-03T20:30:51Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving speedups of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- One TTS Alignment To Rule Them All [26.355019468082247]
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models.
In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework.
The framework combines the forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static prior.
arXiv Detail & Related papers (2021-08-23T23:45:48Z)
- Flavored Tacotron: Conditional Learning for Prosodic-linguistic Features [1.6286844497313562]
We propose a strategy for conditioning Tacotron-2 on two fundamental prosodic features in English -- syllable stress and pitch accent.
We show that jointly conditioned features at pre-encoder and intra-decoder stages result in prosodically natural synthesized speech.
arXiv Detail & Related papers (2021-04-08T20:50:15Z)
- Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling [29.24636059952458]
Non-Attentive Tacotron achieves a mean opinion score for naturalness of 4.41 on a 5-point scale, slightly outperforming Tacotron 2.
The duration predictor enables both utterance-wide and per-phoneme control of duration at inference time.
arXiv Detail & Related papers (2020-10-08T23:41:39Z)
- Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS [74.11899135025503]
We extend the Tacotron-based speech synthesis framework to explicitly model the prosodic phrase breaks.
We show that our proposed training scheme consistently improves the voice quality for both Chinese and Mongolian systems.
arXiv Detail & Related papers (2020-08-11T07:57:29Z) - Exact Hard Monotonic Attention for Character-Level Transduction [76.66797368985453]
We show that neural sequence-to-sequence models that use non-monotonic soft attention often outperform popular monotonic models.
We develop a hard attention sequence-to-sequence model that enforces strict monotonicity and learns a latent alignment jointly while learning to transduce.
arXiv Detail & Related papers (2019-05-15T17:51:09Z)
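As a rough illustration of the hard monotonic attention idea in the last item above (an attention head that only ever stays put or advances over the input), here is a minimal greedy decoding sketch. The `p_choose` scores, the fixed threshold, and the scan-forward rule are simplified assumptions for illustration; the actual model learns the latent alignment jointly during training rather than thresholding at inference.

```python
import numpy as np

def hard_monotonic_align(p_choose, threshold=0.5):
    """Greedy hard monotonic alignment.

    p_choose: (T_out, T_in) array, where p_choose[t, j] is the
        probability that output step t attends to input position j
        given that the head currently sits at j.
    The head only stays or moves right, so the returned indices
    are monotonically non-decreasing.
    Returns a list of attended input indices, one per output step.
    """
    T_out, T_in = p_choose.shape
    alignment = []
    j = 0
    for t in range(T_out):
        # Advance the head until the model "chooses" the current
        # position, or the input is exhausted.
        while j < T_in - 1 and p_choose[t, j] < threshold:
            j += 1
        alignment.append(j)
    return alignment

# Four output steps over three input positions: the head walks
# forward and then dwells on the final position.
p = np.array([[0.9, 0.1, 0.1],
              [0.2, 0.8, 0.1],
              [0.1, 0.2, 0.9],
              [0.1, 0.1, 0.9]])
assert hard_monotonic_align(p) == [0, 1, 2, 2]
```

By construction the alignment can never jump backwards, which is exactly the structural constraint that soft, non-monotonic attention lacks.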
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.