Neural HMMs are all you need (for high-quality attention-free TTS)
- URL: http://arxiv.org/abs/2108.13320v1
- Date: Mon, 30 Aug 2021 15:38:00 GMT
- Title: Neural HMMs are all you need (for high-quality attention-free TTS)
- Authors: Shivam Mehta, Éva Székely, Jonas Beskow, Gustav Eje Henter
- Abstract summary: We discuss how to combine innovations from both classical and contemporary TTS for best results.
The final system is smaller and simpler than Tacotron 2 and learns to align and speak with fewer iterations.
Unlike Tacotron 2, it also allows easy control over speaking rate.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural sequence-to-sequence TTS has demonstrated significantly better output
quality than classical statistical parametric speech synthesis using HMMs.
However, the new paradigm is not probabilistic and the use of non-monotonic
attention both increases training time and introduces "babbling" failure modes
that are unacceptable in production. In this paper, we demonstrate that the old
and new paradigms can be combined to obtain the advantages of both worlds, by
replacing the attention in Tacotron 2 with an autoregressive left-right no-skip
hidden-Markov model defined by a neural network. This leads to an HMM-based
neural TTS model with monotonic alignment, trained to maximise the full
sequence likelihood without approximations. We discuss how to combine
innovations from both classical and contemporary TTS for best results. The
final system is smaller and simpler than Tacotron 2 and learns to align and
speak with fewer iterations, while achieving the same speech naturalness.
Unlike Tacotron 2, it also allows easy control over speaking rate. Audio
examples and code are available at https://shivammehta007.github.io/Neural-HMM/
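To make the core idea concrete, below is a minimal numpy sketch of the forward recursion that a left-right no-skip HMM uses to compute the exact full-sequence likelihood. In the actual model the emission and transition probabilities are produced by a neural network conditioned autoregressively on previously generated frames; here they are placeholder arrays and all names are illustrative.

```python
import numpy as np

def log_forward(log_emit, log_stay, log_next):
    """Exact log-likelihood of a left-right no-skip HMM.

    log_emit: (T, N) log p(x_t | state n); in the paper these come from
              a neural net conditioned autoregressively on past frames.
    log_stay: (T, N) log-probability of a self-transition.
    log_next: (T, N) log-probability of advancing one state.
    """
    T, N = log_emit.shape
    log_alpha = np.full(N, -np.inf)
    log_alpha[0] = log_emit[0, 0]                  # must start in state 0
    for t in range(1, T):
        stay = log_alpha + log_stay[t - 1]
        advance = np.full(N, -np.inf)
        advance[1:] = log_alpha[:-1] + log_next[t - 1, :-1]
        log_alpha = np.logaddexp(stay, advance) + log_emit[t]
    return log_alpha[-1]                           # must end in state N-1

# Toy usage with random placeholder probabilities:
rng = np.random.default_rng(0)
T, N = 100, 20
p_next = rng.uniform(0.05, 0.5, size=(T, N))       # P(advance)
ll = log_forward(np.log(rng.uniform(0.01, 1.0, size=(T, N))),
                 np.log1p(-p_next), np.log(p_next))
```

One natural route to the speaking-rate control mentioned above falls out of this parameterisation: biasing the stay/advance probabilities at synthesis time stretches or compresses the expected state durations.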
Related papers
- TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers [8.485772660435464]
We introduce a new variant of neural LM, namely TacoLM.
TacoLM introduces a gated attention mechanism to improve the training and inference efficiency.
TacoLM achieves better word error rate, speaker similarity, and mean opinion score than VALL-E, with 90% fewer parameters and a 5.2x speed-up.
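The summary names a gated attention mechanism without specifying it; as a purely illustrative numpy sketch (not TacoLM's actual design), one common gating pattern modulates the attention output with a learned sigmoid gate. `Wg` and `bg` are hypothetical gate parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_attention(Q, K, V, Wg, bg):
    """Scaled dot-product attention with a sigmoid output gate.

    The gate lets the model cheaply suppress attention outputs per
    position, one generic route to efficiency; TacoLM may differ.
    """
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d)) @ V       # (Tq, dv)
    gate = 1.0 / (1.0 + np.exp(-(Q @ Wg + bg)))    # (Tq, dv) in (0, 1)
    return gate * attn
```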
arXiv Detail & Related papers (2024-06-22T06:39:52Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- Sparse Modular Activation for Efficient Sequence Modeling [94.11125833685583]
Recent models combining Linear State Space Models with self-attention mechanisms have demonstrated impressive results across a range of sequence modeling tasks.
Current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs.
We introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely activate sub-modules for sequence elements in a differentiable manner.
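Since the blurb only names the mechanism, here is a generic sketch of differentiable sparse sub-module activation (SMA's concrete design differs): a soft per-token gate keeps training differentiable, while a hard mask actually skips computation at inference. All names are hypothetical.

```python
import numpy as np

def sparse_module(x, gate_w, module, threshold=0.5, hard=False):
    """Sparsely activate `module` for individual sequence elements.

    x: (T, d) token representations; gate_w: (d,) gate weights.
    Soft mode blends module output and identity per token; hard mode
    runs the module only on tokens whose gate exceeds the threshold.
    """
    g = 1.0 / (1.0 + np.exp(-(x @ gate_w)))         # (T,) gate probs
    if hard:
        out = x.copy()
        active = g > threshold                       # boolean token mask
        if active.any():
            out[active] = module(x[active])          # compute only where needed
        return out
    return g[:, None] * module(x) + (1 - g)[:, None] * x
```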
arXiv Detail & Related papers (2023-06-19T23:10:02Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text-to-speech synthesis (TTS).
Specifically, we train a neural language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model.
VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech.
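As a toy illustration of this "TTS as language modelling" recipe (shapes and names are ours, not VALL-E's), the training objective is ordinary next-token cross-entropy over discrete codec token ids appended to the phoneme prompt:

```python
import numpy as np

def lm_nll(logits, codes):
    """Mean next-token negative log-likelihood over codec tokens.

    logits: (T, V) predictions for the T codec-token positions,
            conditioned on phonemes and previous codes (model omitted).
    codes:  (T,)   discrete audio-codec token ids.
    """
    z = logits - logits.max(axis=1, keepdims=True)          # stable log-softmax
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(codes)), codes].mean()
```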
arXiv Detail & Related papers (2023-01-05T15:37:15Z)
- MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [70.74377373885645]
We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously.
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design.
Experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks.
arXiv Detail & Related papers (2022-12-19T14:11:52Z)
- OverFlow: Putting flows on top of neural transducers for better TTS [9.346907121576258]
Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech.
In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics.
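A one-layer sketch of the idea (placeholder functions, not OverFlow's architecture): an invertible coupling transform reshapes a simple base density, such as a neural HMM's Gaussian emissions, into a non-Gaussian one via the change-of-variables formula. Assumes an even-dimensional input.

```python
import numpy as np

def coupling_logdensity(x, shift_fn, scale_fn, base_logpdf):
    """Log-density through one affine coupling layer (RealNVP-style).

    Split x into halves (x1, x2); map x2 -> (x2 - shift(x1)) / scale(x1).
    Change of variables: log p(x) = log p_base(z) - sum log scale(x1).
    """
    d = x.shape[-1] // 2
    x1, x2 = x[:d], x[d:]
    s = scale_fn(x1)                      # positive scales
    z = np.concatenate([x1, (x2 - shift_fn(x1)) / s])
    return base_logpdf(z) - np.log(s).sum()

# Toy usage with stand-in networks and a standard-normal base density:
x = np.random.default_rng(0).normal(size=8)
lp = coupling_logdensity(
    x,
    shift_fn=lambda h: np.tanh(h),
    scale_fn=lambda h: np.exp(0.1 * h),
    base_logpdf=lambda z: -0.5 * (z @ z + len(z) * np.log(2 * np.pi)),
)
```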
arXiv Detail & Related papers (2022-11-13T12:53:05Z)
- Efficiently Trained Low-Resource Mongolian Text-to-Speech System Based On FullConv-TTS [0.0]
We propose a new text-to-speech system based on deep convolutional neural networks that does not employ any RNN components (recurrent units).
At the same time, we improve the generality and robustness of our model through a series of data augmentation methods such as Time Warping, Frequency Mask, and Time Mask (the masking operations are sketched below).
The final experimental results show that the TTS model using only the CNN component can reduce the training time compared to the classic TTS models such as Tacotron.
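For concreteness, a minimal sketch of the frequency- and time-masking augmentations (SpecAugment-style; time warping omitted, and parameter ranges are illustrative rather than the paper's settings):

```python
import numpy as np

def augment_spectrogram(spec, rng, max_f=8, max_t=20):
    """Apply one frequency mask and one time mask to a (T, F) spectrogram.

    Illustrative parameters; assumes T > max_t and F > max_f.
    """
    spec = spec.copy()
    T, F = spec.shape
    f0 = rng.integers(0, F - max_f)                     # mask start (freq)
    spec[:, f0:f0 + rng.integers(1, max_f + 1)] = 0.0   # frequency mask
    t0 = rng.integers(0, T - max_t)                     # mask start (time)
    spec[t0:t0 + rng.integers(1, max_t + 1), :] = 0.0   # time mask
    return spec

mel = np.random.default_rng(1).uniform(size=(200, 80))
augmented = augment_spectrogram(mel, np.random.default_rng(2))
```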
arXiv Detail & Related papers (2022-10-24T14:18:43Z)
- Introducing the Hidden Neural Markov Chain framework [7.85426761612795]
This paper proposes the original Hidden Neural Markov Chain (HNMC) framework, a new family of sequential neural models.
We propose three different models: the classic HNMC, the HNMC2, and the HNMC-CN.
These results show the new neural sequential framework's potential, which could open the way to new models and might eventually compete with the prevalent BiLSTM and BiGRU.
arXiv Detail & Related papers (2021-02-17T20:13:45Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Investigation of learning abilities on linguistic features in sequence-to-sequence text-to-speech synthesis [48.151894340550385]
Neural sequence-to-sequence text-to-speech synthesis (TTS) can produce high-quality speech directly from text or simple linguistic features such as phonemes.
We investigate under what conditions the neural sequence-to-sequence TTS can work well in Japanese and English.
arXiv Detail & Related papers (2020-05-20T23:26:14Z)