AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment
- URL: http://arxiv.org/abs/2003.01950v1
- Date: Wed, 4 Mar 2020 08:44:32 GMT
- Title: AlignTTS: Efficient Feed-Forward Text-to-Speech System without Explicit Alignment
- Authors: Zhen Zeng, Jianzong Wang, Ning Cheng, Tian Xia, Jing Xiao
- Abstract summary: AlignTTS is based on a Feed-Forward Transformer which generates the mel-spectrum from a sequence of characters, with the duration of each character determined by a duration predictor.
Our model not only achieves state-of-the-art performance, outperforming Transformer TTS by 0.03 in mean opinion score (MOS), but is also highly efficient, running more than 50 times faster than real time.
- Score: 38.85714892799518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Targeting both high efficiency and high performance, we propose AlignTTS to
predict the mel-spectrum in parallel. AlignTTS is based on a Feed-Forward
Transformer which generates the mel-spectrum from a sequence of characters, with the
duration of each character determined by a duration predictor. Instead of
adopting the attention mechanism of Transformer TTS to align text to
mel-spectrum, an alignment loss is introduced that considers all possible
alignments during training using dynamic programming. Experiments on the
LJSpeech dataset show that our model not only achieves state-of-the-art
performance, outperforming Transformer TTS by 0.03 in mean opinion score
(MOS), but is also highly efficient, running more than 50 times faster than
real time.
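The alignment loss described in the abstract marginalizes over every monotonic character-to-frame alignment with a forward dynamic-programming pass. A minimal sketch of that recursion, assuming the per-frame emission log-probabilities `log_probs[t, n]` (log p(mel frame t | character n)) are already given — the network that produces them in AlignTTS is not reproduced here, and the function name is hypothetical:

```python
import numpy as np

def alignment_loss(log_probs: np.ndarray) -> float:
    """Negative log marginal likelihood over all monotonic alignments.

    log_probs: (T, N) array of log p(mel frame t | character n).
    Each alignment maps frames 0..T-1 monotonically onto characters
    0..N-1, starting at character 0 and ending at character N-1.
    """
    T, N = log_probs.shape
    alpha = np.full((T, N), -np.inf)          # forward variables in log space
    alpha[0, 0] = log_probs[0, 0]             # first frame must align to first char
    for t in range(1, T):
        for n in range(N):
            stay = alpha[t - 1, n]            # frame t stays on character n
            move = alpha[t - 1, n - 1] if n > 0 else -np.inf  # advance one char
            alpha[t, n] = np.logaddexp(stay, move) + log_probs[t, n]
    # Sum over all alignments = value at the final frame / final character.
    return -alpha[T - 1, N - 1]
```

With uniform emission probabilities the result reduces to the (negative log) number of valid alignments weighted by the per-frame probability, which makes the recursion easy to sanity-check on tiny inputs.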
Related papers
- Test-Time Low Rank Adaptation via Confidence Maximization for Zero-Shot Generalization of Vision-Language Models [4.655740975414312]
This paper introduces Test-Time Low-rank adaptation (TTL) as an alternative to prompt tuning for zero-shot generalization of large-scale vision-language models (VLMs).
TTL offers a test-time-efficient adaptation approach that updates the attention weights of the transformer by maximizing prediction confidence.
arXiv Detail & Related papers (2024-07-22T17:59:19Z)
- DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer [9.032701216955497]
We present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders.
Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations.
We scale the training dataset and the model size to 82K hours and 790M parameters, respectively.
arXiv Detail & Related papers (2024-06-17T11:25:57Z)
- TexIm FAST: Text-to-Image Representation for Semantic Similarity Evaluation using Transformers [2.7651063843287718]
TexIm FAST is a methodology for generating fixed-length representations through a self-supervised Variational Auto-Encoder (VAE) for transformer-based semantic evaluation.
The pictorial representations allow oblivious inference while retaining the linguistic intricacies, and are potent in cross-modal applications.
The efficacy of TexIm FAST has been extensively analyzed on the task of Semantic Textual Similarity (STS) using the MSRPC, CNN/Daily Mail, and XSum datasets.
arXiv Detail & Related papers (2024-06-06T18:28:50Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while using only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Iterative pseudo-forced alignment by acoustic CTC loss for self-supervised ASR domain adaptation [80.12316877964558]
High-quality data labeling in specific domains is costly and time-consuming.
We propose a self-supervised domain adaptation method, based upon an iterative pseudo-forced alignment algorithm.
arXiv Detail & Related papers (2022-10-27T07:23:08Z)
- AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
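The token-reduction idea can be sketched as progressive pruning: a small halting head scores each token after every block, and tokens whose cumulative score crosses a threshold are dropped from further computation, so simple images spend less compute. Everything below (the layer weights, the halting head, the threshold) is a hypothetical stand-in, not AdaViT's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def halting_scores(tokens: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Per-token halting probability from a linear head plus sigmoid."""
    return 1.0 / (1.0 + np.exp(-tokens @ w))

def prune_tokens(tokens: np.ndarray, num_layers: int = 4,
                 threshold: float = 1.0, dim: int = 16) -> list[int]:
    """Pass `tokens` (n, dim) through dummy layers, dropping halted tokens.

    Returns the surviving-token count after each layer; the count can
    only shrink, which is where the inference savings come from.
    """
    heads = [rng.normal(size=dim) for _ in range(num_layers)]  # stand-in weights
    cumulative = np.zeros(len(tokens))
    active = np.ones(len(tokens), dtype=bool)
    counts = []
    for w in heads:
        cumulative[active] += halting_scores(tokens[active], w)
        active &= cumulative < threshold   # halt tokens past the threshold
        counts.append(int(active.sum()))
    return counts
```

In the real method the halting decision is learned jointly with the backbone; this sketch only shows the mechanics of shrinking the active token set layer by layer.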
arXiv Detail & Related papers (2021-12-14T18:56:07Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Accurate Word Alignment Induction from Neural Machine Translation [33.21196289328584]
We propose two novel word alignment induction methods Shift-Att and Shift-AET.
The main idea is to induce alignments at the step when the to-be-aligned target token is the decoder input.
Experiments on three publicly available datasets demonstrate that both methods perform better than their corresponding neural baselines.
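The Shift-Att idea above is simple to sketch: instead of reading the alignment for target word i from the attention at the decoding step that predicts it, read it from step i+1, when word i has become the decoder input. The attention matrix here is a synthetic stand-in, not output from a real NMT model, and the function name is hypothetical:

```python
import numpy as np

def shift_att_alignment(attn: np.ndarray) -> list[tuple[int, int]]:
    """Induce word alignments from decoder attention, shifted by one step.

    attn: (T+1, S) matrix where row t is the decoder's attention over the
    S source tokens at decoding step t. Target token y_i is aligned using
    row i+1, the step at which y_i is fed back as the decoder input.
    Returns (target_index, source_index) pairs for targets 0..T-1.
    """
    T = attn.shape[0] - 1
    pairs = []
    for i in range(T):
        j = int(np.argmax(attn[i + 1]))  # strongest source at the shifted step
        pairs.append((i, j))
    return pairs
```

The paper's Shift-AET variant adds a trained alignment module on top of this shifted extraction; only the shift-by-one extraction is sketched here.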
arXiv Detail & Related papers (2020-04-30T14:47:05Z)
- Controllable Time-Delay Transformer for Real-Time Punctuation Prediction and Disfluency Detection [10.265607222257263]
We propose a Controllable Time-delay Transformer (CT-Transformer) model that jointly completes the punctuation prediction and disfluency detection tasks in real time.
The proposed approach outperforms the previous state-of-the-art models on F-scores and achieves a competitive inference speed.
arXiv Detail & Related papers (2020-03-03T03:17:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.