TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
- URL: http://arxiv.org/abs/2412.08237v2
- Date: Thu, 12 Dec 2024 10:01:11 GMT
- Title: TouchTTS: An Embarrassingly Simple TTS Framework that Everyone Can Touch
- Authors: Xingchen Song, Mengtao Xing, Changwei Ma, Shengqiang Li, Di Wu, Binbin Zhang, Fuping Pan, Dinghao Zhou, Yuekai Zhang, Shun Lei, Zhendong Peng, Zhiyong Wu
- Abstract summary: Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data.
In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline.
This pipeline maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%.
- Score: 18.661974399115007
- Abstract: It is well known that LLM-based systems are data-hungry. Recent LLM-based TTS works typically employ complex data processing pipelines to obtain high-quality training data. These sophisticated pipelines require excellent models at each stage (e.g., speech denoising, speech enhancement, speaker diarization, and punctuation models), which themselves demand high-quality training data and are rarely open-sourced. Even with state-of-the-art models, issues persist, such as incomplete background noise removal and misalignment between punctuation and actual speech pauses. Moreover, the stringent filtering strategies often retain only 10-30% of the original data, significantly impeding data scaling efforts. In this work, we leverage a noise-robust audio tokenizer (S3Tokenizer) to design a simplified yet effective TTS data processing pipeline that maintains data quality while substantially reducing data acquisition costs, achieving a data retention rate of over 50%. Beyond data scaling challenges, LLM-based TTS systems also incur higher deployment costs compared to conventional approaches. Current systems typically use LLMs solely for text-to-token generation, while requiring separate models (e.g., flow matching models) for token-to-waveform generation, which cannot be directly executed by LLM inference engines, further complicating deployment. To address these challenges, we eliminate redundant modules in both LLM and flow components, replacing the flow model backbone with an LLM architecture. Building upon this simplified flow backbone, we propose a unified architecture for both streaming and non-streaming inference, significantly reducing deployment costs. Finally, we explore the feasibility of unifying TTS and ASR tasks using the same data for training, thanks to the simplified pipeline and the S3Tokenizer that reduces the quality requirements for TTS training data.
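To make the simplified pipeline concrete, below is a minimal sketch of what such a data preparation step could look like. It assumes a hypothetical `tokenize` callable wrapping a noise-robust speech tokenizer such as S3Tokenizer, plus illustrative length thresholds; none of these names or values come from the paper.

```python
# Minimal sketch of a simplified TTS data pipeline (hypothetical helper names and
# thresholds; the paper's actual implementation is not described at this level).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Utterance:
    audio_path: str
    text: str


def build_training_pairs(
    utterances: List[Utterance],
    tokenize: Callable[[str], List[int]],  # e.g. a wrapper around S3Tokenizer
    min_tokens: int = 10,
    max_tokens: int = 2000,
) -> List[Tuple[str, List[int]]]:
    """Turn raw (possibly noisy) speech into (text, speech-token) pairs.

    Because the tokenizer is assumed to be noise-robust, there are no separate
    denoising, enhancement, or diarization stages here, only light filters,
    which is what allows a much higher data retention rate than strict pipelines.
    """
    pairs = []
    for utt in utterances:
        if not utt.text.strip():
            continue  # drop utterances with empty transcripts
        tokens = tokenize(utt.audio_path)
        if min_tokens <= len(tokens) <= max_tokens:
            # The same pairs can be read text->tokens for TTS training or
            # tokens->text for ASR training, as the abstract suggests.
            pairs.append((utt.text, tokens))
    return pairs
```

The single light filtering pass is the point of the sketch: if the tokenizer tolerates background noise, most of the stage-by-stage cleanup models become unnecessary.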
Related papers
- FastMCTS: A Simple Sampling Strategy for Data Synthesis [67.60823802317141]
We introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search.
FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals.
Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths.
arXiv Detail & Related papers (2025-02-17T06:27:57Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Fast Streaming Transducer ASR Prototyping via Knowledge Distillation with Whisper [3.717584661565119]
We demonstrate that streaming Transformer-Transducer (TT) models can be trained from scratch without supervised data.
This allows training a robust ASR model in a single stage without a large data or computational budget.
We validate the proposed framework on 6 languages from CommonVoice and propose multiple filters to remove hallucinated pseudo-labels (PLs).
arXiv Detail & Related papers (2024-09-20T13:38:59Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition [21.516152600963775]
Denoising LM (DLM) is a scaled error correction model trained with vast amounts of synthetic data.
DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on LibriSpeech.
arXiv Detail & Related papers (2024-05-24T05:05:12Z)
- LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement [79.31084387589968]
Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks.
We propose LLM2LLM, a data augmentation strategy that uses a teacher LLM to enhance a small seed dataset.
We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime.
arXiv Detail & Related papers (2024-03-22T08:57:07Z)
- EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z)
- On the Relevance of Phoneme Duration Variability of Synthesized Training Data for Automatic Speech Recognition [0.552480439325792]
We focus on the temporal structure of synthetic data and its relation to ASR training.
We show to what extent the quality degradation of synthetic data is driven by duration modeling in non-autoregressive TTS.
Using a simple algorithm we shift phoneme duration distributions of the TTS system closer to real durations.
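As a hypothetical illustration of such a shift (the paper's actual algorithm may differ), one simple option is per-phoneme moment matching: rescale each predicted duration so its distribution has the mean and variance measured on real, force-aligned speech.

```python
# Hypothetical sketch: match synthetic phoneme-duration statistics to real ones
# via per-phoneme moment matching; not necessarily the paper's algorithm.
from statistics import mean, pstdev
from typing import Dict, List


def shift_durations(
    synthetic: Dict[str, List[float]],  # phoneme -> predicted durations (frames)
    real: Dict[str, List[float]],       # phoneme -> durations from forced alignment
) -> Dict[str, List[float]]:
    shifted = {}
    for phoneme, durations in synthetic.items():
        reference = real.get(phoneme)
        if not reference or len(durations) < 2:
            shifted[phoneme] = durations  # not enough statistics, leave unchanged
            continue
        mu_s, sd_s = mean(durations), pstdev(durations)
        mu_r, sd_r = mean(reference), pstdev(reference)
        scale = (sd_r / sd_s) if sd_s > 0 else 1.0
        # Standardize against synthetic statistics, then map onto real statistics,
        # clamping each duration to at least one frame.
        shifted[phoneme] = [max(1.0, (d - mu_s) * scale + mu_r) for d in durations]
    return shifted
```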
arXiv Detail & Related papers (2023-10-12T08:45:21Z)
- Comparative Analysis of Transfer Learning in Deep Learning Text-to-Speech Models on a Few-Shot, Low-Resource, Customized Dataset [10.119929769316565]
This thesis is rooted in the pressing need to find TTS models that require less training time, fewer data samples, yet yield high-quality voice output.
The research evaluates the transfer learning capabilities of state-of-the-art TTS models through a thorough technical analysis.
It then conducts a hands-on experimental analysis to compare the models' performance on a constrained dataset.
arXiv Detail & Related papers (2023-10-08T03:08:25Z)
- Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks.
We conducted experiments using the Llama2-7b-chat model on nine different languages from the MuST-C dataset.
The results show that the LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
- Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, which significantly outperforms previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.