TTS-1 Technical Report
- URL: http://arxiv.org/abs/2507.21138v1
- Date: Tue, 22 Jul 2025 23:57:11 GMT
- Title: TTS-1 Technical Report
- Authors: Oleg Atamanenko, Anna Chalova, Joseph Coombes, Nikki Cope, Phillip Dang, Zhifeng Deng, Jimmy Du, Michael Ermolenko, Feifan Fan, Yufei Feng, Cheryl Fichter, Pavel Filimonov, Louis Fischer, Kylan Gibbs, Valeria Gusarova, Pavel Karpik, Andreas Assad Kottner, Ian Lee, Oliver Louie, Jasmine Mai, Mikhail Mamontov, Suri Mao, Nurullah Morshed, Igor Poletaev, Florin Radu, Dmytro Semernia, Evgenii Shingarev, Vikram Sivaraja, Peter Skirko, Rinat Takhautdinov, Robert Villahermosa, Jean Wang,
- Abstract summary: Inworld TTS-1 is a set of two Transformer-based autoregressive text-to-speech models.<n>TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency.<n>We additionally open-source our training and modeling code under an MIT license.
- Score: 0.9134656817901009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Inworld TTS-1, a set of two Transformer-based autoregressive text-to-speech (TTS) models. Our largest model, TTS-1-Max, has 8.8B parameters and is designed for utmost quality and expressiveness in demanding applications. TTS-1 is our most efficient model, with 1.6B parameters, built for real-time speech synthesis and on-device use cases. By scaling train-time compute and applying a sequential process of pre-training, fine-tuning, and RL-alignment of the speech-language model (SpeechLM) component, both models achieve state-of-the-art performance on a variety of benchmarks, demonstrating exceptional quality relying purely on in-context learning of the speaker's voice. Inworld TTS-1 and TTS-1-Max can generate high-resolution 48 kHz speech with low latency, and support 11 languages with fine-grained emotional control and non-verbal vocalizations through audio markups. We additionally open-source our training and modeling code under an MIT license.
Related papers
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis [56.25862714128288]
This paper introduces textitMegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm.<n>Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space.<n>Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z) - Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement [17.020173869112163]
We present a lightweight and stable zero-shot Text-To-Speech (TTS) system.<n>We introduce a novel TTS architecture designed to effectively model linguistic content and speaker attributes from source speech and prompt speech.
arXiv Detail & Related papers (2025-01-15T04:17:48Z) - DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors [8.419383213705789]
We introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors.<n>We find that DiT with minimal modifications outperforms U-Net, variable-length modeling with a speech length predictor, and conditions like semantic alignment in speech latent representations are key to further enhancement.
arXiv Detail & Related papers (2024-06-17T11:25:57Z) - EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech [4.91849983180793]
We propose a lightweight Text-to-Speech (TTS) system based on deep convolutional neural networks.
Our model consists of two stages: Text2Spectrum and SSRN.
Experiments show that our model can reduce the training time and parameters while ensuring the quality and naturalness of the synthesized speech.
arXiv Detail & Related papers (2024-03-13T01:27:57Z) - Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
Cross-MixSpeaker.
Network addresses limitations of SIMO models by aggregating cross-speaker representations.
Network is integrated with SOT to leverage both the advantages of SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z) - Pheme: Efficient and Conversational Speech Generation [52.34331755341856]
We introduce the Pheme model series that offers compact yet high-performing conversational TTS models.
It can be trained efficiently on smaller-scale conversational data, cutting data demands by more than 10x but still matching the quality of the autoregressive TTS models.
arXiv Detail & Related papers (2024-01-05T14:47:20Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive
Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers [92.55131711064935]
We introduce a language modeling approach for text to speech synthesis (TTS)
Specifically, we train a neural language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio model.
Vall-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech.
arXiv Detail & Related papers (2023-01-05T15:37:15Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - Transfer Learning Framework for Low-Resource Text-to-Speech using a
Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z) - YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice
Conversion for everyone [0.7927630381442314]
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS.
We achieve state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset.
It is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality.
arXiv Detail & Related papers (2021-12-04T19:50:29Z) - Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR)
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.