Qwen3-TTS Technical Report
- URL: http://arxiv.org/abs/2601.15621v1
- Date: Thu, 22 Jan 2026 03:51:43 GMT
- Title: Qwen3-TTS Technical Report
- Authors: Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, Xinyu Zhang, Pei Zhang, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
- Abstract summary: We present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control. Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation over the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
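As a back-of-the-envelope check on the streaming figures quoted above, the two tokenizer frame rates imply the following per-token durations. This is a minimal sketch from the reported numbers only; the split of the 97 ms first packet into frame wait plus compute is an inference, not a figure from the paper.

```python
def frame_duration_ms(frame_rate_hz: float) -> float:
    """Milliseconds of audio covered by one speech token frame."""
    return 1000.0 / frame_rate_hz

# Qwen-TTS-Tokenizer-25Hz: 25 token frames per second of audio.
qwen_25hz = frame_duration_ms(25.0)   # 40.0 ms per token

# Qwen-TTS-Tokenizer-12Hz: 12.5 token frames per second of audio.
qwen_12hz = frame_duration_ms(12.5)   # 80.0 ms per token

# The reported 97 ms first-packet latency is close to a single
# 12.5 Hz frame (80 ms), leaving roughly 17 ms for model and
# vocoder compute -- an inferred breakdown, not one the paper states.
print(qwen_25hz, qwen_12hz)
```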
Related papers
- Qwen3-Omni Technical Report
We present Qwen3-Omni, a single multimodal model that maintains state-of-the-art performance across text, image, audio, and video. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. It supports text interaction in 119 languages, speech understanding in 19 languages, and speech generation in 10 languages.
arXiv Detail & Related papers (2025-09-22T13:26:24Z)
- TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling
We introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec). TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder. It achieves an extremely low frame rate of 6.25 Hz and a corresponding compression of 0.0875 kbps with a single-layer codebook for 24 kHz speech.
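As a quick consistency check, the reported frame rate and bitrate together determine the bits carried by each token and hence an implied codebook size. The codebook size below is derived arithmetic under the assumption that both reported figures are exact; it is not stated in the summary.

```python
# TaDiCodec figures as reported: 6.25 Hz frame rate, 0.0875 kbps,
# single-layer codebook.
frame_rate_hz = 6.25
bitrate_bps = 0.0875 * 1000              # 0.0875 kbps -> 87.5 bits/s

# Bits per token = bitrate / tokens per second (~14 bits).
bits_per_token = round(bitrate_bps / frame_rate_hz)

# A single codebook carrying 14 bits per token implies 2^14 entries.
implied_codebook_size = 2 ** bits_per_token

print(bits_per_token, implied_codebook_size)
```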
arXiv Detail & Related papers (2025-08-22T20:45:03Z)
- UniTTS: An end-to-end TTS system without decoupling of acoustic and semantic information
We propose DistilCodec and UniTTS, which collectively offer the following advantages. DistilCodec distills a multi-codebook audio codec into a single-codebook codec with 32 codes while achieving near-100% utilization. UniTTS employs a three-stage training process: Pre-Training, Supervised Fine-Tuning (SFT), and Alignment.
arXiv Detail & Related papers (2025-05-23T03:13:46Z)
- Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
We introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types. To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations.
arXiv Detail & Related papers (2025-03-03T16:23:10Z)
- Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities. Our pipeline consists of three main components. First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios.
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
This paper introduces MegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm. Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z)
- IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot Text-To-Speech System
We introduce the IndexTTS system, which is mainly based on the XTTS and Tortoise models. Specifically, in Chinese scenarios, we adopt a hybrid modeling method that combines characters and pinyin. Compared with XTTS, it achieves significant improvements in naturalness, content consistency, and zero-shot voice cloning.
arXiv Detail & Related papers (2025-02-08T10:23:20Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- MultiSpeech: Multi-Speaker Text to Speech with Transformer
Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS, FastSpeech) have shown advantages in training and inference efficiency over RNN-based models.
We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.