Related papers: VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

URL: http://arxiv.org/abs/2509.15969v1
Date: Fri, 19 Sep 2025 13:26:46 GMT
Title: VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Authors: Nikita Torgashov, Gustav Eje Henter, Gabriel Skantze
Abstract summary: We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU.
Score: 17.067283475630095
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment scheme and a dynamic look-ahead that does not delay onset. Built around an incremental phoneme transformer, a temporal transformer predicting semantic and duration tokens, and a depth transformer producing acoustic tokens, VoXtream achieves, to our knowledge, the lowest initial delay among publicly available streaming TTS: 102 ms on GPU. Despite being trained on a mid-scale 9k-hour corpus, it matches or surpasses larger baselines on several metrics, while delivering competitive quality in both output- and full-streaming settings. Demo and code are available at https://herimor.github.io/voxtream.

Related papers

Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z)
Qwen3-TTS Technical Report [64.94647392030824]
We present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control. Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers.
arXiv Detail & Related papers (2026-01-22T03:51:43Z)
StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling [50.537794606598254]
StreamMel is a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. It enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. It even achieves performance comparable to offline systems while supporting efficient real-time generation.
arXiv Detail & Related papers (2025-06-14T16:53:39Z)
SpeakStream: Streaming Text-to-Speech with Interleaved Data [11.131427505801062]
We present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. During inference, SpeakStream generates speech incrementally while absorbing streaming input text. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency while maintaining the quality of non-streaming TTS systems.
arXiv Detail & Related papers (2025-05-25T16:11:10Z)
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens [31.575335190916995]
We introduce Spark-TTS, a novel system powered by BiCodec, a single-stream speech codec that decomposes speech into two complementary token types. To facilitate research in controllable TTS, we introduce VoxBox, a meticulously curated 100,000-hour dataset with comprehensive attribute annotations.
arXiv Detail & Related papers (2025-03-03T16:23:10Z)
SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer [68.78023656892319]
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech. SyncSpeech has the following advantages: low latency, as it begins generating streaming speech upon receiving the second text token; and high efficiency, as it decodes all speech tokens corresponding to each arrived text token in one step.
arXiv Detail & Related papers (2025-02-16T12:14:17Z)
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
MultiSpeech: Multi-Speaker Text to Speech with Transformer [145.56725956639232]
Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS, FastSpeech) have shown advantages in training and inference efficiency over RNN-based models. We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.