SpeakStream: Streaming Text-to-Speech with Interleaved Data
- URL: http://arxiv.org/abs/2505.19206v1
- Date: Sun, 25 May 2025 16:11:10 GMT
- Title: SpeakStream: Streaming Text-to-Speech with Interleaved Data
- Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
- Abstract summary: We present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. During inference, SpeakStream generates speech incrementally while absorbing streaming input text. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency while maintaining the quality of non-streaming TTS systems.
- Score: 11.131427505801062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The latency bottleneck of traditional text-to-speech (TTS) systems fundamentally hinders the potential of streaming large language models (LLMs) in conversational AI. These TTS systems, typically trained and run on complete utterances, introduce unacceptable delays when coupled with streaming LLM outputs, even at optimized inference speeds. This is particularly problematic for building responsive conversational agents, where low first-token latency is critical. In this paper, we present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. SpeakStream is trained with a next-step prediction loss on interleaved text-speech data. During inference, it generates speech incrementally while absorbing streaming input text, making it particularly suitable for cascaded conversational AI agents in which an LLM streams text to a TTS system. Our experiments demonstrate that SpeakStream achieves state-of-the-art first-token latency while maintaining the quality of non-streaming TTS systems.
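To make the interleaving idea concrete, below is a minimal sketch of the two phases the abstract describes: training with next-step prediction on a single token stream that alternates chunks of text with the speech tokens that realize them, and inference that absorbs streaming text while emitting speech incrementally. The model API (`initial_state`, `prefill`, `sample_next`) and the fixed text-to-speech chunk ratio are illustrative assumptions, not SpeakStream's actual interface or alignment scheme.

```python
# Minimal sketch, assuming a hypothetical decoder-only model API; the
# fixed 4:12 text-to-speech chunk ratio is an assumption for illustration.

def interleave(text_tokens, speech_tokens, text_chunk=4, speech_chunk=12):
    """Build one training sequence that alternates small chunks of text
    tokens with the speech tokens that realize them; the model is trained
    with an ordinary next-step prediction loss over this sequence."""
    seq, t, s = [], 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        seq += text_tokens[t:t + text_chunk]
        seq += speech_tokens[s:s + speech_chunk]
        t += text_chunk
        s += speech_chunk
    return seq

def stream_tts(model, incoming_text, speech_per_text=12):
    """Absorb text tokens as they arrive from a streaming LLM and emit
    speech tokens incrementally in between (hypothetical model API)."""
    state = model.initial_state()
    for text_tok in incoming_text:        # streaming LLM output
        state = model.prefill(state, [text_tok])
        for _ in range(speech_per_text):  # speech aligned with this text
            speech_tok, state = model.sample_next(state)
            yield speech_tok              # can be vocoded and played at once
```

Because speech sampling can begin after only a few text tokens have arrived, first-token latency is decoupled from utterance length, which is the property the abstract highlights.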
Related papers
- StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model [20.978001644716063]
Streaming speech translation (StreamST) requires determining the appropriate timing, known as the policy, to generate translations. Existing StreamST methods typically operate on sentence-level speech segments, a setting referred to as simultaneous speech translation (SimulST). We propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM).
arXiv Detail & Related papers (2025-07-10T14:28:39Z)
- PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction [29.64357898080842]
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. Their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. We propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time (a toy sketch of this idea appears after this list).
arXiv Detail & Related papers (2025-06-18T15:29:02Z)
- StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling [50.537794606598254]
StreamMel is a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. It enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. It even achieves performance comparable to offline systems while supporting efficient real-time generation.
arXiv Detail & Related papers (2025-06-14T16:53:39Z)
- Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a lookahead mechanism, relying on future text to achieve natural streaming speech synthesis. We propose LE, a streaming framework for generating high-quality speech frame-by-frame. Experimental results suggest that LE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z)
- VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [70.25062476543091]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation. Its MCTP module efficiently generates multiple audio tokens within a single model forward pass. A four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
arXiv Detail & Related papers (2025-05-06T17:59:53Z)
- SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation [14.57248739077317]
This paper proposes SimulS2S-LLM, which trains speech LLMs offline and employs a test-time policy to guide simultaneous inference. SimulS2S-LLM achieves simultaneous speech-to-speech translation (Simul-S2ST) by predicting discrete output speech tokens and then synthesising output speech using a pre-trained vocoder.
arXiv Detail & Related papers (2025-04-22T01:05:32Z)
- SyncSpeech: Low-Latency and Efficient Dual-Stream Text-to-Speech based on Temporal Masked Transformer [68.78023656892319]
This paper presents a dual-stream text-to-speech (TTS) model, SyncSpeech, capable of receiving streaming text input from upstream models while simultaneously generating streaming speech. SyncSpeech has the following advantages: low latency, as it begins generating streaming speech upon receiving the second text token; and high efficiency, as it decodes all speech tokens corresponding to each arriving text token in one step (a toy sketch of this schedule appears after this list).
arXiv Detail & Related papers (2025-02-16T12:14:17Z)
- A Non-autoregressive Generation Framework for End-to-End Simultaneous Speech-to-Speech Translation [48.84039953531355]
We propose a novel non-autoregressive generation framework for simultaneous speech translation (NAST-S2X).
NAST-S2X integrates speech-to-text and speech-to-speech tasks into a unified end-to-end framework.
It achieves high-quality simultaneous interpretation within a delay of less than 3 seconds and provides a 28 times decoding speedup in offline generation.
arXiv Detail & Related papers (2024-06-11T04:25:48Z)
- StreamAtt: Direct Streaming Speech-to-Text Translation with Attention-based Audio History Selection [23.75894159181602]
Streaming speech-to-text translation (StreamST) is the task of automatically translating speech while incrementally receiving an audio stream.
We introduce StreamAtt, the first StreamST policy, and propose StreamLAAL, the first StreamST latency metric.
arXiv Detail & Related papers (2024-06-10T08:27:58Z)
- Speak While You Think: Streaming Speech Synthesis During Text Generation [13.964169328257233]
Large Language Models (LLMs) demonstrate impressive capabilities, yet interaction with these models is mostly facilitated through text.
We propose LLM2Speech, an architecture to synthesize speech while text is being generated by an LLM, which yields significant latency reduction.
arXiv Detail & Related papers (2023-09-20T11:00:15Z)
- DiariST: Streaming Speech Translation with Speaker Diarization [53.595990270899414]
We propose DiariST, the first streaming ST and SD solution.
It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector.
Our system achieves a strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech.
arXiv Detail & Related papers (2023-09-14T19:33:27Z)
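As noted in the PredGen entry above, input-time speculation can be pictured with a toy sketch: while user input is still arriving, draft a candidate response from the current partial input, and reuse it only if the finalized input matches the prefix the draft was conditioned on. The `llm.generate` call is a hypothetical placeholder, and the paper's speculative decoding presumably verifies at a finer granularity than this whole-response check.

```python
# Toy sketch of input-time speculation, reconstructed from the PredGen
# abstract only; `llm.generate` is a hypothetical placeholder API.

def predictive_generation(llm, partial_inputs):
    """`partial_inputs` yields (text_so_far, is_final) as the user speaks."""
    draft, drafted_from = None, None
    for text_so_far, is_final in partial_inputs:
        if not is_final:
            # Speculate while the user is still talking: this decoding
            # overlaps with input time instead of adding response latency.
            draft, drafted_from = llm.generate(text_so_far), text_so_far
        elif draft is not None and text_so_far == drafted_from:
            return draft                      # speculation verified: reuse it
        else:
            return llm.generate(text_so_far)  # mismatch: decode normally
```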
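Similarly, the SyncSpeech entry's dual-stream schedule can be sketched as follows: synthesis starts once the second text token arrives, and all speech tokens aligned with an already-received text token are decoded in a single step. `decode_speech_for` is an invented placeholder, and the one-token lookahead is inferred from the abstract alone.

```python
# Toy sketch of a dual-stream decoding schedule, based only on the
# SyncSpeech abstract above; `decode_speech_for` is an invented API.

def dual_stream_tts(model, text_stream):
    received = []
    for i, text_tok in enumerate(text_stream):
        received.append(text_tok)
        if i >= 1:
            # From the second text token onward, decode in one step all
            # speech tokens aligned with the previous text token.
            yield model.decode_speech_for(received, target_index=i - 1)
    if received:
        # Flush the speech for the final text token once the stream ends.
        yield model.decode_speech_for(received, target_index=len(received) - 1)
```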
This list is automatically generated from the titles and abstracts of the papers on this site.