LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning
- URL: http://arxiv.org/abs/2601.19952v1
- Date: Mon, 26 Jan 2026 15:42:35 GMT
- Title: LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning
- Authors: Wenhao Zou, Yuwei Miao, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu
- Abstract summary: LTS-VoiceAgent is a Listen-Think-Speak framework that separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker and a foreground Speaker.
- Score: 27.13598270494417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-time voice agents face a dilemma: end-to-end models often lack deep reasoning, while cascaded pipelines incur high latency by executing ASR, LLM reasoning, and TTS strictly in sequence, unlike human conversation where listeners often start thinking before the speaker finishes. Since cascaded architectures remain the dominant choice for complex tasks, existing cascaded streaming strategies attempt to reduce this latency via mechanical segmentation (e.g., fixed chunks, VAD-based splitting) or speculative generation, but they frequently either break semantic units or waste computation on predictions that must be rolled back. To address these challenges, we propose LTS-VoiceAgent, a Listen-Think-Speak framework that explicitly separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker (for state maintenance) and a foreground Speaker (for speculative solving). This parallel design enables "thinking while speaking" without blocking responses. We also introduce a Pause-and-Repair benchmark containing natural disfluencies to stress-test streaming robustness. Experiments across VERA, Spoken-MQA, BigBenchAudio, and our benchmark show that LTS-VoiceAgent achieves a stronger accuracy-latency-efficiency trade-off than serial cascaded baselines and existing streaming strategies.
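The abstract's central idea, triggering background reasoning on meaningful transcript prefixes while a foreground role answers without blocking, can be sketched in miniature. This is a toy illustration, not the paper's implementation: the clause-boundary heuristic stands in for the learned Dynamic Semantic Trigger, and all class and method names are invented here.

```python
import threading
import queue

# Hypothetical trigger: fire when a streaming ASR prefix looks like a
# complete semantic unit (a crude clause-boundary heuristic standing in
# for the paper's learned Dynamic Semantic Trigger).
def semantic_trigger(prefix: str, min_words: int = 4) -> bool:
    words = prefix.split()
    return len(words) >= min_words and prefix.rstrip().endswith((",", ".", "?"))

class Orchestrator:
    """Toy dual-role coordinator: a background Thinker updates shared
    state on each triggered prefix, while the foreground Speaker reads
    the latest state to answer without waiting for the full utterance."""

    def __init__(self):
        self.state = {"notes": []}          # incremental reasoning state
        self.prefixes = queue.Queue()
        self.thinker = threading.Thread(target=self._think, daemon=True)
        self.thinker.start()

    def _think(self):
        while True:
            prefix = self.prefixes.get()
            if prefix is None:
                break                        # shutdown sentinel
            # Thinker: maintain state incrementally (placeholder logic).
            self.state["notes"].append(f"analyzed: {prefix}")
            self.prefixes.task_done()

    def on_asr_chunk(self, prefix: str):
        if semantic_trigger(prefix):
            self.prefixes.put(prefix)        # hand off without blocking

    def speak(self) -> str:
        # Speaker: speculative answer from whatever state exists so far.
        self.prefixes.join()                 # demo only: sync before reading
        return f"Drawing on {len(self.state['notes'])} partial analyses."

orch = Orchestrator()
for chunk in ["what is", "what is the total cost,", "what is the total cost, with tax?"]:
    orch.on_asr_chunk(chunk)
print(orch.speak())
```

The point of the queue hand-off is that `on_asr_chunk` returns immediately: listening never waits on thinking, which is what lets speaking and thinking overlap.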
Related papers
- Discourse-Aware Dual-Track Streaming Response for Low-Latency Spoken Dialogue Systems [31.911085541071028]
We propose a low-latency architecture that enables listen-while-thinking and speak-while-thinking. Experiments on two spoken dialogue benchmarks demonstrate that DDTSR reduces response latency by 19%-51%.
arXiv Detail & Related papers (2026-02-26T17:39:56Z)
- Balancing Faithfulness and Performance in Reasoning via Multi-Listener Soft Execution [79.98699884805636]
Reasoning Execution by Multiple Listeners (REMUL) is a multi-party reinforcement learning approach. REMUL builds on the hypothesis that reasoning traces which other parties can follow will be more faithful. Speakers are rewarded for producing reasoning that is clear to listeners.
arXiv Detail & Related papers (2026-02-18T02:55:55Z)
- TagSpeech: End-to-End Multi-Speaker ASR and Diarization with Fine-Grained Temporal Grounding [15.908533215017059]
We present TagSpeech, a unified framework that utilizes Temporal Anchor Grounding for joint multi-speaker ASR and diarization. The framework is built on two key designs: (1) decoupled semantic and speaker streams fine-tuned via Serialized Output Training (SOT) to learn turn-taking dynamics; and (2) an interleaved time anchor mechanism that acts as a synchronization signal between semantic understanding and speaker tracking.
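The interleaved-anchor idea can be made concrete with a tiny serialization sketch. The token names (`<t=...>`, `<spk...>`) are invented here to show the concept of one stream carrying text, speaker, and timing together; they are not TagSpeech's actual vocabulary.

```python
# Illustrative only: interleave time anchors and speaker tags with words
# so a single token stream carries ASR text, diarization, and timing.
def serialize(segments):
    """segments: list of (start_time, speaker_id, text) tuples."""
    out = []
    for start, speaker, text in sorted(segments, key=lambda s: s[0]):
        out.append(f"<t={start:.1f}>")   # time anchor: synchronization signal
        out.append(f"<spk{speaker}>")    # speaker tag: diarization signal
        out.extend(text.split())         # semantic content
    return " ".join(out)

segments = [(0.0, 1, "hello there"), (1.4, 2, "hi how are you")]
print(serialize(segments))
# <t=0.0> <spk1> hello there <t=1.4> <spk2> hi how are you
```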
arXiv Detail & Related papers (2026-01-11T12:40:07Z)
- AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning [27.522862635055077]
We present AsyncVoice Agent, a system whose asynchronous architecture decouples a streaming LLM backend from a conversational voice frontend. This design allows narration and inference to run in parallel, empowering users to interrupt, query, and steer the model's reasoning process. Objective benchmarks show this approach reduces interaction latency by more than 600x compared to monolithic baselines.
arXiv Detail & Related papers (2025-10-17T19:00:08Z)
- Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage [66.67531241554546]
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines. We introduce the first approach to extend tool use directly into speech-in speech-out systems. We propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech.
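The latency win from predicting tool queries mid-utterance amounts to prefetching: if the speculative query matches the finalized one, its result is already cached by end-of-speech. A minimal sketch, with an invented filler-stripping predictor and a fake retriever standing in for a real search backend:

```python
# Hypothetical sketch of speculative retrieval during user speech.
def predict_query(partial: str) -> str:
    # Stand-in predictor: strip filler words to guess the eventual query.
    fillers = {"um", "uh", "like"}
    return " ".join(w for w in partial.split() if w not in fillers)

cache = {}
def retrieve(query: str) -> str:
    # Fake retriever with a cache standing in for a real search backend.
    if query not in cache:
        cache[query] = f"docs-for:{query}"
    return cache[query]

# While the user is still speaking, prefetch on the predicted query...
speculative = retrieve(predict_query("um weather in paris"))
# ...so when the utterance finalizes, the answer is already cached.
final = retrieve("weather in paris")
assert final is speculative  # cache hit: no retrieval latency after end-of-speech
```

A mispredicted query only wastes one prefetch; it never has to be "rolled back", which is the contrast the LTS-VoiceAgent abstract draws with speculative generation.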
arXiv Detail & Related papers (2025-10-02T14:18:20Z)
- Chronological Thinking in Full-Duplex Spoken Dialogue Language Models [66.84843878538207]
Chronological Thinking aims to improve response quality in full-duplex SDLMs. It adds no additional latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
arXiv Detail & Related papers (2025-10-02T10:28:11Z)
- ChipChat: Low-Latency Cascaded Conversational Agent in MLX [34.30974874671028]
ChipChat is a novel low-latency cascaded system (CS) that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. Our work shows that strategically redesigned CSs can overcome their historical latency limitations, offering a promising path forward for practical voice-based AI agents.
arXiv Detail & Related papers (2025-08-26T20:40:24Z)
- STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models [131.90117151306993]
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. Current SLMs lack the ability to perform an internal, unspoken thinking process before responding. We propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks.
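The alternation between hidden reasoning and spoken output can be sketched as a simple routing loop. This toy version alternates on chunk index; names and the even/odd schedule are illustrative, not Stitch's actual mechanism.

```python
# Toy alternation loop in the spirit of Stitch: interleave unspoken
# reasoning chunks with spoken response chunks.
def generate(steps):
    spoken, hidden = [], []
    for i, step in enumerate(steps):
        if i % 2 == 0:
            hidden.append(f"[think] {step}")   # reasoning chunk: kept internal
        else:
            spoken.append(step)                # response chunk: sent to TTS
    return spoken, hidden

spoken, hidden = generate(["parse question", "Sure,", "compute sum", "it is 42."])
print(" ".join(spoken))   # Sure, it is 42.
```

The user only ever hears the spoken stream; the reasoning chunks fill gaps between speech without delaying it.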
arXiv Detail & Related papers (2025-07-21T08:30:03Z)
- StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling [50.537794606598254]
StreamMel is a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. It enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. It achieves performance comparable to offline systems while supporting efficient real-time generation.
arXiv Detail & Related papers (2025-06-14T16:53:39Z)
- Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM [3.6950912517562435]
We propose a method that implicitly internalizes ASR chain of thought into a speech LLM, enhancing its native speech understanding capabilities.
Our approach reduces latency and improves the model's native understanding of speech, paving the way for more efficient and natural real-time audio interactions.
arXiv Detail & Related papers (2024-09-25T20:59:12Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)