AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning
- URL: http://arxiv.org/abs/2510.16156v1
- Date: Fri, 17 Oct 2025 19:00:08 GMT
- Title: AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning
- Authors: Yueqian Lin, Zhengmian Hu, Jayakumar Subramanian, Qinsi Wang, Nikos Vlassis, Hai "Helen" Li, Yiran Chen
- Abstract summary: We present AsyncVoice Agent, a system whose asynchronous architecture decouples a streaming LLM backend from a conversational voice frontend. This design allows narration and inference to run in parallel, empowering users to interrupt, query, and steer the model's reasoning process. Objective benchmarks show this approach reduces interaction latency by more than 600x compared to monolithic baselines.
- Score: 27.522862635055077
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Effective human-AI collaboration on complex reasoning tasks requires that users understand and interact with the model's process, not just receive an output. However, the monolithic text from methods like Chain-of-Thought (CoT) prevents this, as current interfaces lack real-time verbalization and robust user barge-in. We present AsyncVoice Agent, a system whose asynchronous architecture decouples a streaming LLM backend from a conversational voice frontend. This design allows narration and inference to run in parallel, empowering users to interrupt, query, and steer the model's reasoning process at any time. Objective benchmarks show this approach reduces interaction latency by more than 600x compared to monolithic baselines while ensuring high fidelity and competitive task accuracy. By enabling a two-way dialogue with a model's thought process, AsyncVoice Agent offers a new paradigm for building more effective, steerable, and trustworthy human-AI systems for high-stakes tasks.
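The decoupling described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a producer/consumer layout built on Python's `asyncio`, and every name in it (`reasoning_backend`, `voice_frontend`, `run_session`, `interrupt_after`) is hypothetical. A backend task streams reasoning steps into a queue while a frontend task narrates them concurrently, and a shared event models user barge-in.

```python
import asyncio


async def reasoning_backend(steps, queue, barge_in):
    """Hypothetical stand-in for a streaming LLM: emits one step at a time."""
    try:
        for step in steps:
            if barge_in.is_set():
                break  # the user interrupted, so stop generating further steps
            await asyncio.sleep(0)  # yield to the event loop, as a token stream would
            await queue.put(step)
    finally:
        await queue.put(None)  # sentinel: no more steps will arrive


async def voice_frontend(queue, barge_in, spoken, interrupt_after=None):
    """Hypothetical narrator: 'speaks' steps as soon as they arrive."""
    while not barge_in.is_set():
        step = await queue.get()
        if step is None:
            break
        spoken.append(step)  # a real frontend would hand this text to TTS
        # Simulated barge-in (assumption): the user cuts in after hearing N steps.
        if interrupt_after is not None and len(spoken) >= interrupt_after:
            barge_in.set()


async def run_session(steps, interrupt_after=None):
    """Run backend and frontend concurrently; return what was narrated."""
    queue = asyncio.Queue()
    barge_in = asyncio.Event()
    spoken = []
    await asyncio.gather(
        reasoning_backend(steps, queue, barge_in),
        voice_frontend(queue, barge_in, spoken, interrupt_after),
    )
    return spoken


if __name__ == "__main__":
    plan = ["parse the question", "draft a plan", "check the answer"]
    print(asyncio.run(run_session(plan, interrupt_after=2)))
```

Because the two tasks communicate only through the queue and the event, narration never waits for the full reasoning trace, and an interruption takes effect at the next step boundary rather than at the end of generation.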
Related papers
- U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation [48.6868174403074]
We introduce U-Mind, the first unified system for high-intelligence multimodal dialogue.
It supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop.
We show that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks.
arXiv Detail & Related papers (2026-02-27T07:07:02Z)
- ChatUMM: Robust Context Tracking for Conversational Interleaved Generation [44.19929499646892]
Unified multimodal models (UMMs) have achieved remarkable progress yet remain constrained by a single-turn interaction paradigm.
We present ChatUMM, a conversational unified model that excels at robust context tracking to sustain interleaved multimodal generation.
ChatUMM derives its capabilities from an interleaved multi-turn training strategy that models serialized text-image streams as a continuous conversational flow.
arXiv Detail & Related papers (2026-02-06T07:11:50Z)
- LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning [27.13598270494417]
LTS-VoiceAgent is a Listen-Think-Speak framework that separates when to think from how to reason incrementally.
It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker and a foreground Speaker.
arXiv Detail & Related papers (2026-01-26T15:42:35Z)
- Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage [66.67531241554546]
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines.
We introduce the first approach to extend tool use directly into speech-in speech-out systems.
We propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech.
arXiv Detail & Related papers (2025-10-02T14:18:20Z)
- Chronological Thinking in Full-Duplex Spoken Dialogue Language Models [66.84843878538207]
Chronological Thinking aims to improve response quality in full-duplex spoken dialogue language models (SDLMs).
No additional latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay.
Results: Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
arXiv Detail & Related papers (2025-10-02T10:28:11Z)
- DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving [14.700431530275132]
We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes.
Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 4x throughput improvement and about 3.1x faster prefill (time to first token).
arXiv Detail & Related papers (2024-11-05T05:41:41Z)
- Asynchronous Tool Usage for Real-Time Agents [61.3041983544042]
We introduce asynchronous AI agents capable of parallel processing and real-time tool-use.
Our key contribution is an event-driven finite-state machine architecture for agent execution and prompting.
This work presents both a conceptual framework and practical tools for creating AI agents capable of fluid, multitasking interactions.
arXiv Detail & Related papers (2024-10-28T23:57:19Z)
- Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models [0.0]
We introduce a dynamic benchmarking system for conversational agents that evaluates their performance through a single, simulated, and lengthy user interaction.
We context switch regularly to interleave the tasks, which constructs a realistic testing scenario in which we assess the Long-Term Memory, Continual Learning, and Information Integration capabilities of the agents.
arXiv Detail & Related papers (2024-09-30T12:01:29Z)
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Enabling Real-Time Conversations with Minimal Training Costs [61.80370154101649]
This paper presents a new duplex decoding approach that enhances large language models with duplex ability, requiring minimal training.
Experimental results indicate that our proposed method significantly enhances the naturalness and human-likeness of user-AI interactions with minimal training costs.
arXiv Detail & Related papers (2024-09-18T06:27:26Z)