ChipChat: Low-Latency Cascaded Conversational Agent in MLX
- URL: http://arxiv.org/abs/2509.00078v1
- Date: Tue, 26 Aug 2025 20:40:24 GMT
- Title: ChipChat: Low-Latency Cascaded Conversational Agent in MLX
- Authors: Tatiana Likhomanenko, Luke Carlson, Richard He Bai, Zijin Gu, Han Tran, Zakaria Aldeneh, Yizhe Zhang, Ruixiang Zhang, Huangjie Zheng, Navdeep Jaitly
- Abstract summary: ChipChat is a novel low-latency CS that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. Our work shows that strategically redesigned CSs can overcome their historical latency limitations, offering a promising path forward for practical voice-based AI agents.
- Score: 34.30974874671028
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of large language models (LLMs) has transformed spoken dialog systems, yet the optimal architecture for real-time on-device voice agents remains an open question. While end-to-end approaches promise theoretical advantages, cascaded systems (CSs) continue to outperform them in language understanding tasks, despite being constrained by sequential processing latency. In this work, we introduce ChipChat, a novel low-latency CS that overcomes traditional bottlenecks through architectural innovations and streaming optimizations. Our system integrates streaming (a) conversational speech recognition with mixture-of-experts, (b) state-action augmented LLM, (c) text-to-speech synthesis, (d) neural vocoder, and (e) speaker modeling. Implemented using MLX, ChipChat achieves sub-second response latency on a Mac Studio without dedicated GPUs, while preserving user privacy through complete on-device processing. Our work shows that strategically redesigned CSs can overcome their historical latency limitations, offering a promising path forward for practical voice-based AI agents.
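The latency advantage the abstract describes comes from making every stage of the cascade streaming, so downstream stages start consuming output before upstream stages finish. The following is a minimal, hypothetical sketch of that idea using Python generators; the stage functions (`asr`, `llm`, `tts`) are stand-ins for illustration, not ChipChat's actual MLX components.

```python
# Hypothetical cascaded streaming pipeline: ASR -> LLM -> TTS.
# Each stage is a generator that consumes its input incrementally,
# so the first audio chunk is produced after one pass through the
# chain rather than after the full input has been processed.

def asr(audio_frames):
    """Stand-in streaming recognizer: emits one token per frame."""
    for frame in audio_frames:
        yield f"word{frame}"

def llm(tokens):
    """Stand-in LLM: responds to each recognized token as it arrives."""
    for tok in tokens:
        yield f"reply-to-{tok}"

def tts(chunks):
    """Stand-in synthesizer: turns each text chunk into an 'audio' chunk."""
    for chunk in chunks:
        yield f"<audio:{chunk}>"

def run_pipeline(audio_frames):
    # Chaining generators overlaps the stages: tts pulls from llm,
    # which pulls from asr, one item at a time.
    return list(tts(llm(asr(audio_frames))))

out = run_pipeline([0, 1, 2])
print(out[0])  # first synthesized chunk, available before frame 2 is read
```

In a real system each stage would run concurrently (e.g. on separate threads or async tasks with bounded queues); the generator chain above only illustrates the per-item overlap that lets a cascade avoid sequential end-to-end latency.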
Related papers
- LTS-VoiceAgent: A Listen-Think-Speak Framework for Efficient Streaming Voice Interaction via Semantic Triggering and Incremental Reasoning [27.13598270494417]
LTS-VoiceAgent is a Listen-Think-Speak framework that separates when to think from how to reason incrementally. It features a Dynamic Semantic Trigger to detect meaningful prefixes, and a Dual-Role Stream Orchestrator that coordinates a background Thinker and a foreground Speaker.
arXiv Detail & Related papers (2026-01-26T15:42:35Z) - SLAM-LLM: A Modular, Open-Source Multimodal Large Language Model Framework and Best Practice for Speech, Language, Audio and Music Processing [77.87631792556942]
SLAM-LLM is an open-source framework designed to train customized Multimodal Large Language Models (MLLMs). It provides a modular configuration of different encoders, projectors, LLMs, and parameter-efficient fine-tuning plugins. It includes high-performance checkpoints for tasks like Automatic Speech Recognition (ASR), Automated Audio Captioning (AAC), and Music Captioning (MC).
arXiv Detail & Related papers (2026-01-14T11:25:36Z) - KAME: Tandem Architecture for Enhancing Knowledge in Real-Time Speech-to-Speech Conversational AI [14.667102744113295]
Real-time speech-to-speech (S2S) models excel at generating low-latency conversational responses but often lack deep knowledge and semantic understanding. Cascaded systems combining automatic speech recognition, a text-based Large Language Model (LLM), and text-to-speech synthesis offer superior knowledge representation at the cost of high latency. This paper introduces a novel hybrid architecture that bridges the gap between these two paradigms.
arXiv Detail & Related papers (2025-09-26T00:46:34Z) - PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction [29.64357898080842]
Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. Their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. We propose Predictive Generation (PredGen), a novel framework that mitigates, or even eliminates, this delay through speculative decoding at input time.
arXiv Detail & Related papers (2025-06-18T15:29:02Z) - StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling [50.537794606598254]
StreamMel is a pioneering single-stage streaming TTS framework that models continuous mel-spectrograms. It enables low-latency, autoregressive synthesis while preserving high speaker similarity and naturalness. It even achieves performance comparable to offline systems while supporting efficient real-time generation.
arXiv Detail & Related papers (2025-06-14T16:53:39Z) - SpeakStream: Streaming Text-to-Speech with Interleaved Data [11.131427505801062]
We present SpeakStream, a streaming TTS system that generates audio incrementally from streaming text using a decoder-only architecture. During inference, SpeakStream generates speech incrementally while absorbing streaming input text. Our experiments demonstrate that SpeakStream achieves state-of-the-art latency while maintaining the quality of non-streaming TTS systems.
arXiv Detail & Related papers (2025-05-25T16:11:10Z) - Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge [57.01131456894516]
Current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. We propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications.
arXiv Detail & Related papers (2025-01-23T08:33:10Z) - Large Generative Model-assisted Talking-face Semantic Communication System [55.42631520122753]
This study introduces a Large Generative Model-assisted Talking-face Semantic Communication (LGM-TSC) system.
Generative Semantic Extractor (GSE) at the transmitter converts semantically sparse talking-face videos into texts with high information density.
Private Knowledge Base (KB) based on the Large Language Model (LLM) for semantic disambiguation and correction.
Generative Semantic Reconstructor (GSR) that utilizes BERT-VITS2 and SadTalker models to transform text back into a high-QoE talking-face video.
arXiv Detail & Related papers (2024-11-06T12:45:46Z) - Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models [0.9285295512807729]
The Audio Question Answering (AQA) task includes audio event classification, audio captioning, and open-ended reasoning. LALMs excel in general audio understanding, but are limited in temporal reasoning. This paper addresses these challenges and limitations in audio temporal reasoning.
arXiv Detail & Related papers (2024-09-10T05:26:53Z) - Language Model Can Listen While Speaking [17.584201137311286]
Listen-while-speaking language model (LSLM) is an end-to-end system equipped with both listening and speaking channels.
Our results highlight LSLM's capability to achieve duplex communication with minimal impact on existing systems.
arXiv Detail & Related papers (2024-08-05T16:47:22Z) - SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
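The input-time speculation idea described under PredGen above can be sketched as follows. This is a hypothetical illustration, not PredGen's actual algorithm: while the user is still speaking, a draft response is generated from the partial transcript; if the final transcript matches a prompt already speculated on, the draft is reused, hiding the generation latency behind the user's own speech. The `fake_generate` function and `Speculator` class are stand-ins invented for this sketch.

```python
# Hypothetical input-time speculation: overlap LLM generation with
# user speech by drafting responses to partial transcripts.

def fake_generate(prompt):
    """Stand-in for an LLM call; deterministic for illustration."""
    return f"answer({prompt})"

class Speculator:
    def __init__(self):
        self.drafts = {}  # partial prompt -> speculative response

    def on_partial(self, partial_prompt):
        # Called during user speech: generate a draft in the background.
        self.drafts[partial_prompt] = fake_generate(partial_prompt)

    def on_final(self, final_prompt):
        # At end of utterance: reuse the draft if we already
        # speculated on this exact prompt, otherwise generate now.
        if final_prompt in self.drafts:
            return self.drafts[final_prompt], True   # hit: latency hidden
        return fake_generate(final_prompt), False    # miss: pay full cost

spec = Speculator()
spec.on_partial("what time is")
spec.on_partial("what time is it")
resp, hit = spec.on_final("what time is it")
```

A real system would also need to verify speculated outputs against the final transcript when it only extends (rather than matches) a speculated prefix; the dictionary lookup here captures only the exact-match case.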
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.