SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
- URL: http://arxiv.org/abs/2510.06917v2
- Date: Sat, 18 Oct 2025 07:57:12 GMT
- Title: SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
- Authors: Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
- Abstract summary: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. We propose SHANKS, a framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input.
- Score: 158.18422855768756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/
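The inference loop described in the abstract — stream speech in fixed-duration chunks, generate unspoken reasoning after each chunk, then decide whether to interrupt — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the chunk duration, the `think` and `should_interrupt` helpers, and the placeholder decision rule are all hypothetical stand-ins for the SLM's actual reasoning and interruption policy.

```python
from dataclasses import dataclass, field

CHUNK_SECONDS = 2.0  # illustrative fixed chunk duration; not specified here


@dataclass
class ShanksState:
    """Accumulated context: all speech chunks heard so far plus all
    unspoken reasoning generated so far."""
    speech_chunks: list = field(default_factory=list)
    reasoning: list = field(default_factory=list)


def think(state: ShanksState, chunk: str) -> str:
    """Stand-in for the SLM's unspoken chain-of-thought step.

    A real model would condition on all previous speech and reasoning;
    here we just record a placeholder thought."""
    return f"thought about chunk {len(state.speech_chunks)}"


def should_interrupt(thought: str) -> bool:
    # Placeholder rule; in SHANKS the model decides from its own reasoning.
    return "mistake" in thought


def run_shanks(speech_stream):
    """Consume speech chunks while 'the user is still speaking'; think
    after each chunk; interrupt early or respond when the turn ends."""
    state = ShanksState()
    for chunk in speech_stream:
        state.speech_chunks.append(chunk)
        thought = think(state, chunk)  # unspoken: never shown to the user
        state.reasoning.append(thought)
        if should_interrupt(thought):
            return "interrupt", state
    return "respond", state  # turn ended with latent reasoning already built
```

The key property the sketch captures is that reasoning is interleaved with listening rather than deferred to the end of the turn, so by the time the user finishes, the model has already accumulated the thoughts it needs to respond or call tools.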
Related papers
- Do LLMs Benefit From Their Own Words? [56.73014497206615]
We find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side context can reduce cumulative context lengths by up to 10x. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.
arXiv Detail & Related papers (2026-02-27T18:58:26Z) - Plantain: Plan-Answer Interleaved Reasoning [38.046123106961176]
Reasoning models often spend a significant amount of time thinking before they generate a visible response. We propose interleaved reasoning, in which the model alternates between thinking and surfacing intermediate responses. In Plantain, the first intermediate response is an explicit, step-by-step plan for executing the task.
arXiv Detail & Related papers (2025-12-02T19:22:12Z) - Chronological Thinking in Full-Duplex Spoken Dialogue Language Models [66.84843878538207]
Chronological Thinking aims to improve response quality in full-duplex spoken dialogue language models (SDLMs). It adds no latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
arXiv Detail & Related papers (2025-10-02T10:28:11Z) - STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for Spoken Language Models [131.90117151306993]
Spoken Language Models (SLMs) are designed to take speech inputs and produce spoken responses. Current SLMs lack the ability to perform an internal, unspoken thinking process before responding. We propose Stitch, a novel generation method that alternates between the generation of unspoken reasoning chunks and spoken response chunks.
arXiv Detail & Related papers (2025-07-21T08:30:03Z) - Interactive Reasoning: Visualizing and Controlling Chain-of-Thought Reasoning in Large Language Models [54.85405423240165]
We introduce Interactive Reasoning, an interaction design that visualizes chain-of-thought outputs as a hierarchy of topics. We implement interactive reasoning in Hippo, a prototype for AI-assisted decision making in the face of uncertain trade-offs.
arXiv Detail & Related papers (2025-06-30T10:00:43Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can also provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - Turn-Taking Prediction for Natural Conversational Speech [40.189938418201656]
A common conversational utterance often involves multiple queries with turn-taking.
Disfluencies include pausing to think, hesitations, word lengthening, filled pauses and repeated phrases.
We present a turn-taking predictor built on top of the end-to-end (E2E) speech recognizer.
arXiv Detail & Related papers (2022-08-29T01:09:23Z) - Improved Goal Oriented Dialogue via Utterance Generation and Look Ahead [5.062869359266078]
Intent prediction can be improved by training a deep text-to-text neural model to generate successive user utterances from unlabeled dialogue data.
We present a novel look-ahead approach that uses user utterance generation to improve intent prediction in time.
arXiv Detail & Related papers (2021-10-24T11:12:48Z) - "Wait, I'm Still Talking!" Predicting the Dialogue Interaction Behavior Using Imagine-Then-Arbitrate Model [24.560203199376478]
In real human-human conversations, human often sequentially sends several short messages for readability instead of a long message in one turn.
We propose a novel Imagine-then-Arbitrate (ITA) neural dialogue model to help the agent decide whether to wait or to make a response directly.
arXiv Detail & Related papers (2020-02-22T04:05:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.