Fugu-MT Paper Translation (Abstract): SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Paper overview: SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

arxiv url: http://arxiv.org/abs/2510.06917v1
Date: Wed, 08 Oct 2025 11:48:59 GMT
Status: Translation complete
Last updated in system: 2025-10-09 16:41:20.468643
Title: SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Title (reference translation): SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models
Authors: Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang
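Abstract summary: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and leads to high response latency while it waits to think. SHANKS is a framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input.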
Reference score (independently computed attention level): 158.18422855768756
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user's turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally "think while listening." In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS can be found at https://d223302.github.io/SHANKS/
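Abstract (reference translation): Current large language models (LLMs) and spoken language models (SLMs) think and act only after the user has finished their turn. This prevents the model from interacting during the user's turn and results in high response latency while it waits to think. As a result, thinking only after receiving the complete input is unsuited to speech-to-speech interaction, where real-time, low-latency exchange matters. We address this problem by noting that humans naturally "think while listening." In this paper we propose SHANKS, a general inference framework that lets an SLM generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, once a chunk is received, generates unspoken reasoning based on all previous speech and reasoning while the user keeps speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and whether to make tool calls to complete the task. We show that SHANKS improves real-time user-SLM interaction in two scenarios: (1) when the user presents a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of SHANKS are available at https://d223302.github.io/SHANKS/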
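
The chunked listen-and-reason loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Thought`, `ListenState`, `fixed_duration_chunks`, `think_while_listening`, and `toy_reason` are hypothetical, the SLM is replaced by a toy arithmetic checker, and text strings stand in for audio chunks. It only shows the control flow: after each chunk, reasoning is produced from all speech and reasoning so far, and that reasoning decides whether to interrupt or issue a tool call.

```python
import operator
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class Thought:
    """One piece of unspoken reasoning (hypothetical structure)."""
    text: str
    interrupt: bool = False
    tool_call: Optional[str] = None


@dataclass
class ListenState:
    chunks: List[str] = field(default_factory=list)
    thoughts: List[Thought] = field(default_factory=list)
    tool_calls: List[str] = field(default_factory=list)
    interrupted_at: Optional[int] = None


def fixed_duration_chunks(samples: Sequence, sample_rate: int, chunk_seconds: float):
    """Split a sample sequence into chunks of equal duration (last may be shorter)."""
    size = int(sample_rate * chunk_seconds)
    return [samples[i:i + size] for i in range(0, len(samples), size)]


def think_while_listening(
    chunks: Sequence[str],
    reason: Callable[[List[str], List[Thought]], Thought],
) -> ListenState:
    """Reason after every chunk, using all previous chunks and thoughts."""
    state = ListenState()
    for index, chunk in enumerate(chunks):
        state.chunks.append(chunk)
        # The reasoner sees everything heard so far plus earlier thoughts.
        thought = reason(state.chunks, state.thoughts)
        state.thoughts.append(thought)
        if thought.tool_call is not None:
            state.tool_calls.append(thought.tool_call)
        if thought.interrupt:
            state.interrupted_at = index
            break  # stop listening: the model takes the floor
    return state


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def toy_reason(chunks: List[str], thoughts: List[Thought]) -> Thought:
    """Stand-in for the SLM: checks the latest 'a op b = c' step (assumption)."""
    a, op, b, _, c = chunks[-1].split()
    ok = OPS[op](int(a), int(b)) == int(c)
    verdict = "ok" if ok else "wrong"
    return Thought(text=f"step {len(chunks)} {verdict}", interrupt=not ok)


# The user makes a mistake in the second step, so the loop stops there.
state = think_while_listening(["3 + 4 = 7", "7 * 2 = 15", "15 - 1 = 14"], toy_reason)
print(state.interrupted_at, [t.text for t in state.thoughts])
```

In a real system the `reason` callable would wrap an SLM call and the chunks would be audio segments, for example those produced by `fixed_duration_chunks`; both are left abstract here.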


Related papers