Fugu-MT Paper Translation (Abstract): How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

Paper overview: How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

arxiv url: http://arxiv.org/abs/2605.10199v1
Date: Mon, 11 May 2026 08:46:47 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.664406
Title: How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
Authors: Hui Lu, Xueyuan Chen, Huimeng Wang, Shuhai Peng, Shiyin Kang, Xixin Wu, Zhiyong Wu
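Abstract summary: Spoken dialogue systems need to support user input that arrives during generation. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. Cross-attention routing underperforms on question answering but better preserves the LLM generation context.
Reference score (attention score computed independently by the site): 36.88464167279495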
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.
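The two routing strategies named in the abstract can be contrasted with a toy sketch. This is not the authors' implementation: the function names, the use of element-wise summation for fusion, the scalar `gate`, and the omission of learned query/key/value projections are all assumptions made for illustration. It only shows the structural difference: fusion alters the sequence the LLM consumes, while cross-attention leaves that sequence intact and reads the user stream as a side memory.

```python
import math

def channel_fusion(llm_embeds, user_embeds):
    """Assumed fusion rule: add each time-aligned user frame to the LLM input embedding."""
    return [[a + u for a, u in zip(a_t, u_t)]
            for a_t, u_t in zip(llm_embeds, user_embeds)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_routing(hidden, user_memory, gate=0.5):
    """Assumed adapter: step t attends over user frames 0..t and adds a gated residual.

    Keys and values are the raw user frames (no learned projections), which is a
    simplification for this sketch.
    """
    d = len(hidden[0])
    out = []
    for t, h in enumerate(hidden):
        visible = user_memory[: t + 1]  # causal: no access to future user frames
        scores = [sum(hi * ki for hi, ki in zip(h, k)) / math.sqrt(d) for k in visible]
        weights = softmax(scores)
        context = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        out.append([hi + gate * cj for hi, cj in zip(h, context)])
    return out

if __name__ == "__main__":
    llm = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical LLM-side embeddings, 2 steps
    user = [[0.5, 0.5], [1.0, -1.0]]  # hypothetical user-stream frames, 2 steps
    print(channel_fusion(llm, user))
    print(cross_attention_routing(llm, user, gate=0.0) == llm)
```

In this sketch, setting `gate=0.0` returns the hidden states unchanged, which mirrors the abstract's point that cross-attention routing can leave the LLM generation context intact, whereas the fused input always carries the overlapping user stream.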

Related papers