Fugu-MT paper translation (abstract): MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

Paper overview: MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models

arxiv url: http://arxiv.org/abs/2604.12928v2
Date: Fri, 17 Apr 2026 07:00:02 GMT
Status: translation complete
Last updated in system: 2026-04-20 13:38:49.297616
Title: MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Speech Language Models
Title (reference translation): MoshiRAG: Asynchronous Knowledge Retrieval for Full-Duplex Language Models
Authors: Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, Alexandre Défossez
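Abstract summary: Full-duplex speech models are distinguished by their real-time interactivity and natural handling of pauses. The framework enables the model to identify knowledge-demanding dialogue queries and ground its responses in external information. The design supports plug-and-play retrieval methods without retraining and performs strongly on out-of-domain mathematical reasoning tasks.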
Reference score (independently computed attention score): 62.05118198431989
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Speech-to-speech language models have recently emerged to enhance the naturalness of conversational AI. In particular, full-duplex models are distinguished by their real-time interactivity, including handling of pauses, interruptions, and backchannels. However, improving their factuality remains an open challenge. While scaling the model size could address this gap, it would make real-time inference prohibitively expensive. In this work, we propose MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. Our asynchronous framework enables the model to identify knowledge-demanding queries and ground its responses in external information. By leveraging the natural temporal gap between response onset and the delivery of core information, the retrieval process can be completed while maintaining a natural conversation flow. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. Moreover, our flexible design supports plug-and-play retrieval methods without retraining and demonstrates strong performance on out-of-domain mathematical reasoning tasks.
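Abstract (reference translation): Speech-to-speech language models have recently emerged to make conversational AI more natural. Full-duplex models in particular are distinguished by their real-time interactivity, including the handling of pauses, interruptions, and backchannels. Improving their factuality, however, remains an open challenge. Scaling up the model size could close this gap, but it would make real-time inference prohibitively expensive. This work proposes MoshiRAG, a modular approach that combines a compact full-duplex interface with selective retrieval to access more powerful knowledge sources. The asynchronous framework lets the model identify knowledge-demanding queries and ground its responses in external information. By exploiting the natural temporal gap between response onset and the delivery of core information, retrieval can complete while a natural conversation flow is maintained. With this approach, MoshiRAG achieves factuality comparable to the best publicly released non-duplex speech language models while preserving the interactivity inherent to full-duplex systems. In addition, the flexible design supports plug-and-play retrieval methods without retraining and shows strong performance on out-of-domain mathematical reasoning tasks.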
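
Illustrative sketch (not the authors' implementation): the abstract describes starting retrieval asynchronously and using the gap between response onset and the delivery of core information to hide its latency. The toy Python below mimics that timing with `asyncio`. Every name here (`retrieve`, `needs_retrieval`, `respond`), the question-mark heuristic, and the delays are hypothetical placeholders; in the paper the model itself identifies knowledge-demanding queries.

```python
import asyncio


async def retrieve(query: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for a slow external knowledge source."""
    await asyncio.sleep(delay)  # simulated retrieval latency
    return f"[retrieved fact for: {query}]"


def needs_retrieval(query: str) -> bool:
    """Toy rule for illustration only; the paper's model makes this decision."""
    return query.rstrip().endswith("?")


async def respond(query: str) -> list[str]:
    """Return the spoken chunks in the order they would be delivered."""
    chunks: list[str] = []
    # Launch retrieval in the background as soon as the query looks
    # knowledge-demanding, without blocking the start of the response.
    task = asyncio.create_task(retrieve(query)) if needs_retrieval(query) else None

    # Response onset: opening words that carry no core information yet.
    for opener in ("Good question.", "Let me think."):
        chunks.append(opener)
        await asyncio.sleep(0.03)  # simulated time spent speaking

    # Core information: grounded in the retrieved text when retrieval was started.
    if task is not None:
        chunks.append(await task)
    else:
        chunks.append("[answer from the model's own knowledge]")
    return chunks


if __name__ == "__main__":
    print(asyncio.run(respond("Who wrote Dune?")))
```

In this sketch the two opening chunks take longer to "speak" than the simulated retrieval takes to finish, so the final `await task` does not stall the response; a real system would need a fallback for retrievals that run longer than the opening.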

Related papers