Fugu-MT Paper Translation (Abstract): Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation

Paper overview: Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation

arxiv url: http://arxiv.org/abs/2511.09984v1
Date: Fri, 14 Nov 2025 01:23:50 GMT
Status: Translation complete
System update date: 2025-11-14 22:53:22.619643
Title: Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation
Title (reference translation): Language Drift in Multilingual Retrieval-Augmented Generation: Characterization and Decoding-Time Mitigation
Authors: Bo Li, Zhenghua Xu, Rui Xie
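Abstract summary: This work studies output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Experiments show that the drift results from decoder-level collapse, in which dominant token distributions and high-frequency English patterns override the intended generation language. The authors propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language.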
Reference score (independently computed attention score): 11.110312833458421
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multilingual Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to perform knowledge-intensive tasks in multilingual settings by leveraging retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and in-context exemplars, the model often exhibits language drift by generating responses in an unintended language. This phenomenon is especially pronounced during reasoning-intensive decoding, such as Chain-of-Thought (CoT) generation, where intermediate steps introduce further language instability. In this paper, we systematically study output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Our controlled experiments reveal that the drift results not from comprehension failure but from decoder-level collapse, where dominant token distributions and high-frequency English patterns dominate the intended generation language. We further observe that English serves as a semantic attractor under cross-lingual conditions, emerging as both the strongest interference source and the most frequent fallback language. To mitigate this, we propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments across three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, providing an effective and generalizable solution in multilingual RAG.
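Abstract (reference translation): Multilingual Retrieval-Augmented Generation (RAG) lets large language models (LLMs) carry out knowledge-intensive tasks in multilingual settings by using retrieved documents as external evidence. However, when the retrieved evidence differs in language from the user query and the in-context exemplars, the model often shows language drift, producing responses in an unintended language. The phenomenon is especially pronounced during reasoning-intensive decoding such as Chain-of-Thought (CoT) generation, where intermediate steps introduce language instability. This paper systematically studies output language drift in multilingual RAG across multiple datasets, languages, and LLM backbones. Controlled experiments show that the drift stems not from comprehension failure but from decoder-level collapse. The authors further observe that English acts as a semantic attractor under cross-lingual conditions, appearing as both the strongest interference source and the most frequent fallback language. To address this, they propose Soft Constrained Decoding (SCD), a lightweight, training-free decoding strategy that gently steers generation toward the target language by penalizing non-target-language tokens. SCD is model-agnostic and can be applied to any generation algorithm without modifying the architecture or requiring additional data. Experiments on three multilingual datasets and multiple typologically diverse languages show that SCD consistently improves language alignment and task performance, offering an effective and generalizable solution for multilingual RAG.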
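The abstract describes SCD only at a high level: non-target-language tokens are penalized at decoding time rather than forbidden. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a fixed additive penalty on the logits and a simple Unicode-script heuristic for deciding which tokens are off-target, neither of which is confirmed by the abstract. The names `is_japanese`, `is_neutral`, `soft_constrained_logits`, and `penalty`, as well as the toy vocabulary and logits, are hypothetical.

```python
import math


def is_japanese(token: str) -> bool:
    """Assumed heuristic: a token is Japanese if it contains Hiragana,
    Katakana, or a CJK unified ideograph."""
    return any(
        0x3040 <= ord(ch) <= 0x30FF or 0x4E00 <= ord(ch) <= 0x9FFF
        for ch in token
    )


def is_neutral(token: str) -> bool:
    """Digits, punctuation, and whitespace carry no language signal."""
    return not any(ch.isalpha() for ch in token)


def soft_constrained_logits(logits, vocab, penalty=2.0):
    """Subtract a fixed penalty from tokens that are neither target-language
    nor neutral. The constraint is soft: penalized tokens stay reachable."""
    adjusted = []
    for logit, token in zip(logits, vocab):
        off_target = not is_japanese(token) and not is_neutral(token)
        adjusted.append(logit - penalty if off_target else logit)
    return adjusted


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits, vocab):
    """Return the token with the highest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


# Toy next-token distribution in which English tokens lead (made-up values).
vocab = ["The", "answer", "答え", "は", "42", "。"]
logits = [2.5, 2.0, 1.8, 1.0, 0.5, 0.2]

before = greedy_pick(logits, vocab)
adjusted = soft_constrained_logits(logits, vocab, penalty=2.0)
after = greedy_pick(adjusted, vocab)
probs = softmax(adjusted)
```

In this toy case the greedy choice moves from an English token to a Japanese one, while the English tokens keep a nonzero probability. That is what separates a soft penalty from hard vocabulary masking: named entities or code identifiers in another script can still be generated when the model is confident enough.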


Related papers