Fugu-MT Paper Translation (Summary): Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

Paper summary: Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

arxiv url: http://arxiv.org/abs/2605.09906v1
Date: Mon, 11 May 2026 02:50:27 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.482077
Title: Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought
Title (reference translation): Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs with Modality-Specific Chain-of-Thought
Authors: Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang
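Abstract summary: We propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework for reducing cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and then integrating the evidence to answer. Experiments show consistent improvements in both accuracy and robustness, with an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
Reference score (independently computed attention score): 49.53567098922619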
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and then integrating the evidence for answering. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
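Abstract (reference translation): Audio and vision provide complementary evidence for audio-visual question answering, but current audio-visual large language models can suffer from cross-modal interference. We attribute this problem to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate it, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework for reducing cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and then integrating the evidence to answer. We construct modality-preference labels using a data pipeline run under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while allowing full access to cross-modal information at the evidence fusion stage. Experiments show consistent improvements in both accuracy and robustness, with an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.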
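
Note on the reported numbers: the abstract states its results as average relative gains, which differ from absolute percentage-point differences. The sketch below illustrates the usual definition, 100 * (improved - baseline) / baseline, averaged over benchmarks. The function names, the unweighted averaging, and the scores are assumptions made for illustration; the abstract does not state how the average was taken, and none of the numbers in the code come from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative gain in percent: 100 * (improved - baseline) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (improved - baseline) / baseline


def average_relative_gain(pairs):
    """Unweighted mean of per-benchmark relative gains (an assumed convention)."""
    gains = [relative_gain(base, new) for base, new in pairs]
    return sum(gains) / len(gains)


# Hypothetical (baseline, improved) accuracies; NOT taken from the paper.
hypothetical_scores = [(50.0, 53.0), (80.0, 82.0)]

avg = average_relative_gain(hypothetical_scores)
print(f"{avg:.2f}%")  # prints "4.25%"
```

In this made-up example the absolute improvements are 3 and 2 points, while the relative gains are 6% and 2.5%. The same absolute improvement counts for more when the baseline is lower.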


Related papers