Fugu-MT Paper Translation (Abstract): Probing Cross-modal Information Hubs in Audio-Visual LLMs

Paper summary: Probing Cross-modal Information Hubs in Audio-Visual LLMs

arxiv url: http://arxiv.org/abs/2605.10815v2
Date: Tue, 12 May 2026 02:51:58 GMT
Status: Translation complete
Last updated in system: 2026-05-13 18:21:07.139391
Title: Probing Cross-modal Information Hubs in Audio-Visual LLMs
Authors: Jihoo Jung, Chaeyoung Jung, Ji-Hoon Kim, Joon Son Chung
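Abstract summary: Audio-visual large language models (AVLLMs) have emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. This paper investigates where information derived from one modality is encoded within the token representations of the other modality. It also proposes a training-free hallucination mitigation method that encourages reliance on the integrated cross-modal information held in cross-modal sink tokens.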
Reference score (independently computed attention score): 35.95951982211213
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Audio-visual large language models (AVLLMs) have recently emerged as a powerful architecture capable of jointly reasoning over audio, visual, and textual modalities. In AVLLMs, the bidirectional interaction between audio and video modalities introduces intricate processing dynamics, necessitating a deeper understanding of their internal mechanisms. However, unlike extensively studied text-only or large vision language models, the internal workings of AVLLMs remain largely unexplored. In this paper, we focus on cross-modal information flow between audio and visual modalities in AVLLMs, investigating where information derived from one modality is encoded within the token representations of the other modality. Through an analysis of multiple recent AVLLMs, we uncover two common findings. First, AVLLMs primarily encode integrated audio-visual information in sink tokens. Second, sink tokens do not uniformly hold cross-modal information. Instead, a distinct subset of sink tokens, which we term cross-modal sink tokens, specializes in storing such information. Based on these findings, we further propose a simple training-free hallucination mitigation method by encouraging reliance on integrated cross-modal information within cross-modal sink tokens. Our code is available at https://github.com/kaistmm/crossmodal-hub.
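As a rough illustration of the sink-token notion referred to in the abstract: in the attention-sink literature, a sink token is usually described as a position that attracts a disproportionately large share of attention from other positions. The sketch below flags such positions in a toy attention map. The function name `find_sink_tokens`, the `ratio` threshold, and the mean-received-attention criterion are assumptions made for illustration; the paper's own criteria for sink tokens and cross-modal sink tokens are not stated in this summary.

```python
def find_sink_tokens(attn, ratio=4.0):
    """Return indices of key positions that look like attention sinks.

    attn  : list of rows, one per query token; each row holds that query's
            attention weights over all key tokens (rows sum to 1).
    ratio : hypothetical threshold; a key is flagged when its mean received
            attention is at least `ratio` times the uniform share 1/n_keys.
    """
    n_queries = len(attn)
    n_keys = len(attn[0])
    # Mean attention each key position receives, averaged over all queries.
    received = [sum(row[k] for row in attn) / n_queries for k in range(n_keys)]
    uniform_share = 1.0 / n_keys
    return [k for k, r in enumerate(received) if r >= ratio * uniform_share]


# Toy 4x4 attention map (made-up numbers): every query puts most of its
# weight on position 0, so position 0 behaves like a sink.
attn = [
    [0.70, 0.10, 0.10, 0.10],
    [0.60, 0.20, 0.10, 0.10],
    [0.80, 0.05, 0.10, 0.05],
    [0.70, 0.10, 0.10, 0.10],
]
sinks = find_sink_tokens(attn, ratio=2.0)
print(sinks)
```

In a real model the attention map would come from a specific layer and head (or an average over them), and separating cross-modal sink tokens from other sink tokens would need an additional probe that this sketch does not attempt.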

Related papers