Fugu-MT Paper Translation (Abstract): Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Paper overview: Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

arxiv url: http://arxiv.org/abs/2505.16652v1
Date: Thu, 22 May 2025 13:19:57 GMT
Status: Translation complete
Last updated in system: 2025-05-23 17:12:48.31698
Title: Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding
Title (reference translation): Seeing Far and Seeing Clearly: Mitigating Hallucinations in MLLMs via Attention Causal Decoding
Authors: Feilong Tang, Chengzhi Liu, Zhongxing Xu, Ming Hu, Zelin Peng, Zhiwei Yang, Jionglong Su, Minquan Lin, Yifan Peng, Xuelian Cheng, Imran Razzak, Zongyuan Ge
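Abstract summary: We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. FarSight is a versatile plug-and-play decoding strategy that reduces attention interference from outlier tokens.
Reference score (site-computed interest level): 33.33247964758369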
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in multimodal large language models (MLLMs) have significantly improved performance in visual question answering. However, they often suffer from hallucinations. In this work, hallucinations are categorized into two main types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose to leverage causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between those tokens may lead the model to rely on outlier tokens, overlooking dense and rich contextual cues. Therefore, we propose to intervene in the propagation process by tackling outlier tokens to enhance in-context inference. With this goal, we present FarSight, a versatile plug-and-play decoding strategy to reduce attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. Moreover, a positional awareness encoding method with a diminishing masking rate is proposed, allowing the model to attend to further preceding tokens, especially for video sequence tasks. With extensive experiments, FarSight demonstrates significant hallucination-mitigating performance across different MLLMs on both image and video benchmarks, proving its effectiveness.
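Abstract (reference translation): Recent advances in multimodal large language models (MLLMs) have markedly improved performance in visual question answering. However, these models often suffer from hallucinations. In this work, hallucinations are categorized into two types: initial hallucinations and snowball hallucinations. We argue that adequate contextual information can be extracted directly from the token interaction process. Inspired by causal inference in the decoding strategy, we propose leveraging causal masks to establish information propagation between multimodal tokens. The hypothesis is that insufficient interaction between these tokens may lead the model to rely on outlier tokens and overlook dense, rich contextual cues. We therefore propose to intervene in the propagation process by tackling outlier tokens, thereby enhancing in-context inference. To this end, we present FarSight, a versatile plug-and-play decoding strategy that reduces attention interference from outlier tokens merely by optimizing the causal mask. The heart of our method is effective token propagation. We design an attention register structure within the upper triangular matrix of the causal mask, dynamically allocating attention to capture attention diverted to outlier tokens. In addition, we propose a positional awareness encoding method with a diminishing masking rate, which allows the model to attend to more distant preceding tokens, especially in video sequence tasks. In extensive experiments, FarSight demonstrates hallucination-mitigating performance across different MLLMs on both image and video benchmarks, confirming its effectiveness.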
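The abstract describes the attention register only at a high level, so the following is a minimal illustrative sketch and not the authors' implementation. It assumes that the upper triangle of the causal mask holds finite "register" logits in place of negative infinity, that these logits take part in the softmax, and that the future positions are zeroed afterwards, so each row can hand part of its attention mass to the registers. The function names, the `decay` parameter, and the linear form of the register logits are hypothetical choices made for illustration. The positional awareness encoding with a diminishing masking rate is not modeled here.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(scores):
    """Standard causal attention: future positions get -inf before softmax."""
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        out.append(softmax(row))
    return out


def register_attention(scores, decay=1.0):
    """Causal attention with hypothetical register logits in the upper triangle.

    Assumption: the register logit at (i, j) with j > i is -decay * (j - i).
    The registers join the softmax and are then zeroed, so a row's visible
    weights may sum to less than 1; the remainder was absorbed by the registers.
    """
    n = len(scores)
    out = []
    for i in range(n):
        row = [scores[i][j] if j <= i else -decay * (j - i) for j in range(n)]
        probs = softmax(row)
        # Zero the future positions so the output is still strictly causal.
        out.append([p if j <= i else 0.0 for j, p in enumerate(probs)])
    return out


if __name__ == "__main__":
    # Toy logits: every query-key score is equal.
    scores = [[0.0] * 4 for _ in range(4)]
    for name, fn in (("causal", causal_attention), ("register", register_attention)):
        print(name, [round(sum(r), 3) for r in fn(scores)])
```

In this sketch the early rows, which have many future positions, give up the most attention mass, while the last row has no registers and behaves like ordinary causal attention. Whether this matches the allocation used in FarSight cannot be confirmed from the abstract alone.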

Related papers