Fugu-MT Paper Translation (Summary): Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

Paper summary: Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models

arxiv url: http://arxiv.org/abs/2604.03556v1
Date: Sat, 04 Apr 2026 02:46:58 GMT
Status: translation complete
Last updated in system: 2026-04-07 15:49:18.637929
Title: Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
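Title (reference translation): Focus: Phase-Aware Suppression of Hallucination in Vision-Language Models
Authors: Sohyeon Kim, Sang Yeon Yoon, Kyeongbo Kong
Abstract summary: We investigate the internal attention dynamics of vision encoders in Large Vision-Language Models (LVLMs). Our analysis shows that hallucination behavior is particularly sensitive to tokens that receive low attention during the focus phase. We propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase.
Reference score (attention metric computed by this site): 8.304027910542446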
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (LVLMs) have achieved impressive progress in multimodal reasoning, yet they remain prone to object hallucinations, generating descriptions of objects that are not present in the input image. Recent approaches attempt to mitigate hallucinations by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, resulting in substantial inference latency. In this work, we investigate the internal attention dynamics of vision encoders in LVLMs and identify a consistent three-phase structure of visual information processing: diffusion, focus, and rediffusion. Our analysis reveals that hallucination behavior is particularly sensitive to tokens receiving low attention during the focus phase. Motivated by this observation, we propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method operates in a training-free manner using statistics from a single forward pass and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies demonstrate that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared to adversarial uncertainty estimation methods, our approach achieves comparable hallucination mitigation with negligible additional inference latency.
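Abstract (reference translation): Large Vision-Language Models (LVLMs) have made remarkable progress in multimodal reasoning, yet they remain prone to object hallucination, generating descriptions of objects that are not present in the input image. Recent approaches try to mitigate hallucination by suppressing unreliable visual signals in the vision encoder, but many rely on iterative optimization for each input, which incurs substantial inference latency. This work investigates the internal attention dynamics of vision encoders in LVLMs and identifies a consistent three-phase structure in visual information processing: diffusion, focus, and rediffusion. The analysis shows that hallucination behavior is particularly sensitive to tokens that receive low attention during the focus phase. Based on this observation, the authors propose a lightweight inference-time intervention that selectively suppresses such tokens during the focus phase. The method is training-free, uses statistics from a single forward pass, and employs a Determinantal Point Process (DPP) to preserve diverse visual cues while filtering redundant tokens. Extensive experiments across multiple LVLM backbones and decoding strategies show that the proposed approach consistently reduces hallucination metrics while maintaining competitive caption quality. Moreover, compared with adversarial uncertainty estimation methods, it achieves comparable hallucination mitigation with negligible additional inference latency.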
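The abstract mentions two ingredients: per-token attention statistics and a Determinantal Point Process (DPP) that keeps diverse visual cues while filtering redundant tokens. The sketch below illustrates that general idea only and is not the authors' implementation. It assumes a standard quality-times-similarity DPP kernel and the usual greedy MAP approximation. The names `build_kernel`, `greedy_dpp`, `suppression_mask`, the toy features, and the budget `k` are all hypothetical. How the paper detects the focus phase or sets the token budget is not stated in the abstract and is not modeled here.

```python
import math

def build_kernel(features, attention):
    """Assumed kernel: L[i][j] = a_i * cos(f_i, f_j) * a_j (quality x similarity)."""
    unit = []
    for f in features:
        norm = math.sqrt(sum(x * x for x in f)) or 1.0
        unit.append([x / norm for x in f])
    n = len(unit)
    return [[attention[i] * attention[j] *
             sum(a * b for a, b in zip(unit[i], unit[j]))
             for j in range(n)] for i in range(n)]

def greedy_dpp(kernel, k, eps=1e-10):
    """Greedy MAP approximation: repeatedly add the token with the largest
    marginal gain in log-determinant, tracked via an incremental Cholesky factor."""
    n = len(kernel)
    gains = [kernel[i][i] for i in range(n)]  # residual "novelty" of each token
    chol = [[] for _ in range(n)]             # partial Cholesky rows
    selected = []
    while len(selected) < min(k, n):
        best = max((i for i in range(n) if i not in selected),
                   key=lambda i: gains[i])
        if gains[best] < eps:                 # nothing novel is left
            break
        selected.append(best)
        d_best = math.sqrt(gains[best])
        for i in range(n):
            if i in selected:
                continue
            dot = sum(a * b for a, b in zip(chol[best], chol[i]))
            e = (kernel[best][i] - dot) / d_best
            chol[i].append(e)
            gains[i] -= e * e                 # redundancy with chosen tokens lowers the gain
    return selected

def suppression_mask(num_tokens, kept):
    """1.0 for kept tokens, 0.0 for suppressed ones (hard suppression is an assumption)."""
    return [1.0 if i in kept else 0.0 for i in range(num_tokens)]

# Toy data: tokens 0 and 1 are duplicates, token 3 receives little attention.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention = [0.9, 0.8, 0.6, 0.1]
selected = greedy_dpp(build_kernel(features, attention), k=2)
mask = suppression_mask(len(features), selected)
print(selected)  # [0, 2]
print(mask)      # [1.0, 0.0, 1.0, 0.0]
```

In this toy case the duplicate token 1 is dropped despite its high attention, and the low-attention token 3 is dropped as well, which is the qualitative behavior the abstract describes. A soft scaling factor in place of the hard 0.0 would be an equally plausible choice.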

Related papers

Hallucination Begins Where Saliency Drops [18.189047289404325]
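Hallucinations often arise when preceding output tokens show low saliency toward the prediction of the next token. LVLMs-Saliency is a gradient-aware diagnostic framework that quantifies the visual grounding strength of each output token. The method substantially reduces hallucination rates while preserving fluency and task performance, offering a robust and interpretable solution.
Paper reference translation (metadata) (2026-01-28T05:50:52Z)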
Context-Aware Decoding for Faithful Vision-Language Generation [5.258492912374723]
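Hallucination, the generation of responses that contradict the visual input, is a key limitation of large vision-language models (LVLMs). This work probes the layer-wise generation dynamics that drive hallucination and proposes a training-free mitigation strategy.
Paper reference translation (metadata) (2026-01-09T16:50:57Z)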
PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning [87.35309934860938]
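Hallucinations in multi-modal large language models (MLLMs) are strongly associated with insufficient attention allocated to visual tokens. We propose PruneHal, a simple yet effective training-free method that leverages adaptive KV cache pruning to focus on critical visual information.
Paper reference translation (metadata) (2025-10-22T02:41:07Z)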
IKOD: Mitigating Visual Attention Degradation in Large Vision-Language Models [20.036659182106806]
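This paper shows that Large Vision-Language Models (LVLMs) exhibit a length-related bias in which hallucination increases as the sequence grows longer. We propose Image attention-guided Key-value merging cOllaborative Decoding (IKOD), a collaborative decoding strategy that generates more image-centric sequences.
Paper reference translation (metadata) (2025-08-05T14:05:15Z)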
CAI: Caption-Sensitive Attention Intervention for Mitigating Object Hallucination in Large Vision-Language Models [60.0300765815417]
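Large Vision-Language Models (LVLMs) often generate content that deviates from the visual information, resulting in object hallucination. This paper proposes Caption-sensitive Attention Intervention (CAI).
Paper reference translation (metadata) (2025-06-30T07:52:36Z)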
SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding [5.976839106353883]
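SECOND (Selective and Contrastive Decoding) is a new approach that lets vision-language models exploit multi-scale visual information in an object-centric manner. SECOND significantly reduces perceptual hallucination and achieves superior results across a wide range of benchmarks.
Paper reference translation (metadata) (2025-06-10T02:55:38Z)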
Grounding Language with Vision: A Conditional Mutual Information Calibrated Decoding Strategy for Reducing Hallucinations in LVLMs [51.93737995405164]
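Large Vision-Language Models (LVLMs) are susceptible to hallucination. This paper introduces a Conditional Pointwise Mutual Information (C-PMI) calibrated decoding strategy. The proposed method is shown to significantly reduce hallucination in LVLMs while preserving decoding efficiency.
Paper reference translation (metadata) (2025-05-26T08:36:10Z)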
MINT: Mitigating Hallucinations in Large Vision-Language Models via Token Reduction [6.416957959150438]
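Hallucination hinders the application of Large Vision-Language Models (LVLMs) in domains that demand high reliability. We propose MINT, a training-free decoding method that mitigates hallucination through token reduction. Compared with the original models, the proposed method achieves a 4% improvement in mitigating hallucinations caused by perceptual deficiencies.
Paper reference translation (metadata) (2025-02-02T08:34:57Z)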
Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding [66.06337890279839]
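Large vision-language models (LVLMs) have shown remarkable capabilities in vision-language understanding for downstream multimodal tasks. However, LVLMs hallucinate in complex generation tasks, producing inconsistencies between the visual input and the generated content. This work proposes IMCCD, a training-free method for mitigating hallucination in LVLMs.
Paper reference translation (metadata) (2025-01-03T17:56:28Z)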

The related papers list is generated automatically from the titles and abstracts of papers on this site.
