Fugu-MT paper translation (abstract): Context-Aware Decoding for Faithful Vision-Language Generation

Paper summary: Context-Aware Decoding for Faithful Vision-Language Generation

arxiv url: http://arxiv.org/abs/2601.05939v1
Date: Fri, 09 Jan 2026 16:50:57 GMT
Status: Translation complete
Last updated in system: 2026-01-12 17:41:50.047863
Title: Context-Aware Decoding for Faithful Vision-Language Generation
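Title (reference translation): Context-aware decoding for faithful vision-language generation
Authors: Mehrdad Fazli, Bowen Wei, Ziwei Zhu
Abstract summary: Hallucination, the generation of responses inconsistent with the visual input, is a critical limitation of large vision-language models (LVLMs). This work probes the layer-wise generation dynamics that drive hallucinations and proposes a training-free mitigation strategy.
Reference score (independently computed attention score): 5.258492912374723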
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.
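Abstract (reference translation): Hallucination, the generation of responses that contradict the visual input, is a critical limitation of large vision-language models (LVLMs), particularly in open-ended tasks such as image captioning and visual reasoning. This work probes the layer-wise generation dynamics that drive hallucinations and proposes a training-free mitigation strategy. Using the Logit Lens, it examines how LVLMs build next-token distributions across decoder layers and reveals a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinated ones. Building on this finding, it introduces Context Embedding Injection (CEI), a lightweight method that uses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity during decoding and suppress hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (maximum token length 512), CEI outperforms state-of-the-art baselines on three LVLMs, and its dynamic variant achieves the lowest overall hallucination rates. By combining a scalable intervention with new mechanistic insights, the work advances hallucination mitigation in LVLMs.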
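
Illustrative sketch (not from the paper): the abstract describes applying the Logit Lens across decoder layers and comparing how early tokens "commit" to their final candidate. The toy Python below shows one way such a measurement could look. Everything here is an assumption made for illustration: the function names `logit_lens` and `commitment_depth`, the 0.5 probability threshold, the identity unembedding matrix, and the hand-written hidden states. The abstract does not define commitment depth, so the threshold-based definition is only a plausible stand-in, and real LVLMs typically apply a final normalization layer before the unembedding.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden_states, unembed):
    # Project every layer's hidden state through the (shared) unembedding
    # matrix to obtain a next-token distribution per layer.
    # hidden_states: list of per-layer vectors of length d
    # unembed: list of vocabulary rows, each of length d
    dists = []
    for h in hidden_states:
        logits = [sum(w * x for w, x in zip(row, h)) for row in unembed]
        dists.append(softmax(logits))
    return dists

def commitment_depth(dists, threshold=0.5):
    # ASSUMED definition: the first layer at which the token chosen at the
    # final layer already holds at least `threshold` probability mass.
    final = dists[-1]
    token = max(range(len(final)), key=final.__getitem__)
    for layer, probs in enumerate(dists):
        if probs[token] >= threshold:
            return layer, token
    return None, token

# Toy setup: vocabulary of 3 tokens, hidden size 3, identity unembedding.
unembed = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical token that commits early (mass on token 0 by layer 1).
early_states = [[0.1, 0.0, 0.0], [3.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
# Hypothetical token that commits late (mass on token 2 only at layer 2).
late_states = [[0.0, 0.1, 0.0], [0.0, 0.2, 0.1], [0.0, 0.0, 4.0]]

early_depth, early_token = commitment_depth(logit_lens(early_states, unembed))
late_depth, late_token = commitment_depth(logit_lens(late_states, unembed))
print("early-committing token:", early_token, "depth:", early_depth)
print("late-committing token:", late_token, "depth:", late_depth)
```

In this toy, the gap between the two depths plays the role of the commitment-depth gap the abstract reports between truthful and hallucinatory tokens. The sketch does not attempt to reproduce CEI itself, since the abstract does not specify how the context embedding is injected.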

Related papers