Fugu-MT paper translation (abstract): What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?

Paper overview: What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?

arxiv url: http://arxiv.org/abs/2508.06530v1
Date: Sun, 03 Aug 2025 03:11:48 GMT
Status: translation complete
Last updated in system: 2025-08-12 21:23:28.413486
Title: What Makes "Good" Distractors for Object Hallucination Evaluation in Large Vision-Language Models?
Authors: Ming-Kun Xie, Jia-Hao Xiao, Gang Niu, Lei Feng, Zhiqiang Kou, Min-Ling Zhang, Masashi Sugiyama
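Abstract summary: This paper introduces the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, which aims to generate the most misleading distractors that trigger hallucination in Large Vision-Language Models. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across state-of-the-art LVLMs.
Reference score (independently computed attention measure): 95.46087552542998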
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Vision-Language Models (LVLMs), empowered by the success of Large Language Models (LLMs), have achieved impressive performance across domains. Despite these advances, LVLMs still suffer from object hallucination: they tend to generate objects inconsistent with the image content. The most commonly used benchmark, Polling-based Object Probing Evaluation (POPE), evaluates this issue by sampling negative categories according to category-level statistics, e.g., category frequencies and co-occurrence. However, with the continuous advancement of LVLMs, the POPE benchmark has shown diminishing effectiveness in assessing object hallucination, as it employs a simplistic sampling strategy that overlooks image-specific information and restricts distractors to negative object categories only. In this paper, we introduce the Hallucination searching-based Object Probing Evaluation (HOPE) benchmark, which aims to generate the most misleading distractors (i.e., non-existent objects or incorrect image descriptions) that can trigger hallucination in LVLMs, and thereby to assess their immunity to hallucination more rigorously. To exploit image-specific information, content-aware hallucination searching leverages Contrastive Language-Image Pre-Training (CLIP) to approximate the predictive behavior of LVLMs by selecting negative objects with the highest predicted likelihood as distractors. To expand the scope of hallucination assessment, description-based hallucination searching constructs highly misleading distractors by pairing true objects with false descriptions. Experimental results show that HOPE leads to a precision drop of at least 9% and up to 23% across various state-of-the-art LVLMs, significantly outperforming POPE in exposing hallucination vulnerabilities. The code is available at https://github.com/xiemk/HOPE.
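
The content-aware search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes image and category-text embeddings have already been produced by a CLIP-style encoder, and it uses cosine similarity as a stand-in for the "predicted likelihood" mentioned in the abstract. The function names `cosine` and `select_distractors` and the toy 3-dimensional vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def select_distractors(image_emb, category_embs, present, k=3):
    """Return the k absent categories whose embedding best matches the image.

    Assumption: a higher image-text similarity means the category is more
    likely to be hallucinated, so it makes a harder distractor.
    """
    scores = {
        name: cosine(image_emb, emb)
        for name, emb in category_embs.items()
        if name not in present  # only negative (absent) objects qualify
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy vectors standing in for CLIP embeddings (made up for illustration).
image_emb = [0.9, 0.1, 0.2]
category_embs = {
    "dog": [1.0, 0.0, 0.1],
    "cat": [0.8, 0.2, 0.3],
    "leash": [0.7, 0.3, 0.1],
    "airplane": [0.0, 1.0, 0.0],
    "boat": [0.1, 0.2, 1.0],
}
distractors = select_distractors(image_emb, category_embs, present={"dog"}, k=2)
print(distractors)  # ['cat', 'leash']
```

In this toy setup the image contains a dog, so "dog" is excluded and the two most image-similar absent categories are chosen. A frequency- or co-occurrence-based sampler of the kind the abstract attributes to POPE would ignore `image_emb` entirely.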

Related papers