Fugu-MT Paper Translation (Abstract): When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

Paper overview: When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

arxiv url: http://arxiv.org/abs/2604.17375v1
Date: Sun, 19 Apr 2026 10:58:40 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.491535
Title: When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models
Title (reference translation): Text Hijacking Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision-Language Models
Authors: Cui Yakun, Xingqun Qi, TianTian Geng, Yuyao Zhang, Sirui Han, Yike Guo
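Abstract summary: When embedded on-screen text contradicts the visual scene, existing vision-language models (VLMs) systematically hallucinate. The paper proposes VisualTextTrap, the first comprehensive benchmark, which includes large-scale human-validated samples. It also proposes Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a Vision-Text Disentanglement framework.
Reference score (independently computed attention score): 38.20863524973651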
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in Vision-Language Models (VLMs) have substantially enhanced their ability across multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, we identify a critical yet overlooked issue: when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing overlay textual semantics over the actual visual content. We define this phenomenon as Text Overlay-Induced Hallucination (TOIH). In this work, we propose VisualTextTrap, the first comprehensive benchmark, including large-scale human-validated samples with specifically designed evaluation metrics. In particular, we construct VisualTextTrap from widely used public datasets using a scalable hybrid pipeline of VLM-assisted text generation and rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1--L5) that reflects the semantic contradiction between overlay text and visual reality. Moreover, we propose Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a novel Vision-Text Disentanglement framework that employs a dual-encoder architecture. Concretely, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. We develop an Adaptive Token Routing Strategy to enable dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments conducted on our VisualTextTrap benchmark verify the effectiveness of VTHM-MoE, which outperforms state-of-the-art counterparts across diverse video question answering tasks.
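Abstract (reference translation): Recent advances in Vision-Language Models (VLMs) have substantially improved their performance on multimodal video understanding benchmarks spanning temporal, action, object, and spatial understanding. However, when embedded on-screen text contradicts the visual scene, existing VLMs systematically hallucinate, prioritizing the semantics of the overlay text over the actual visual content. This phenomenon is defined as Text Overlay-Induced Hallucination (TOIH). This work proposes VisualTextTrap, the first comprehensive benchmark, which includes large-scale human-validated samples and specifically designed evaluation metrics. VisualTextTrap is constructed from widely used public datasets using a scalable hybrid pipeline that combines VLM-assisted text generation with rigorous manual verification. The benchmark features 6,057 samples annotated across 88 fine-grained attributes within four dimensions, with hallucination intensity quantified on a five-level scale (L1-L5) that reflects the semantic contradiction between overlay text and visual reality. The work further proposes Visual Text Hallucination Mitigation Mixture-of-Experts (VTHM-MoE), a Vision-Text Disentanglement framework that employs a dual-encoder architecture. Specifically, four dimension-specialized expert modules spanning Temporal, Action, Object, and Spatial reasoning are first pre-trained to identify and leverage cross-modal discrepancies between textual semantics and actual video content. An Adaptive Token Routing Strategy enables dynamic expert allocation, conferring robust resistance to TOIH while preserving performance on uncontaminated videos. Extensive experiments on the VisualTextTrap benchmark verify the effectiveness of VTHM-MoE.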
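
Illustrative note: the abstract does not say how the Adaptive Token Routing Strategy allocates tokens to the four experts. The sketch below is therefore not the authors' method. It shows only generic top-k softmax gating, a common Mixture-of-Experts routing scheme, applied to the four expert names from the abstract. The function names, the gate logits, and the `top_k` parameter are all assumptions made for illustration.

```python
import math

# Expert names come from the abstract; everything else is hypothetical.
EXPERTS = ["Temporal", "Action", "Object", "Spatial"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, top_k=2):
    """Pick the top_k experts for one token.

    gate_logits: one score per expert (assumed to come from a learned gate).
    Returns a list of (expert_name, weight) pairs whose weights sum to 1.
    """
    if len(gate_logits) != len(EXPERTS):
        raise ValueError("expected one gate logit per expert")
    if not 1 <= top_k <= len(EXPERTS):
        raise ValueError("top_k must be between 1 and the number of experts")
    probs = softmax(gate_logits)
    ranked = sorted(range(len(EXPERTS)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    kept = sum(probs[i] for i in chosen)
    # Renormalize so the selected experts' weights sum to 1.
    return [(EXPERTS[i], probs[i] / kept) for i in chosen]


def mix_expert_outputs(expert_outputs, routing):
    """Weighted sum of the selected experts' output vectors for one token."""
    dim = len(next(iter(expert_outputs.values())))
    mixed = [0.0] * dim
    for name, weight in routing:
        for j, value in enumerate(expert_outputs[name]):
            mixed[j] += weight * value
    return mixed
```

Under these assumptions, calling `route_token([2.0, 0.5, 1.0, -1.0], top_k=2)` selects the Temporal and Object experts, and `mix_expert_outputs` blends their outputs by the renormalized gate weights.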

Related papers