Fugu-MT paper translation (abstract): Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents

Paper overview: Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents

arxiv url: http://arxiv.org/abs/2511.11468v1
Date: Fri, 14 Nov 2025 16:41:10 GMT
Status: Translation complete
Last updated in system: 2025-11-17 22:42:18.72596
Title: Benchmarking Visual LLMs Resilience to Unanswerable Questions on Visually Rich Documents
Title (reference translation): Benchmarking the Resilience of Visual LLMs to Unanswerable Questions on Visually Rich Documents
Authors: Davide Napolitano, Luca Cagliero, Fabrizio Battiloro
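Abstract summary: This paper presents VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING). Our research examines the robustness of VLLMs to plausible yet unanswerable questions. Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.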
Reference score (independently calculated attention score): 7.765294037858163
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. Although VLLMs excel in Visual Question Answering (VQA) on multi-page VRDs, their ability to detect unanswerable questions is still an open research question. Our research delves into the robustness of the VLLMs to plausible yet unanswerable questions, i.e., questions that appear valid but cannot be answered due to subtle corruptions caused by swaps between related concepts or plausible question formulations. Corruptions are generated by replacing the original natural language entities with other ones of the same type, belonging to different document elements, and in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. Experiments, run on 12 models, analyze: (1) The VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) The effect of different types of corruption (NLP entity, document element, layout); (3) The effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.
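Abstract (reference translation): The evolution of Visual Large Language Models (VLLMs) has revolutionized the automatic understanding of Visually Rich Documents (VRDs), which contain both textual and visual elements. VLLMs excel at Visual Question Answering (VQA) on multi-page VRDs, but their ability to detect unanswerable questions remains an open research question. Our research examines the robustness of VLLMs to plausible yet unanswerable questions, that is, questions that appear valid but cannot be answered because of subtle corruptions caused by swaps between related concepts or by plausible question formulations. Corruptions are generated by replacing the original natural language entities with other entities of the same type that belong to different document elements and occur in different layout positions or pages of the related document. To this end, we present VRD-UQA (VISUALLY RICH DOCUMENT UNANSWERABLE QUESTION ANSWERING), a benchmark for evaluating VLLMs' resilience to plausible yet unanswerable questions across multiple dimensions. It automatically alters the questions of existing VQA datasets consisting of multi-page VRDs, verifies their unanswerability using a VLLM-as-a-judge approach, and then thoroughly evaluates VLLMs' performance. The experiments analyze: (1) the VLLMs' accuracy in detecting unanswerable questions at both page and document levels; (2) the effect of different types of corruption (NLP entity, document element, layout); (3) the effectiveness of different knowledge injection strategies based on in-context learning (OCR, multi-page selection, or the possibility of unanswerability). Our findings reveal VLLMs' limitations and demonstrate that VRD-UQA can serve as an evaluation framework for developing resilient document VQA systems.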
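
The entity-swap corruption described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Entity` record, the `corrupt_question` helper, and the example data are all hypothetical, and the rule used here (same entity type, different document element or page) is a simplified reading of the abstract. The sketch also does not verify that the result is truly unanswerable, a step the abstract assigns to a VLLM-as-a-judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical record for an entity extracted from a document."""
    text: str      # surface form as it appears in the document
    ent_type: str  # NLP entity type, e.g. "DATE" or "ORG"
    element: str   # document element, e.g. "table" or "paragraph"
    page: int      # 1-based page number


def corrupt_question(
    question: str, target: Entity, candidates: list[Entity]
) -> Optional[str]:
    """Swap `target` for a same-type entity found elsewhere in the document.

    Returns None when the target is absent from the question or when no
    suitable replacement exists.
    """
    if target.text not in question:
        return None
    for cand in candidates:
        same_type = cand.ent_type == target.ent_type
        different_text = cand.text != target.text
        # Assumption: "elsewhere" means a different element or a different page.
        different_place = (
            cand.element != target.element or cand.page != target.page
        )
        if same_type and different_text and different_place:
            return question.replace(target.text, cand.text, 1)
    return None


# Made-up example data, for illustration only.
entities = [
    Entity("2021", "DATE", "table", 3),
    Entity("2019", "DATE", "paragraph", 1),
    Entity("Acme Corp", "ORG", "header", 1),
]
question = "What was the total revenue reported in 2021?"
corrupted = corrupt_question(question, entities[0], entities)
print(corrupted)  # What was the total revenue reported in 2019?
```

The corrupted question still reads as plausible because the replacement has the same entity type, yet the table it implicitly refers to may hold no value for the swapped-in year.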

Related papers