Fugu-MT Paper Translation (Abstract): CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

Paper overview: CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

arxiv url: http://arxiv.org/abs/2605.12882v1
Date: Wed, 13 May 2026 01:54:42 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.755229
Title: CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
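Title (reference translation): CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
Authors: Dongsheng Ma, Jiayu Li, Zhengren Wang, Yijie Wang, Jiahao Kong, Weijun Zeng, Jutao Xiao, Jie Yang, Wentao Zhang, Bin Wang, Conghui He
Abstract summary: CiteVQA is a benchmark that requires element-level bounding-box citations to be returned alongside each answer. It comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. Strict Attributed Accuracy (SAA) credits a prediction only when both the answer and the cited region are correct.
Reference score (independently computed attention score): 35.06938031991452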
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, yet current Doc-VQA evaluations score only the final answer and leave the supporting evidence unchecked. This answer-only approach masks a critical failure mode: a model can land on the correct answer while grounding it in the wrong passage -- a critical risk in high-stakes domains like law, finance, and medicine, where every conclusion must be traceable to a specific source region. To address this, we introduce CiteVQA, a benchmark that requires models to return element-level bounding-box citations alongside each answer, evaluating both jointly. CiteVQA comprises 1,897 questions across 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline, which identifies crucial evidence via masking ablation, and are subsequently validated through expert review. At the core of our evaluation is Strict Attributed Accuracy (SAA), which credits a prediction only when the answer and the cited region are both correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models frequently produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of only 76.0, and the strongest open-source MLLM reaches just 22.5. Ultimately, towards trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluations overlook, providing the instrumentation needed to close it. Our repository is available at https://github.com/opendatalab/CiteVQA.
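Abstract (reference translation): Multimodal Large Language Models (MLLMs) have significantly advanced document understanding, but current Doc-VQA evaluations score only the final answer and leave the supporting evidence unverified. This answer-only approach hides a critical failure mode: a model can arrive at the correct answer while grounding it in the wrong passage, a serious risk in high-stakes domains such as law, finance, and medicine, where every conclusion must be traceable to a specific source region. We therefore propose CiteVQA, a benchmark that requires models to return element-level bounding-box citations with each answer and evaluates the two jointly. CiteVQA comprises 1,897 questions over 711 PDFs spanning seven domains and two languages, averaging 40.6 pages per document. To ensure fidelity and scalability, the ground-truth citations are generated by an automated pipeline that identifies crucial evidence through masking ablation, and are then validated by expert review. Strict Attributed Accuracy (SAA) credits a prediction only when both the answer and the cited region are correct. Auditing 20 MLLMs reveals a pervasive Attribution Hallucination: models often produce the right answer while citing the wrong region. The strongest system (Gemini-3.1-Pro-Preview) achieves an SAA of 76.0, and the strongest open-source MLLM reaches 22.5. Ultimately, toward trustworthy document intelligence, CiteVQA exposes a reliability gap that answer-only evaluation overlooks and provides the instrumentation needed to close it. The repository is available at https://github.com/opendatalab/CiteVQA.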
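
The SAA rule described in the abstract (credit only when the answer and the cited region are both correct) can be illustrated with a minimal sketch. This is not the benchmark's official scorer: the `Prediction` record, the same-page IoU matching, and the 0.5 threshold are all assumptions made for illustration, since the abstract does not state how a cited region is judged correct.

```python
from dataclasses import dataclass

# (x0, y0, x1, y1) in page coordinates; the layout is an assumption.
Box = tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Prediction:
    """Hypothetical per-question record; field names are illustrative."""
    answer_correct: bool            # judged separately, e.g. by exact match
    cited: list[tuple[int, Box]]    # (page, box) pairs returned by the model
    gold: list[tuple[int, Box]]     # (page, box) ground-truth evidence


def citation_correct(pred: Prediction, threshold: float = 0.5) -> bool:
    """Assumed rule: every gold region is covered by some cited box
    on the same page with IoU >= threshold."""
    if not pred.cited:
        return False
    return all(
        any(page == gold_page and iou(box, gold_box) >= threshold
            for page, box in pred.cited)
        for gold_page, gold_box in pred.gold
    )


def strict_attributed_accuracy(preds: list[Prediction],
                               threshold: float = 0.5) -> float:
    """Percentage of predictions with a correct answer AND a correct citation."""
    if not preds:
        return 0.0
    hits = sum(1 for p in preds
               if p.answer_correct and citation_correct(p, threshold))
    return 100.0 * hits / len(preds)


region = (0.0, 0.0, 10.0, 10.0)
preds = [
    Prediction(True, [(3, region)], [(3, region)]),   # answer and citation right
    Prediction(True, [(5, region)], [(3, region)]),   # right answer, wrong page
    Prediction(False, [(3, region)], [(3, region)]),  # wrong answer
    Prediction(True, [(3, region)], [(3, region)]),   # answer and citation right
]
answer_only = 100.0 * sum(p.answer_correct for p in preds) / len(preds)
saa = strict_attributed_accuracy(preds)
```

In this toy set, answer-only accuracy counts three of four predictions while SAA counts two, because the second prediction cites a region on the wrong page. That gap is the kind of Attribution Hallucination the abstract describes.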