Fugu-MT Paper Translation (Abstract): RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

Paper summary: RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

arxiv url: http://arxiv.org/abs/2606.07401v1
Date: Fri, 05 Jun 2026 15:41:34 GMT
Status: Translation complete
Last updated in system: 2026-06-08 14:33:29.832401
Title: RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents
Title (reference translation): RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents
Authors: Ameya Joshi, Joon Kim, Gus Eggert, Joseph Bajor, Cindy Hao, Jing Reyhan, Kushal Byatnal, Eli Badgio
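Abstract summary: Document parsing systems are increasingly deployed in regulated domains such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Most public benchmarks evaluate parsers on academic layouts or synthetic prose and report a single OCR or markdown-level similarity score. We introduce RealDocBench, a two-track benchmark built from real regulated documents.
Reference score (independently computed attention measure): 0.9003228139607131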
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
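Abstract (reference translation): Document parsing systems are increasingly deployed in regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. However, most public benchmarks evaluate parsers on clean academic layouts or synthetic prose and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents really need, namely the correct value of a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.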
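
The difference between the per-field and strict per-question accuracy mentioned in the abstract can be illustrated with a minimal sketch. This is not the released evaluation harness: the `normalize` helper, the function names, the field names, and the exact-match comparison after lowercasing are all assumptions made for illustration. The paper's typed `gold_dict` comparison may work differently.

```python
from typing import Dict, List, Tuple

def normalize(value: object) -> str:
    """Hypothetical normalization (assumption): stringify, trim, lowercase."""
    return str(value).strip().lower()

def score_question(gold: Dict[str, object],
                   pred: Dict[str, object]) -> Tuple[int, int, bool]:
    """Return (correct_fields, total_fields, strict_match) for one question."""
    correct = sum(
        1 for key, gold_value in gold.items()
        if key in pred and normalize(pred[key]) == normalize(gold_value)
    )
    total = len(gold)
    # Strict scoring: the question counts only if every gold field is right.
    return correct, total, correct == total

def score_track(pairs: List[Tuple[Dict[str, object], Dict[str, object]]]
                ) -> Dict[str, float]:
    """Aggregate per-field and strict per-question accuracy over all questions."""
    correct_fields = total_fields = strict_hits = 0
    for gold, pred in pairs:
        c, t, strict = score_question(gold, pred)
        correct_fields += c
        total_fields += t
        strict_hits += int(strict)
    return {
        "per_field_accuracy": correct_fields / total_fields if total_fields else 0.0,
        "strict_per_question_accuracy": strict_hits / len(pairs) if pairs else 0.0,
    }

# Made-up (gold, predicted) pairs; field names and values are illustrative only.
example = [
    ({"borrower_name": "Jane Doe", "loan_amount": "250000"},
     {"borrower_name": "jane doe", "loan_amount": "25000"}),
    ({"invoice_total": "99.50"}, {"invoice_total": "99.50"}),
]
scores = score_track(example)
print(scores)
```

In this toy input, two of three fields match but only one of two questions is fully correct, so the strict metric falls below the per-field metric. That gap is the kind of spread a single similarity score would not show.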


Related papers