Fugu-MT Paper Translation (Abstract): VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

Paper summary: VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents

arxiv url: http://arxiv.org/abs/2603.15118v1
Date: Mon, 16 Mar 2026 11:15:56 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.089222
Title: VAREX: A Benchmark for Multi-Modal Structured Extraction from Documents
Title (reference translation): VAREX: A Benchmark for Multimodal Structured Extraction from Documents
Authors: Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels,
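Abstract summary: VAREX is a benchmark for evaluating structured data extraction from government forms. The benchmark comprises 1,777 documents with 1,771 unique schemas, and its ground truth is validated through three-phase quality assurance. Results show that below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck.
Reference score (independently computed attention score): 1.06378109904813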
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or both text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy -- a capability absent from prior benchmarks. We evaluate 20 models from frontier proprietary models to small open models, with particular attention to models <=4B parameters suitable for cost-sensitive and latency-constrained deployment. Results reveal that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) depresses scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields +81 pp gains, demonstrating that the instruction-following deficit is addressable without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark most effectively discriminates models in the 60-95% accuracy band. Dataset and evaluation code are publicly available.
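Abstract (reference translation): VAREX (VARied-schema EXtraction) is a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth that is validated through three-phase quality assurance. The benchmark consists of 1,777 documents with 1,771 unique schemas spanning three structural categories, each provided in four input modalities: plain text, layout-preserving text (whitespace-aligned to approximate column positions), document image, or text and image combined. Unlike existing benchmarks that evaluate from a single input representation, VAREX provides four controlled modalities per document, enabling systematic ablation of how input format affects extraction accuracy. We evaluate 20 models, from frontier proprietary models to small open models, with particular attention to models of <=4B parameters suited to cost-sensitive and latency-constrained deployment. The results show that (1) below 4B parameters, structured output compliance -- not extraction capability -- is a dominant bottleneck; in particular, schema echo (models producing schema-conforming structure instead of extracted values) lowers scores by 45-65 pp (percentage points) in affected models; (2) extraction-specific fine-tuning at 2B yields gains of +81 pp, showing that the instruction-following deficit can be addressed without scale; (3) layout-preserving text provides the largest accuracy gain (+3-18 pp), exceeding pixel-level visual cues; and (4) the benchmark discriminates most effectively between models in the 60-95% accuracy band. The dataset and evaluation code are publicly available.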
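
The abstract describes layout-preserving text only as "whitespace-aligned to approximate column positions" and does not say how VAREX produces it. As a rough illustration of the idea, the sketch below assumes each word arrives with page coordinates and places it on a fixed character grid. The function name `layout_preserving_text`, the `(x, y, text)` input format, and the `char_width` / `line_height` values are all hypothetical choices for this example, not taken from the paper or its code.

```python
from collections import defaultdict


def layout_preserving_text(words, char_width=6.0, line_height=12.0):
    """Render (x, y, text) words as whitespace-aligned plain text.

    words: iterable of (x, y, text), with x and y in page units measured
           from the top-left corner (an assumed input format).
    char_width, line_height: assumed size of one character cell.
    """
    # Group words into text rows by their vertical position.
    rows = defaultdict(list)
    for x, y, text in words:
        rows[round(y / line_height)].append((x, text))

    lines = []
    for row in sorted(rows):
        line = ""
        for x, text in sorted(rows[row]):
            col = round(x / char_width)
            # If the target column is already occupied, fall back to a
            # single separating space so words never run together.
            if line and col <= len(line):
                col = len(line) + 1
            line += " " * (col - len(line)) + text
        lines.append(line)
    return "\n".join(lines)


# Made-up form fields laid out in two columns.
words = [
    (0, 0, "Name:"), (60, 0, "Jane Doe"),
    (240, 0, "Date:"), (300, 0, "2026-03-16"),
    (0, 12, "City:"), (60, 12, "Springfield"),
    (240, 12, "ZIP:"), (300, 12, "00000"),
]
print(layout_preserving_text(words))
```

In this toy input the second-column labels ("Date:" and "ZIP:") start at the same character offset on both lines, which is the kind of column cue that plain reading-order text would lose. Empty rows are dropped here for brevity; a fuller version might keep blank lines to preserve vertical spacing.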

Related papers