Fugu-MT Paper Translation (Summary): Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

Paper summary: Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

arxiv url: http://arxiv.org/abs/2606.06242v1
Date: Thu, 04 Jun 2026 14:47:40 GMT
Status: Translation complete
System update date: 2026-06-05 22:39:44.867243
Title: Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents
Authors: AJ Carl P. Dy, Aivin V. Solatorio
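Abstract summary: The paper introduces a benchmark dataset and evaluation framework for data snapshot extraction. Multiple open-source layout detection models are benchmarked and evaluated on detection performance and spatial extraction quality. The findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction.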
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for "data snapshot extraction", the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.


Related papers