SIMARA: a database for key-value information extraction from full pages
- URL: http://arxiv.org/abs/2304.13606v1
- Date: Wed, 26 Apr 2023 15:00:04 GMT
- Title: SIMARA: a database for key-value information extraction from full pages
- Authors: Solène Tarride, Mélodie Boillet, Jean-François Moufflet and Christopher Kermorvant
- Abstract summary: We propose a new database for information extraction from historical handwritten documents.
The corpus includes 5,393 finding aids from six different series, dating from the 18th-20th centuries.
Finding aids are handwritten documents that contain metadata describing older archives.
- Score: 0.1835211348413763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new database for information extraction from historical
handwritten documents. The corpus includes 5,393 finding aids from six
different series, dating from the 18th-20th centuries. Finding aids are
handwritten documents that contain metadata describing older archives. They are
stored in the National Archives of France and are used by archivists to
identify and find archival documents. Each document is annotated at page-level,
and contains seven fields to retrieve. The location of each field is not
provided, so this dataset encourages research on
segmentation-free systems for information extraction. We propose a model based
on the Transformer architecture trained for end-to-end information extraction
and provide three sets for training, validation and testing, to ensure fair
comparison with future works. The database is freely accessible at
https://zenodo.org/record/7868059.
Related papers
- CRAWLDoc: A Dataset for Robust Ranking of Bibliographic Documents [3.3916160303055563]
CRAWLDoc is a new method for contextual ranking of linked web documents. It retrieves the landing page and all linked web resources, including PDFs, profiles, and supplementary materials. It embeds these resources, along with anchor texts and the URLs, into a unified representation.
arXiv Detail & Related papers (2025-06-04T10:52:55Z) - Unifying Multimodal Retrieval via Document Screenshot Embedding [92.03571344075607]
Document Screenshot Embedding (DSE) is a novel retrieval paradigm that regards document screenshots as a unified input format.
We first craft Wiki-SS, a corpus of 1.3M Wikipedia web page screenshots, to answer the questions from the Natural Questions dataset.
In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing.
arXiv Detail & Related papers (2024-06-17T06:27:35Z) - PDFTriage: Question Answering over Long, Structured Documents [60.96667912964659]
Representing structured documents as plain text is incongruous with the user's mental model of these documents with rich structure.
We propose PDFTriage that enables models to retrieve the context based on either structure or content.
Our benchmark dataset consists of 900+ human-generated questions over 80 structured documents.
arXiv Detail & Related papers (2023-09-16T04:29:05Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - Document Layout Annotation: Database and Benchmark in the Domain of
Public Affairs [62.38140271294419]
We propose a procedure to semi-automatically annotate digital documents with different layout labels.
We collect a novel database for DLA in the public affairs domain using a set of 24 data sources from the Spanish Administration.
The results of our experiments validate the proposed text labeling procedure with accuracy up to 99%.
arXiv Detail & Related papers (2023-06-12T08:21:50Z) - DAPR: A Benchmark on Document-Aware Passage Retrieval [57.45793782107218]
We propose and name this task Document-Aware Passage Retrieval (DAPR).
While analyzing the errors of the State-of-The-Art (SoTA) passage retrievers, we find the major errors (53.5%) are due to missing document context.
Our created benchmark enables future research on developing and comparing retrieval systems for the new task.
arXiv Detail & Related papers (2023-05-23T10:39:57Z) - CED: Catalog Extraction from Documents [12.037861186708799]
We propose a transition-based framework for parsing documents into catalog trees.
We believe the CED task could fill the gap between raw text segments and information extraction tasks on extremely long documents.
arXiv Detail & Related papers (2023-04-28T07:32:00Z) - Layout-Aware Information Extraction for Document-Grounded Dialogue:
Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z) - Combining Deep Learning and Reasoning for Address Detection in
Unstructured Text Documents [0.0]
We propose a hybrid approach that combines deep learning with reasoning for finding and extracting addresses from unstructured text documents.
We use a visual deep learning model to detect the boundaries of possible address regions on the scanned document images.
arXiv Detail & Related papers (2022-02-07T12:32:00Z) - Identifying Documents In-Scope of a Collection from Web Archives [37.34941845795198]
We study both machine learning and deep learning models and "bag of words" (BoW) features extracted from the entire document or from specific portions of the document.
We focus our evaluation on three datasets that we created from three different Web archives.
Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
arXiv Detail & Related papers (2020-09-02T16:22:23Z) - A Large Dataset of Historical Japanese Documents with Complex Layouts [5.343406649012619]
HJDataset is a large dataset of historical Japanese documents with complex layouts.
It contains over 250,000 layout element annotations of seven types.
A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors.
arXiv Detail & Related papers (2020-04-18T18:38:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.