BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction
- URL: http://arxiv.org/abs/2503.19658v1
- Date: Tue, 25 Mar 2025 13:46:55 GMT
- Title: BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction
- Authors: Jan Kohút, Martin Dočekal, Michal Hradiš, Marek Vaško,
- Abstract summary: BiblioPage is a dataset of scanned title pages annotated with structured metadata.<n>The dataset consists of approximately 2,000 title pages collected from 14 Czech libraries.<n>We valuated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Manual digitization of bibliographic metadata is time consuming and labor intensive, especially for historical and real-world archives with highly variable formatting across documents. Despite advances in machine learning, the absence of dedicated datasets for metadata extraction hinders automation. To address this gap, we introduce BiblioPage, a dataset of scanned title pages annotated with structured bibliographic metadata. The dataset consists of approximately 2,000 monograph title pages collected from 14 Czech libraries, spanning a wide range of publication periods, typographic styles, and layout structures. Each title page is annotated with 16 bibliographic attributes, including title, contributors, and publication metadata, along with precise positional information in the form of bounding boxes. To extract structured information from this dataset, we valuated object detection models such as YOLO and DETR combined with transformer-based OCR, achieving a maximum mAP of 52 and an F1 score of 59. Additionally, we assess the performance of various visual large language models, including LlamA 3.2-Vision and GPT-4o, with the best model reaching an F1 score of 67. BiblioPage serves as a real-world benchmark for bibliographic metadata extraction, contributing to document understanding, document question answering, and document information extraction. Dataset and evaluation scripts are availible at: https://github.com/DCGM/biblio-dataset
Related papers
- AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization [0.0]
The AnnoPage dataset is a collection of 7550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present.
The dataset is designed to support research in document layout analysis and object detection.
arXiv Detail & Related papers (2025-03-28T15:30:42Z) - TWIX: Automatically Reconstructing Structured Data from Templatized Documents [11.03654616939188]
Our tool TWIX predicts the underlying template used to create templatized documents.<n>TWIX achieves over 90% precision and recall on average, outperforming tools from industry.<n> TWIX scales easily to large datasets and is 734X faster and 5836X cheaper than vision-based LLMs for extracting data from a large document collection with 817 pages.
arXiv Detail & Related papers (2025-01-11T23:07:04Z) - CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation [51.2289822267563]
We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets.
We use large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents.
We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks.
arXiv Detail & Related papers (2024-09-03T17:54:40Z) - Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z) - OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text
Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z) - DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z) - Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z) - SDL: New data generation tools for full-level annotated document layout [0.0]
We present a novel data generation tool for document processing.
The tool focuses on providing a maximal level of visual information in a normal type document.
It also enables working with a large dataset on low-resource languages.
arXiv Detail & Related papers (2021-06-29T06:32:31Z) - AQuaMuSe: Automatically Generating Datasets for Query-Based
Multi-Document Summarization [17.098075160558576]
We propose a scalable approach called AQuaMuSe to automatically mine qMDS examples from question answering datasets and large document corpora.
We publicly release a specific instance of an AQuaMuSe dataset with 5,519 query-based summaries, each associated with an average of 6 input documents selected from an index of 355M documents from Common Crawl.
arXiv Detail & Related papers (2020-10-23T22:38:18Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z) - A Large-Scale Multi-Document Summarization Dataset from the Wikipedia
Current Events Portal [10.553314461761968]
Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries.
This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters.
arXiv Detail & Related papers (2020-05-20T14:33:33Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z) - A Large Dataset of Historical Japanese Documents with Complex Layouts [5.343406649012619]
HJDataset is a large dataset of historical Japanese documents with complex layouts.
It contains over 250,000 layout element annotations seven types.
A semi-rule based method is developed to extract the layout elements, and the results are checked by human inspectors.
arXiv Detail & Related papers (2020-04-18T18:38:25Z) - Learning to Select Bi-Aspect Information for Document-Scale Text Content
Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.