Related papers: Diachronic Document Dataset for Semantic Layout Analysis

Diachronic Document Dataset for Semantic Layout Analysis

URL: http://arxiv.org/abs/2411.10068v1
Date: Fri, 15 Nov 2024 09:33:13 GMT
Title: Diachronic Document Dataset for Semantic Layout Analysis
Authors: Thibault Clérice, Juliette Janes, Hugo Scheithauer, Sarah Bénière, Florian Cafiero, Laurent Romary, Simon Gabay, Benoît Sagot,
Abstract summary: This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training.
Score: 9.145289299764991
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present a novel, open-access dataset designed for semantic layout analysis, built to support document recreation workflows through mapping with the Text Encoding Initiative (TEI) standard. This dataset includes 7,254 annotated pages spanning a large temporal range (1600-2024) of digitised and born-digital materials across diverse document types (magazines, papers from sciences and humanities, PhD theses, monographs, plays, administrative reports, etc.) sorted into modular subsets. By incorporating content from different periods and genres, it addresses varying layout complexities and historical changes in document structure. The modular design allows domain-specific configurations. We evaluate object detection models on this dataset, examining the impact of input size and subset-based training. Results show that a 1280-pixel input size for YOLO is optimal and that training on subsets generally benefits from incorporating them into a generic model rather than fine-tuning pre-trained weights.

Related papers

DREAM: Document Reconstruction via End-to-end Autoregressive Model [53.51754520966657]
We present an innovative autoregressive model specifically designed for document reconstruction, referred to as Document Reconstruction via End-to-end Autoregressive Model (DREAM)<n>We establish a standardized definition of the document reconstruction task, and introduce a novel Document Similarity Metric (DSM) and DocRec1K dataset for assessing the performance of the task.
arXiv Detail & Related papers (2025-07-08T09:24:07Z)
AnnoPage Dataset: Dataset of Non-Textual Elements in Documents with Fine-Grained Categorization [0.0]
The AnnoPage dataset is a collection of 7550 pages from historical documents, primarily in Czech and German, spanning from 1485 to the present. The dataset is designed to support research in document layout analysis and object detection.
arXiv Detail & Related papers (2025-03-28T15:30:42Z)
Spatial Information Integration in Small Language Models for Document Layout Generation and Classification [0.0]
Document layout understanding is a field of study that analyzes the spatial arrangement of information in a document to understand its structure and layout. While semi-structured data is common in everyday life (balance sheets, purchase orders, receipts), there is a lack of public datasets for training machine learning models for this type of document. We propose a method to generate new, synthetic, layout information that can help overcoming this data shortage.
arXiv Detail & Related papers (2025-01-09T17:20:00Z)
GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts [53.568057283934714]
We propose a VLM-based framework that generates content-aware text logo layouts. We introduce two model techniques to reduce the computation for processing multiple glyph images simultaneously. To support instruction-tuning of out model, we construct two extensive text logo datasets, which are 5x more larger than the existing public dataset.
arXiv Detail & Related papers (2024-11-18T10:04:10Z)
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction [23.47150047875133]
Document parsing is essential for converting unstructured and semi-structured documents into machine-readable data. Document parsing plays an indispensable role in both knowledge base construction and training data generation. This paper discusses the challenges faced by modular document parsing systems and vision-language models in handling complex layouts.
arXiv Detail & Related papers (2024-10-28T16:11:35Z)
Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model. A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances. Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
RanLayNet: A Dataset for Document Layout Detection used for Domain Adaptation and Generalization [36.973388673687815]
RanLayNet is a synthetic document dataset enriched with automatically assigned labels. We show that a deep layout identification model trained on our dataset exhibits enhanced performance compared to a model trained solely on actual documents.
arXiv Detail & Related papers (2024-04-15T07:50:15Z)
Enhancing Visually-Rich Document Understanding via Layout Structure Modeling [91.07963806829237]
We propose GraphLM, a novel document understanding model that injects layout knowledge into the model. We evaluate our model on various benchmarks, including FUNSD, XFUND and CORD, and achieve state-of-the-art results.
arXiv Detail & Related papers (2023-08-15T13:53:52Z)
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents. We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content. We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z)
Robust Text Line Detection in Historical Documents: Learning and Evaluation Methods [1.9938405188113029]
We present a study conducted using three state-of-the-art systems Doc-UFCN, dhSegment and ARU-Net. We show that it is possible to build generic models trained on a wide variety of historical document datasets that can correctly segment diverse unseen pages.
arXiv Detail & Related papers (2022-03-23T11:56:25Z)
A Survey of Historical Document Image Datasets [2.8707038627097226]
This paper presents a systematic literature review of image datasets for document image analysis. It focuses on historical documents, such as handwritten manuscripts and early prints. Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms.
arXiv Detail & Related papers (2022-03-16T09:56:48Z)
Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
WikiAsp: A Dataset for Multi-domain Aspect-based Summarization [69.13865812754058]
We propose WikiAsp, a large-scale dataset for multi-domain aspect-based summarization. Specifically, we build the dataset using Wikipedia articles from 20 different domains, using the section titles and boundaries of each article as a proxy for aspect annotation. Results highlight key challenges that existing summarization models face in this setting, such as proper pronoun handling of quoted sources and consistent explanation of time-sensitive events.
arXiv Detail & Related papers (2020-11-16T10:02:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.