SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
- URL: http://arxiv.org/abs/2408.14764v1
- Date: Tue, 27 Aug 2024 03:31:24 GMT
- Title: SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
- Authors: Chuanghao Ding, Xuejing Liu, Wei Tang, Juan Li, Xiaoliang Wang, Rui Zhao, Cam-Tu Nguyen, Fei Tan
- Abstract summary: This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU).
Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset.
Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies.
- Score: 23.910783272007407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) by generating high-quality, diverse datasets that include text, images, tables, and charts. Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies. The release of a benchmark dataset comprising 5,000 image-text pairs not only showcases the pipeline's capabilities but also provides a valuable resource for the VDU community to advance research and development in document image recognition. This work significantly contributes to the field by offering a scalable solution to data scarcity and by validating the efficacy of end-to-end models in parsing complex, real-world documents.
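As a rough sketch of the kind of image-text pair such a rendering pipeline produces, the snippet below draws corpus text onto a blank page with Pillow and stores the rendered text as the reading ground truth. The font file, page geometry, and sample lines are placeholders for illustration only; this is not the released SynthDoc code, which additionally covers tables, charts, and embedded images.

```python
# Minimal sketch of synthetic document rendering: draw corpus text onto a
# blank page and keep the rendered text as the ground-truth reading target.
# The font path and sample lines are placeholders, not SynthDoc artifacts.
import json
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_page(lines, out_stem, font_path="NotoSansCJK-Regular.ttc",
                page_size=(1240, 1754), margin=80, font_size=28):
    """Render text lines onto a blank page and save an image-text pair."""
    image = Image.new("RGB", page_size, "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, font_size)  # placeholder font file

    y = margin
    rendered = []
    for line in lines:
        # Wrap long lines so they stay inside the page margins.
        for chunk in textwrap.wrap(line, width=60) or [""]:
            draw.text((margin, y), chunk, fill="black", font=font)
            rendered.append(chunk)
            y += int(font_size * 1.5)

    image.save(f"{out_stem}.png")
    # Ground truth: the reading-order text the model should reproduce.
    with open(f"{out_stem}.json", "w", encoding="utf-8") as f:
        json.dump({"text": "\n".join(rendered)}, f, ensure_ascii=False)

if __name__ == "__main__":
    corpus_lines = [
        "SynthDoc renders publicly available corpora into document images.",
        "双语文档合成可以缓解视觉文档理解的数据稀缺问题。",
    ]
    render_page(corpus_lines, "sample_doc_0000")
```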
Related papers
- SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z)
- DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights [8.139817615390147]
This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework.
DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling.
arXiv Detail & Related papers (2024-10-02T14:47:55Z)
- DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing.
Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity.
We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources.
Models trained on our customized dataset achieve superior performance on the NAFSS benchmark, demonstrating its promise as a tool for further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z)
- Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-layout pre-training, named ViTLP.
Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
- Synthesis in Style: Semantic Segmentation of Historical Documents using Synthetic Data [12.704529528199062]
We propose a novel method for the synthesis of training data for semantic segmentation of document images.
We utilize clusters found in intermediate features of a StyleGAN generator for the synthesis of RGB and label images.
Our model can be applied to any dataset of scanned documents without the need for manual annotation of individual images.
arXiv Detail & Related papers (2021-07-14T15:36:47Z)
- DocSynth: A Layout Guided Approach for Controllable Document Image Synthesis [16.284895792639137]
This paper presents a novel approach, called DocSynth, to automatically synthesize document images based on a given layout.
In this work, given a spatial layout (bounding boxes with object categories) as a reference by the user, our proposed DocSynth model learns to generate a set of realistic document images.
The results highlight that our model can successfully generate realistic and diverse document images with multiple objects.
arXiv Detail & Related papers (2021-07-06T14:24:30Z)
- docExtractor: An off-the-shelf historical document element extraction [18.828438308738495]
We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents.
We demonstrate that it provides high-quality performance as an off-the-shelf system across a wide variety of datasets.
We introduce a new public dataset dubbed IlluHisDoc dedicated to the fine-grained evaluation of illustration segmentation in historical documents.
arXiv Detail & Related papers (2020-12-15T10:19:18Z)
- Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality.
We also create synthetic datasets which are more natural, resembling real world document-summary pairs.
Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z)
- Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)