SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
- URL: http://arxiv.org/abs/2408.14764v1
- Date: Tue, 27 Aug 2024 03:31:24 GMT
- Title: SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
- Authors: Chuanghao Ding, Xuejing Liu, Wei Tang, Juan Li, Xiaoliang Wang, Rui Zhao, Cam-Tu Nguyen, Fei Tan,
- Abstract summary: This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU)
Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset.
Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies.
- Score: 23.910783272007407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU) by generating high-quality, diverse datasets that include text, images, tables, and charts. Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset. Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies. The release of a benchmark dataset comprising 5,000 image-text pairs not only showcases the pipeline's capabilities but also provides a valuable resource for the VDU community to advance research and development in document image recognition. This work significantly contributes to the field by offering a scalable solution to data scarcity and by validating the efficacy of end-to-end models in parsing complex, real-world documents.
Related papers
- Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation [79.71072337496351]
CoSyn is a framework that creates synthetic text-rich multimodal data.
It can generate high-quality instruction-tuning data.
It can also produce synthetic pointing data, enabling vision-language models to ground information within input images.
arXiv Detail & Related papers (2025-02-20T18:55:30Z) - RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm [34.02250139766494]
A substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning.
We establish a Real-World Data Extraction pipeline to extract high-quality images and texts.
Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts.
We construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales.
arXiv Detail & Related papers (2025-02-18T03:58:38Z) - Enhancing Document AI Data Generation Through Graph-Based Synthetic Layouts [0.8245350546263803]
We propose a novel approach to synthetic document layout generation using Graph Neural Networks (GNNs)
By representing document elements as nodes in a graph, GNNs are trained to generate realistic and diverse document layouts.
Our experimental results show that graph-augmented document layouts outperform existing augmentation techniques.
arXiv Detail & Related papers (2024-11-27T21:15:02Z) - SciER: An Entity and Relation Extraction Dataset for Datasets, Methods, and Tasks in Scientific Documents [49.54155332262579]
We release a new entity and relation extraction dataset for entities related to datasets, methods, and tasks in scientific articles.
Our dataset contains 106 manually annotated full-text scientific publications with over 24k entities and 12k relations.
arXiv Detail & Related papers (2024-10-28T15:56:49Z) - DAViD: Domain Adaptive Visually-Rich Document Understanding with Synthetic Insights [8.139817615390147]
This paper introduces the Domain Adaptive Visually-rich Document Understanding (DAViD) framework.
DAViD integrates fine-grained and coarse-grained document representation learning and employs synthetic annotations to reduce the need for costly manual labelling.
arXiv Detail & Related papers (2024-10-02T14:47:55Z) - DELINE8K: A Synthetic Data Pipeline for the Semantic Segmentation of Historical Documents [0.0]
Document semantic segmentation can facilitate document analysis tasks, including OCR, form classification, and document editing.
Several synthetic datasets have been developed to distinguish handwriting from printed text, but they fall short in class variety and document diversity.
We propose the most comprehensive document semantic segmentation pipeline to date, incorporating preprinted text, handwriting, and document backgrounds from over 10 sources.
Our customized dataset exhibits superior performance on the NAFSS benchmark, demonstrating it as a promising tool in further research.
arXiv Detail & Related papers (2024-04-30T04:53:10Z) - Visually Guided Generative Text-Layout Pre-training for Document Intelligence [51.09853181377696]
We propose visually guided generative text-pre-training, named ViTLP.
Given a document image, the model optimize hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence.
ViTLP can function as a native OCR model to localize and recognize texts of document images.
arXiv Detail & Related papers (2024-03-25T08:00:43Z) - docExtractor: An off-the-shelf historical document element extraction [18.828438308738495]
We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents.
We demonstrate it provides high-quality performances as an off-the-shelf system across a wide variety of datasets.
We introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.
arXiv Detail & Related papers (2020-12-15T10:19:18Z) - Unsupervised Opinion Summarization with Content Planning [58.5308638148329]
We show that explicitly incorporating content planning in a summarization model yields output of higher quality.
We also create synthetic datasets which are more natural, resembling real world document-summary pairs.
Our approach outperforms competitive models in generating informative, coherent, and fluent summaries.
arXiv Detail & Related papers (2020-12-14T18:41:58Z) - Leveraging Graph to Improve Abstractive Multi-Document Summarization [50.62418656177642]
We develop a neural abstractive multi-document summarization (MDS) model which can leverage well-known graph representations of documents.
Our model utilizes graphs to encode documents in order to capture cross-document relations, which is crucial to summarizing long documents.
Our model can also take advantage of graphs to guide the summary generation process, which is beneficial for generating coherent and concise summaries.
arXiv Detail & Related papers (2020-05-20T13:39:47Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.