Generating Synthetic Handwritten Historical Documents With OCR
Constrained GANs
- URL: http://arxiv.org/abs/2103.08236v1
- Date: Mon, 15 Mar 2021 09:39:17 GMT
- Title: Generating Synthetic Handwritten Historical Documents With OCR
Constrained GANs
- Authors: Lars Vögtlin, Manuel Drazyk, Vinaychandran Pondenkandath, Michele Alberti, Rolf Ingold
- Abstract summary: We present a framework to generate synthetic historical documents with precise ground truth using nothing more than a collection of unlabeled historical images.
We demonstrate a high-quality synthesis that makes it possible to generate large labeled historical document datasets with precise ground truth.
- Score: 2.3808546906079178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a framework to generate synthetic historical documents with
precise ground truth using nothing more than a collection of unlabeled
historical images. Obtaining large labeled datasets is often the limiting
factor to effectively use supervised deep learning methods for Document Image
Analysis (DIA). Prior approaches towards synthetic data generation either
require expertise or result in poor accuracy in the synthetic documents. To
achieve high precision transformations without requiring expertise, we tackle
the problem in two steps. First, we create template documents with
user-specified content and structure. Second, we transfer the style of a
collection of unlabeled historical images to these template documents while
preserving their text and layout. We evaluate the use of our synthetic
historical documents in a pre-training setting and find that we outperform the
baselines (randomly initialized and pre-trained). Additionally, with visual
examples, we demonstrate a high-quality synthesis that makes it possible to
generate large labeled historical document datasets with precise ground truth.
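The first of the two steps, creating a template document whose content and layout are known exactly, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `make_template`, the `LineLabel` record, the fixed character width, and the page dimensions are all hypothetical choices made for illustration. The sketch only shows why a template comes with precise ground truth for free: each text line is placed programmatically, so its transcription and bounding box are known before any style transfer takes place.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LineLabel:
    """Ground truth for one text line (hypothetical record, for illustration)."""
    text: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def make_template(lines: List[str], page_width: int = 800, margin: int = 40,
                  line_height: int = 32, char_width: int = 12) -> List[LineLabel]:
    """Lay out text lines top to bottom and record each line's text and box.

    Assumes a fixed-width font (char_width pixels per character); a real
    template renderer would measure glyphs instead.
    """
    labels = []
    max_chars = (page_width - 2 * margin) // char_width
    top = margin
    for text in lines:
        text = text[:max_chars]  # clip lines that would overflow the page
        right = margin + len(text) * char_width
        labels.append(LineLabel(text, (margin, top, right, top + line_height)))
        top += line_height
    return labels


labels = make_template(["In principio erat verbum", "et verbum erat apud Deum"])
for label in labels:
    print(label.bbox, label.text)
```

Because the second step is described as preserving text and layout, labels of this kind would remain valid for the stylized output, which is what makes the synthetic pages usable as labeled training data.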
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding [23.910783272007407]
This paper introduces SynthDoc, a novel synthetic document generation pipeline designed to enhance Visual Document Understanding (VDU).
Addressing the challenges of data acquisition and the limitations of existing datasets, SynthDoc leverages publicly available corpora and advanced rendering tools to create a comprehensive and versatile dataset.
Our experiments, conducted using the Donut model, demonstrate that models trained with SynthDoc's data achieve superior performance in pre-training read tasks and maintain robustness in downstream tasks, despite language inconsistencies.
arXiv Detail & Related papers (2024-08-27T03:31:24Z)
- Improving Text Embeddings with Large Language Models [59.930513259982725]
We introduce a novel and simple method for obtaining high-quality text embeddings using only synthetic data and less than 1k training steps.
We leverage proprietary LLMs to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
Experiments demonstrate that our method achieves strong performance on highly competitive text embedding benchmarks without using any labeled data.
arXiv Detail & Related papers (2023-12-31T02:13:18Z)
- PHD: Pixel-Based Language Modeling of Historical Documents [55.75201940642297]
We propose a novel method for generating synthetic scans to resemble real historical documents.
We pre-train our model, PHD, on a combination of synthetic scans and real historical newspapers from the 1700-1900 period.
We successfully apply our model to a historical QA task, highlighting its usefulness in this domain.
arXiv Detail & Related papers (2023-10-22T08:45:48Z)
- Synthetic Document Generator for Annotation-free Layout Recognition [15.657295650492948]
We describe a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of layout elements.
We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.
arXiv Detail & Related papers (2021-11-11T01:58:44Z)
- Synthesis in Style: Semantic Segmentation of Historical Documents using Synthetic Data [12.704529528199062]
We propose a novel method for the synthesis of training data for semantic segmentation of document images.
We utilize clusters found in intermediate features of a StyleGAN generator for the synthesis of RGB and label images.
Our model can be applied to any dataset of scanned documents without the need for manual annotation of individual images.
arXiv Detail & Related papers (2021-07-14T15:36:47Z)
- docExtractor: An off-the-shelf historical document element extraction [18.828438308738495]
We present docExtractor, a generic approach for extracting visual elements such as text lines or illustrations from historical documents.
We demonstrate it provides high-quality performances as an off-the-shelf system across a wide variety of datasets.
We introduce a new public dataset dubbed IlluHisDoc dedicated to the fine evaluation of illustration segmentation in historical documents.
arXiv Detail & Related papers (2020-12-15T10:19:18Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Self-supervised Deep Reconstruction of Mixed Strip-shredded Text Documents [63.41717168981103]
This work extends our previous deep learning method for single-page reconstruction to a more realistic/complex scenario.
In our approach, the compatibility evaluation is modeled as a two-class (valid or invalid) pattern recognition problem.
The proposed method outperforms the competing ones on complex scenarios, achieving accuracy superior to 90%.
arXiv Detail & Related papers (2020-07-01T21:48:05Z)
- SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
In detail, the input is a set of structured records and a reference text for describing another recordset.
The output is a summary that accurately describes the partial content in the source recordset with the same writing style of the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)