Augraphy: A Data Augmentation Library for Document Images
- URL: http://arxiv.org/abs/2208.14558v2
- Date: Fri, 24 Mar 2023 21:49:21 GMT
- Title: Augraphy: A Data Augmentation Library for Document Images
- Authors: Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, Jonathan Boarman
- Abstract summary: Augraphy is a Python library for constructing data augmentation pipelines.
It provides strategies to produce augmented versions of clean document images that appear to have been altered by standard office operations.
- Score: 59.457999432618614
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces Augraphy, a Python library for constructing data
augmentation pipelines which produce distortions commonly seen in real-world
document image datasets. Augraphy stands apart from other data augmentation
tools by providing many different strategies to produce augmented versions of
clean document images that appear as if they have been altered by standard
office operations, such as printing, scanning, and faxing through old or dirty
machines, degradation of ink over time, and handwritten markings. This paper
discusses the Augraphy tool, and shows how it can be used both as a data
augmentation tool for producing diverse training data for tasks such as
document denoising, and also for generating challenging test data to evaluate
model robustness on document image modeling tasks.
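The idea of an augmentation pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch. This is not Augraphy's API: the names `fade_ink`, `speckle`, `scanner_streak`, and `make_pipeline` are hypothetical and chosen only for illustration, and the image is a plain list of grayscale rows rather than a real document scan. The sketch assumes that each augmentation is a function from image to image and that a pipeline is their seeded composition.

```python
import random
from typing import Callable, List

# Grayscale image as rows of pixels: 0 = black ink, 255 = white paper.
Image = List[List[int]]
Augmentation = Callable[[Image, random.Random], Image]


def fade_ink(image: Image, rng: random.Random) -> Image:
    """Lighten dark pixels to mimic ink degrading over time."""
    lift = rng.randint(30, 90)
    return [[min(255, p + lift) if p < 128 else p for p in row] for row in image]


def speckle(image: Image, rng: random.Random) -> Image:
    """Turn a small fraction of pixels black, like dust on a scanner bed."""
    return [[0 if rng.random() < 0.02 else p for p in row] for row in image]


def scanner_streak(image: Image, rng: random.Random) -> Image:
    """Darken one random column, like a streak from a dirty scanner or fax."""
    if not image or not image[0]:
        return image
    col = rng.randrange(len(image[0]))
    return [
        [max(0, p - 60) if x == col else p for x, p in enumerate(row)]
        for row in image
    ]


def make_pipeline(steps: List[Augmentation], seed: int) -> Callable[[Image], Image]:
    """Compose augmentations into one function; a fixed seed keeps runs reproducible."""
    def run(image: Image) -> Image:
        rng = random.Random(seed)
        out = image
        for step in steps:
            out = step(out, rng)  # each step returns a new image
        return out
    return run


# A clean 8x8 "page": white paper with one horizontal line of ink.
clean = [[0 if y == 3 else 255 for _ in range(8)] for y in range(8)]
pipeline = make_pipeline([fade_ink, speckle, scanner_streak], seed=0)
noisy = pipeline(clean)
```

Under these assumptions, `(noisy, clean)` forms one input/target pair of the kind a denoising model could be trained on, and running the same pipeline over a clean test set would yield a degraded copy for checking robustness.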
Related papers
- Automatic Recognition of Learning Resource Category in a Digital Library [6.865460045260549]
We introduce the Heterogeneous Learning Resources (HLR) dataset designed for document image classification.
The approach involves decomposing individual learning resources into constituent document images (sheets).
These images are then processed through an OCR tool to extract a textual representation.
arXiv Detail & Related papers (2023-11-28T07:48:18Z)
- Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models [0.9065034043031668]
We present a pipeline for image extraction from historical documents using foundation models.
We evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity.
arXiv Detail & Related papers (2023-09-04T15:37:03Z)
- DocMAE: Document Image Rectification via Self-supervised Representation Learning [144.44748607192147]
We present DocMAE, a novel self-supervised framework for document image rectification.
We first mask random patches of the background-excluded document images and then reconstruct the missing pixels.
With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents.
arXiv Detail & Related papers (2023-04-20T14:27:15Z)
- ShabbyPages: A Reproducible Document Denoising and Binarization Dataset [59.457999432618614]
ShabbyPages is a new document image dataset designed for training and benchmarking document denoisers and binarizers.
In this paper, we discuss the creation process of ShabbyPages and demonstrate the utility of ShabbyPages by training convolutional denoisers which remove real noise features with a high degree of human-perceptible fidelity.
arXiv Detail & Related papers (2023-03-16T14:19:50Z)
- DiT: Self-supervised Pre-training for Document Image Transformer [85.78807512344463]
We propose DiT, a self-supervised pre-trained Document Image Transformer model.
We leverage DiT as the backbone network in a variety of vision-based Document AI tasks.
Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-03-04T15:34:46Z)
- Focused Attention Improves Document-Grounded Generation [111.42360617630669]
Document grounded generation is the task of using the information provided in a document to improve text generation.
This work focuses on two different document grounded generation tasks: Wikipedia Update Generation task and Dialogue response generation.
arXiv Detail & Related papers (2021-04-26T16:56:29Z)
- Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks [2.5352713493505785]
We introduce a fully convolutional network for the document layout analysis task.
Our method Doc-UFCN relies on a U-shaped model trained from scratch for detecting objects from historical documents.
We show that Doc-UFCN outperforms state-of-the-art methods on various datasets.
arXiv Detail & Related papers (2020-12-28T09:48:33Z)
- OCR Graph Features for Manipulation Detection in Documents [11.193867567895353]
We propose a model that leverages graph features using OCR (Optical Character Recognition).
Our model relies on a data-driven approach to detect alterations by training a random forest classifier on the graph-based OCR features.
We evaluate our algorithm's forgery detection performance on a dataset constructed from real business documents with slight forgery imperfections.
arXiv Detail & Related papers (2020-09-10T21:50:45Z)
- From ImageNet to Image Classification: Contextualizing Progress on Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset.
Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)