Related papers: Augraphy: A Data Augmentation Library for Document Images

URL: http://arxiv.org/abs/2208.14558v2
Date: Fri, 24 Mar 2023 21:49:21 GMT
Title: Augraphy: A Data Augmentation Library for Document Images
Authors: Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, Jonathan Boarman
Abstract summary: Augraphy is a Python library for constructing data augmentation pipelines. It provides strategies to produce augmented versions of clean document images that appear to have been altered by standard office operations.
Score: 59.457999432618614
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper introduces Augraphy, a Python library for constructing data augmentation pipelines which produce distortions commonly seen in real-world document image datasets. Augraphy stands apart from other data augmentation tools by providing many different strategies to produce augmented versions of clean document images that appear as if they have been altered by standard office operations, such as printing, scanning, and faxing through old or dirty machines, degradation of ink over time, and handwritten markings. This paper discusses the Augraphy tool, and shows how it can be used both as a data augmentation tool for producing diverse training data for tasks such as document denoising, and also for generating challenging test data to evaluate model robustness on document image modeling tasks.
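The idea of an augmentation pipeline described in the abstract can be sketched as a chain of small degradation steps. The following is a minimal illustration in plain Python and is not Augraphy's actual API: the names `fade_ink`, `speckle`, and `make_pipeline` are hypothetical, and the image is assumed to be a nested list of grayscale values.

```python
import random
from typing import Callable, List

# Assumption: a grayscale image is a list of rows, 0 = black ink, 255 = white paper.
Image = List[List[int]]
Augmentation = Callable[[Image, random.Random], Image]


def fade_ink(strength: float) -> Augmentation:
    """Hypothetical step: move every pixel toward white, loosely imitating faded ink."""
    def apply(image: Image, rng: random.Random) -> Image:
        return [[min(255, round(p + (255 - p) * strength)) for p in row]
                for row in image]
    return apply


def speckle(probability: float) -> Augmentation:
    """Hypothetical step: turn random pixels black, loosely imitating scanner dirt."""
    def apply(image: Image, rng: random.Random) -> Image:
        return [[0 if rng.random() < probability else p for p in row]
                for row in image]
    return apply


def make_pipeline(steps: List[Augmentation], seed: int = 0) -> Callable[[Image], Image]:
    """Chain the steps in order; re-seeding on each run keeps the output reproducible."""
    def run(image: Image) -> Image:
        rng = random.Random(seed)
        for step in steps:
            image = step(image, rng)
        return image
    return run


clean_page = [[255, 0, 255, 0], [0, 255, 0, 255]]
pipeline = make_pipeline([fade_ink(0.5), speckle(0.1)], seed=42)
noisy_page = pipeline(clean_page)  # a new image; clean_page is left unchanged
```

Pairing `clean_page` with `noisy_page` gives the kind of clean/degraded example a denoising model could train on, while running the same pipeline over a test set gives a harder evaluation set.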

Related papers

Leveraging Contrastive Learning for a Similarity-Guided Tampered Document Data Generation Pipeline [6.066442015301665]
We propose a novel method for generating high-quality tampered document images. We first train an auxiliary network to compare text crops, leveraging contrastive learning with a novel strategy for defining positive pairs and their corresponding negatives. Using a carefully designed generation pipeline, we introduce a framework capable of producing diverse, high-quality tampered document images.
arXiv Detail & Related papers (2026-02-19T12:39:38Z)
Application of deep learning approaches for medieval historical documents transcription [0.0]
This paper presents a deep learning method to extract text information from handwritten Latin-language documents of the 9th to 11th centuries. The approach takes into account the properties inherent in medieval documents. The implementation is published in a GitHub repository.
arXiv Detail & Related papers (2025-12-21T19:43:30Z)
Automatic Recognition of Learning Resource Category in a Digital Library [6.865460045260549]
We introduce the Heterogeneous Learning Resources (HLR) dataset designed for document image classification. The approach involves decomposing individual learning resources into constituent document images (sheets). These images are then processed through an OCR tool to extract a textual representation.
arXiv Detail & Related papers (2023-11-28T07:48:18Z)
Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models [0.9065034043031668]
We present a pipeline for image extraction from historical documents using foundation models. We evaluate text-image prompts and their effectiveness on humanities datasets of varying levels of complexity.
arXiv Detail & Related papers (2023-09-04T15:37:03Z)
DocMAE: Document Image Rectification via Self-supervised Representation Learning [144.44748607192147]
We present DocMAE, a novel self-supervised framework for document image rectification. We first mask random patches of the background-excluded document images and then reconstruct the missing pixels. With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents.
arXiv Detail & Related papers (2023-04-20T14:27:15Z)
ShabbyPages: A Reproducible Document Denoising and Binarization Dataset [59.457999432618614]
ShabbyPages is a new document image dataset designed for training and benchmarking document denoisers and binarizers. In this paper, we discuss the creation process of ShabbyPages and demonstrate the utility of ShabbyPages by training convolutional denoisers which remove real noise features with a high degree of human-perceptible fidelity.
arXiv Detail & Related papers (2023-03-16T14:19:50Z)
DiT: Self-supervised Pre-training for Document Image Transformer [85.78807512344463]
We propose DiT, a self-supervised pre-trained Document Image Transformer model. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results.
arXiv Detail & Related papers (2022-03-04T15:34:46Z)
Focused Attention Improves Document-Grounded Generation [111.42360617630669]
Document-grounded generation is the task of using the information provided in a document to improve text generation. This work focuses on two document-grounded generation tasks: Wikipedia update generation and dialogue response generation.
arXiv Detail & Related papers (2021-04-26T16:56:29Z)
Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks [2.5352713493505785]
We introduce a fully convolutional network for the document layout analysis task. Our method Doc-UFCN relies on a U-shaped model trained from scratch for detecting objects from historical documents. We show that Doc-UFCN outperforms state-of-the-art methods on various datasets.
arXiv Detail & Related papers (2020-12-28T09:48:33Z)
OCR Graph Features for Manipulation Detection in Documents [11.193867567895353]
We propose a model that leverages graph features using OCR (Optical Character Recognition). Our model relies on a data-driven approach to detect alterations by training a random forest classifier on the graph-based OCR features. We evaluate our algorithm's forgery detection performance on a dataset constructed from real business documents with slight forgery imperfections.
arXiv Detail & Related papers (2020-09-10T21:50:45Z)
From ImageNet to Image Classification: Contextualizing Progress on Benchmarks [99.19183528305598]
We study how specific design choices in the ImageNet creation process impact the fidelity of the resulting dataset. Our analysis pinpoints how a noisy data collection pipeline can lead to a systematic misalignment between the resulting benchmark and the real-world task it serves as a proxy for.
arXiv Detail & Related papers (2020-05-22T17:39:16Z)

This list is automatically generated from the titles and abstracts of the papers on this site.