Related papers: ShabbyPages: A Reproducible Document Denoising and Binarization Dataset

URL: http://arxiv.org/abs/2303.09339v2
Date: Fri, 17 Mar 2023 19:48:36 GMT
Title: ShabbyPages: A Reproducible Document Denoising and Binarization Dataset
Authors: Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, Jonathan Boarman
Abstract summary: ShabbyPages is a new document image dataset designed for training and benchmarking document denoisers and binarizers. In this paper, we discuss the creation process of ShabbyPages and demonstrate the utility of ShabbyPages by training convolutional denoisers which remove real noise features with a high degree of human-perceptible fidelity.
Score: 59.457999432618614
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Document denoising and binarization are fundamental problems in the document processing space, but current datasets are often too small and lack sufficient complexity to effectively train and benchmark modern data-driven machine learning models. To fill this gap, we introduce ShabbyPages, a new document image dataset designed for training and benchmarking document denoisers and binarizers. ShabbyPages contains over 6,000 clean "born digital" images with synthetically-noised counterparts ("shabby pages") that were augmented using the Augraphy document augmentation tool to appear as if they have been printed and faxed, photocopied, or otherwise altered through physical processes. In this paper, we discuss the creation process of ShabbyPages and demonstrate the utility of ShabbyPages by training convolutional denoisers which remove real noise features with a high degree of human-perceptible fidelity, establishing baseline performance for a new ShabbyPages benchmark.

Related papers

Context-Aware Classification of Legal Document Pages [7.306025535482021]
We present a simple but effective approach that overcomes the constraint on input length. Specifically, we enhance the input with extra tokens carrying sequential information about previous pages. Our experiments, conducted on two legal datasets in English and Portuguese respectively, show that the proposed approach can significantly improve the performance of document page classification.
arXiv Detail & Related papers (2023-04-05T23:14:58Z)
EraseNet: A Recurrent Residual Network for Supervised Document Cleaning [0.0]
This paper introduces a supervised approach for cleaning dirty documents using a new fully convolutional auto-encoder architecture. The experiments in this paper have shown promising results as the model is able to learn a variety of ordinary as well as unusual noises and rectify them efficiently.
arXiv Detail & Related papers (2022-10-03T04:23:25Z)
Augraphy: A Data Augmentation Library for Document Images [59.457999432618614]
Augraphy is a Python library for constructing data augmentation pipelines. It provides strategies to produce augmented versions of clean document images that appear to have been altered by standard office operations.
arXiv Detail & Related papers (2022-08-30T22:36:19Z)
Boosting Modern and Historical Handwritten Text Recognition with Deformable Convolutions [52.250269529057014]
Handwritten Text Recognition (HTR) in free-layout pages is a challenging image understanding task. We propose to adopt deformable convolutions, which can deform depending on the input at hand and better adapt to the geometric variations of the text.
arXiv Detail & Related papers (2022-08-17T06:55:54Z)
Fourier Document Restoration for Robust Document Dewarping and Recognition [73.44057202891011]
This paper presents FDRNet, a Fourier Document Restoration Network that can restore documents with different distortions. It dewarps documents by a flexible Thin-Plate Spline transformation which can handle various deformations effectively without requiring deformation annotations in training. It outperforms the state-of-the-art by large margins on both dewarping and text recognition tasks.
arXiv Detail & Related papers (2022-03-18T12:39:31Z)
DocScanner: Robust Document Image Rectification with Progressive Learning [162.03694280524084]
This work presents DocScanner, a new deep network architecture for document image rectification. DocScanner maintains a single estimate of the rectified image, which is progressively corrected with a recurrent architecture. The iterative refinements let DocScanner converge to robust, superior performance, and the lightweight recurrent architecture keeps it efficient at run time.
arXiv Detail & Related papers (2021-10-28T09:15:02Z)
Graph-based Deep Generative Modelling for Document Layout Generation [14.907063348987075]
We propose an automated deep generative model using Graph Neural Networks (GNNs) to generate synthetic data with highly variable and plausible document layouts. It is also the first graph-based approach to document layout generation evaluated on administrative document images.
arXiv Detail & Related papers (2021-07-09T10:49:49Z)
Multiple Document Datasets Pre-training Improves Text Line Detection With Deep Neural Networks [2.5352713493505785]
We introduce a fully convolutional network for the document layout analysis task. Our method, Doc-UFCN, relies on a U-shaped model trained from scratch for detecting objects in historical documents. We show that Doc-UFCN outperforms state-of-the-art methods on various datasets.
arXiv Detail & Related papers (2020-12-28T09:48:33Z)
Self-supervised Deep Reconstruction of Mixed Strip-shredded Text Documents [63.41717168981103]
This work extends our previous deep learning method for single-page reconstruction to a more realistic and complex scenario. In our approach, the compatibility evaluation is modeled as a two-class (valid or invalid) pattern recognition problem. The proposed method outperforms the competing ones on complex scenarios, achieving accuracy above 90%.
arXiv Detail & Related papers (2020-07-01T21:48:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.