DocEnTr: An End-to-End Document Image Enhancement Transformer
- URL: http://arxiv.org/abs/2201.10252v1
- Date: Tue, 25 Jan 2022 11:45:35 GMT
- Title: DocEnTr: An End-to-End Document Image Enhancement Transformer
- Authors: Mohamed Ali Souibgui, Sanket Biswas, Sana Khamekhem Jemni, Yousri Kessentini, Alicia Fornés, Josep Lladós, Umapada Pal
- Abstract summary: Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties.
We present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images.
- Score: 13.108797370734893
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document images can be affected by many degradation scenarios, which cause
recognition and processing difficulties. In this age of digitization, it is
important to denoise them for proper usage. To address this challenge, we
present a new encoder-decoder architecture based on vision transformers to
enhance both machine-printed and handwritten document images, in an end-to-end
fashion. The encoder operates directly on the pixel patches with their
positional information without the use of any convolutional layers, while the
decoder reconstructs a clean image from the encoded patches. Conducted
experiments show the superiority of the proposed model compared to
state-of-the-art methods on several DIBCO benchmarks. Code and models will be
publicly available at: https://github.com/dali92002/DocEnTR.
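To make the described architecture concrete, here is a minimal PyTorch sketch of a convolution-free encoder-decoder that operates directly on flattened pixel patches with learned positional embeddings, as the abstract describes. All sizes, depths, and the single-channel (grayscale) assumption are illustrative choices, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class ViTEnhancer(nn.Module):
    """Convolution-free encoder-decoder over pixel patches (illustrative sizes)."""

    def __init__(self, img_size=256, patch=16, dim=512, depth=4, heads=8):
        super().__init__()
        self.patch, self.grid = patch, img_size // patch
        patch_dim = patch * patch                      # grayscale pixels per patch
        self.embed = nn.Linear(patch_dim, dim)         # linear patch embedding
        self.pos = nn.Parameter(torch.zeros(1, self.grid ** 2, dim))  # positions
        enc = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        dec = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, depth)
        self.decoder = nn.TransformerEncoder(dec, depth)
        self.to_pixels = nn.Linear(dim, patch_dim)     # project back to pixels

    def forward(self, x):                              # x: (B, 1, H, W), degraded
        b, p, g = x.size(0), self.patch, self.grid
        # Split into non-overlapping pixel patches and flatten each patch.
        patches = x.unfold(2, p, p).unfold(3, p, p).reshape(b, g * g, p * p)
        z = self.encoder(self.embed(patches) + self.pos)  # encode with position
        out = self.to_pixels(self.decoder(z))             # reconstruct patches
        out = out.reshape(b, 1, g, g, p, p).permute(0, 1, 2, 4, 3, 5)
        return torch.sigmoid(out.reshape(b, 1, g * p, g * p))  # clean image


# Usage: enhance a batch of degraded 256x256 grayscale document images.
model = ViTEnhancer()
clean = model(torch.rand(2, 1, 256, 256))              # -> (2, 1, 256, 256)
```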
Related papers
- ε-VAE: Denoising as Visual Decoding [61.29255979767292]
In generative modeling, tokenization simplifies complex data into compact, structured representations, creating a more efficient, learnable space.
Current visual tokenization methods rely on a traditional autoencoder framework, where the encoder compresses data into latent representations, and the decoder reconstructs the original input.
We propose denoising as decoding, shifting from single-step reconstruction to iterative refinement. Specifically, we replace the decoder with a diffusion process that iteratively refines noise to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction quality (rFID) and generation quality.
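As a rough illustration of "denoising as decoding", the toy sketch below replaces a feed-forward decoder with an iterative refinement loop that starts from noise and repeatedly denoises, conditioned on the encoder latent z. It is a drastically simplified stand-in (no proper diffusion noise schedule or training objective), and every dimension and name here is an assumption, not the paper's model.

```python
import torch
import torch.nn as nn


class DenoisingDecoder(nn.Module):
    """Toy 'denoising as decoding': refine noise into pixels, guided by latent z."""

    def __init__(self, img_dim=784, latent_dim=64, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(img_dim + latent_dim + 1, hidden), nn.GELU(),
            nn.Linear(hidden, img_dim),
        )

    def denoise(self, x_t, z, t):
        # Predict a cleaner image from the current estimate, the encoder latent
        # z, and a scalar timestep (a stand-in for a real timestep embedding).
        t_col = torch.full((x_t.size(0), 1), t)
        return self.net(torch.cat([x_t, z, t_col], dim=-1))

    @torch.no_grad()
    def decode(self, z, steps=10):
        x = torch.randn(z.size(0), 784)                # start from pure noise
        for i in reversed(range(steps)):
            x = self.denoise(x, z, i / steps)          # iterative refinement
            x = x + 0.1 * (i / steps) * torch.randn_like(x)  # crude re-noising
        return x


# Usage: decode images from latents produced by some encoder (random here).
decoder = DenoisingDecoder()
imgs = decoder.decode(torch.randn(4, 64))              # -> (4, 784)
```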
arXiv Detail & Related papers (2024-10-05T08:27:53Z)
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- HybridFlow: Infusing Continuity into Masked Codebook for Extreme Low-Bitrate Image Compression [51.04820313355164]
HybridFlow combines continuous-feature-based and codebook-based streams to achieve both high perceptual quality and high fidelity at extremely low bitrates.
Experimental results demonstrate superior performance across several datasets at extremely low bitrates.
arXiv Detail & Related papers (2024-04-20T13:19:08Z)
- A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement [13.27528507177775]
We propose T2T-BinFormer, a novel document binarization encoder-decoder architecture based on a tokens-to-token vision transformer.
Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
arXiv Detail & Related papers (2023-12-06T23:01:11Z)
- DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization [17.087982099845156]
Document binarization is a fundamental and crucial step for achieving optimal performance in any document analysis task.
We propose DocBinFormer, a novel two-level vision transformer (TL-ViT) architecture for effective document image binarization.
arXiv Detail & Related papers (2023-12-06T16:01:29Z)
- Deep Unrestricted Document Image Rectification [110.61517455253308]
We present DocTr++, a novel unified framework for document image rectification.
We upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing.
We contribute a real-world test set and metrics applicable for evaluating the rectification quality.
arXiv Detail & Related papers (2023-04-18T08:00:54Z)
- StegaPos: Preventing Crops and Splices with Imperceptible Positional Encodings [0.0]
We present a model for distinguishing images that are authentic copies of ones published by photographers from images that have been cropped or spliced.
The model comprises an encoder that resides with the photographer and a matching decoder that is available to observers.
We find that training the encoder and decoder together produces a model that imperceptibly encodes position.
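A toy sketch of the idea, assuming nothing beyond the summary above: an encoder stamps a small, bounded, learned spatially-varying residual onto the image, and a decoder is trained jointly to recover where a crop came from using only the stamped pixels. The crop size, network shapes, and loss are hypothetical stand-ins, not the paper's design.

```python
import torch
import torch.nn as nn


class PosEncoder(nn.Module):
    """Stamp an imperceptible, learned, spatially-varying residual onto images."""

    def __init__(self, h=64, w=64, eps=0.02):
        super().__init__()
        self.residual = nn.Parameter(torch.zeros(1, 1, h, w))
        self.eps = eps                                 # perceptibility budget

    def forward(self, img):
        return img + self.eps * torch.tanh(self.residual)  # bounded perturbation


class PosDecoder(nn.Module):
    """Regress a crop's normalized (row, col) origin from its pixels alone."""

    def __init__(self, crop=16):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(crop * crop, 256),
                                 nn.ReLU(), nn.Linear(256, 2))

    def forward(self, patch):
        return torch.sigmoid(self.net(patch))         # (row, col) in [0, 1]


# Joint training: random crops of stamped images must reveal their origin.
enc, dec = PosEncoder(), PosDecoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
imgs = torch.rand(8, 1, 64, 64)
for _ in range(200):
    stamped = enc(imgs)
    y, x = torch.randint(0, 49, (2,)).tolist()         # crop origin in [0, 48]
    target = torch.tensor([[y / 48, x / 48]]).expand(8, -1)
    loss = nn.functional.mse_loss(dec(stamped[:, :, y:y + 16, x:x + 16]), target)
    opt.zero_grad(); loss.backward(); opt.step()
```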
arXiv Detail & Related papers (2021-04-25T23:42:29Z)
- Swapping Autoencoder for Deep Image Manipulation [94.33114146172606]
We propose the Swapping Autoencoder, a deep model designed specifically for image manipulation.
The key idea is to encode an image with two independent components and enforce that any swapped combination maps to a realistic image.
Experiments on multiple datasets show that our model produces better results and is substantially more efficient compared to recent generative models.
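The sketch below illustrates the two-component encoding: one encoder produces a spatial "structure" code, another a global "texture" vector, and a decoder reconstructs from any (structure, texture) pairing, so the codes of two different images can be swapped. The real model additionally trains a GAN discriminator so that swapped combinations look realistic; that part is omitted here, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn


class SwapAE(nn.Module):
    """Toy swapping autoencoder: spatial structure code + global texture code."""

    def __init__(self, dim=64):
        super().__init__()
        self.enc_structure = nn.Sequential(
            nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(), nn.Conv2d(dim, 8, 3, 1, 1))
        self.enc_texture = nn.Sequential(
            nn.Conv2d(3, dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(dim, dim))
        self.dec = nn.Sequential(
            nn.Conv2d(8 + dim, dim, 3, 1, 1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x_struct, x_texture):
        s = self.enc_structure(x_struct)               # (B, 8, H/2, W/2) spatial map
        t = self.enc_texture(x_texture)                # (B, dim) global vector
        t_map = t[:, :, None, None].expand(-1, -1, s.size(2), s.size(3))
        return self.dec(torch.cat([s, t_map], dim=1))  # hybrid reconstruction


# Usage: keep the layout of image a, but render it with the texture of image b.
model = SwapAE()
a, b = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
hybrid = model(a, b)                                   # -> (2, 3, 64, 64)
```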
arXiv Detail & Related papers (2020-07-01T17:59:57Z)
- Modeling Lost Information in Lossy Image Compression [72.69327382643549]
Lossy image compression is one of the most commonly used operators for digital images.
We propose a novel invertible framework called Invertible Lossy Compression (ILC) to largely mitigate the information loss problem.
arXiv Detail & Related papers (2020-06-22T04:04:56Z)