DocBinFormer: A Two-Level Transformer Network for Effective Document
Image Binarization
- URL: http://arxiv.org/abs/2312.03568v1
- Date: Wed, 6 Dec 2023 16:01:29 GMT
- Title: DocBinFormer: A Two-Level Transformer Network for Effective Document
Image Binarization
- Authors: Risab Biswas, Swalpa Kumar Roy, Ning Wang, Umapada Pal, Guang-Bin
Huang
- Abstract summary: Document binarization is a fundamental and crucial step for achieving optimal performance in any document analysis task.
We propose DocBinFormer, a novel two-level vision transformer (TL-ViT) architecture for effective document image binarization.
- Score: 17.087982099845156
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In real life, various degradation scenarios can damage document
images, making them harder to recognize and analyze; binarization is therefore
a fundamental and crucial step toward optimal performance in any document
analysis task. We propose DocBinFormer (Document Binarization Transformer), a
novel two-level vision transformer (TL-ViT) architecture for effective document
image binarization. The presented architecture employs a two-level transformer
encoder to effectively capture both global and local feature representations
from the input images. These complementary bi-level features are exploited for
efficient document image binarization, yielding improved results for both
system-generated and handwritten document images. In the absence of
convolutional layers, the transformer encoder operates directly on pixel
patches and sub-patches along with their positional information, while the
decoder generates a clean (binarized) output image from the latent
representation of the patches. Instead of using a single vision transformer
block to extract information from the image patches, the proposed architecture
uses two transformer blocks for greater coverage of the extracted feature space
at the global and local scales. The encoded feature representation is used by
the decoder block to generate the corresponding binarized output.
Extensive experiments on a variety of DIBCO and H-DIBCO benchmarks show that
the proposed model outperforms state-of-the-art techniques on four metrics. The
source code will be made available at
https://github.com/RisabBiswas/DocBinFormer.
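
To make the two-level design concrete, here is a minimal PyTorch sketch of how a patch/sub-patch transformer encoder with positional embeddings and a patch-wise decoding head could be wired up. All names, dimensions, the mean-pooling fusion of sub-patch tokens, and the linear decoding head are illustrative assumptions, not the authors' released implementation (see the repository above for that).

```python
# Illustrative sketch only: a two-level (patch + sub-patch) transformer
# binarizer. Module names, sizes, and the fusion scheme are assumptions.
import torch
import torch.nn as nn

def make_encoder(dim, depth, heads):
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=depth)

class TwoLevelViTBinarizer(nn.Module):
    """A global encoder over 16x16 patches and a local encoder over 8x8
    sub-patches, fused token-wise before patch-level decoding."""

    def __init__(self, img=256, patch=16, sub=8, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch, self.sub = patch, sub
        n_patch, n_sub = (img // patch) ** 2, (img // sub) ** 2
        # Linear embeddings of flattened pixel patches (no convolutions).
        self.embed_patch = nn.Linear(patch * patch, dim)
        self.embed_sub = nn.Linear(sub * sub, dim)
        # Learned positional embeddings, one per token level.
        self.pos_patch = nn.Parameter(torch.zeros(1, n_patch, dim))
        self.pos_sub = nn.Parameter(torch.zeros(1, n_sub, dim))
        self.enc_global = make_encoder(dim, depth, heads)  # patch level
        self.enc_local = make_encoder(dim, depth, heads)   # sub-patch level
        # A linear head stands in for the paper's decoder: it maps each
        # fused token back to one binarized pixel patch.
        self.decode = nn.Linear(2 * dim, patch * patch)

    def tokenize(self, x, size):
        # (B, 1, H, W) -> (B, num_tokens, size*size) flattened patches.
        t = x.unfold(2, size, size).unfold(3, size, size)
        return t.reshape(x.shape[0], -1, size * size)

    def forward(self, x):                      # x: (B, 1, H, W), grayscale
        b, _, h, w = x.shape
        g = self.enc_global(self.embed_patch(self.tokenize(x, self.patch))
                            + self.pos_patch)
        l = self.enc_local(self.embed_sub(self.tokenize(x, self.sub))
                           + self.pos_sub)
        # Average the local tokens falling inside each coarse patch, then
        # concatenate each result with the matching global token.
        r, hp, wp = self.patch // self.sub, h // self.patch, w // self.patch
        l = l.reshape(b, hp, r, wp, r, -1).mean(dim=(2, 4))
        l = l.reshape(b, hp * wp, -1)
        out = torch.sigmoid(self.decode(torch.cat([g, l], dim=-1)))
        # Fold the predicted patches back into a full-resolution image.
        out = out.reshape(b, hp, wp, self.patch, self.patch)
        return out.permute(0, 1, 3, 2, 4).reshape(b, 1, h, w)

model = TwoLevelViTBinarizer()
y = model(torch.rand(2, 1, 256, 256))          # -> (2, 1, 256, 256) in [0, 1]
```

The final sigmoid keeps each predicted patch in [0, 1], so thresholding at 0.5 would yield the binary map; in this sketch the global and local streams see the same pixels at two granularities, which is one plausible reading of the paper's "greater coverage of the feature space".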
Related papers
- Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to de-blur and then binarize the images in real time.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z)
- A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement [13.27528507177775]
We propose T2T-BinFormer, a novel document binarization encoder-decoder architecture based on a Tokens-to-Token vision transformer.
Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
arXiv Detail & Related papers (2023-12-06T23:01:11Z)
- TransY-Net: Learning Fully Transformer Networks for Change Detection of Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Xformer: Hybrid X-Shaped Transformer for Image Denoising [114.37510775636811]
We present a hybrid X-shaped vision Transformer, named Xformer, which performs notably well on image denoising tasks.
Xformer achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.
arXiv Detail & Related papers (2023-03-11T16:32:09Z)
- Document Image Binarization in JPEG Compressed Domain using Dual Discriminator Generative Adversarial Networks [0.0]
The proposed model has been thoroughly tested on different versions of the DIBCO dataset, which pose challenges such as holes, erased or smudged ink, dust, and misplaced fibres.
The model proved highly robust, efficient in both time and space complexity, and achieved state-of-the-art performance in the JPEG compressed domain.
arXiv Detail & Related papers (2022-09-13T12:07:32Z)
- DocEnTr: An End-to-End Document Image Enhancement Transformer [13.108797370734893]
Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties.
We present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images.
arXiv Detail & Related papers (2022-01-25T11:45:35Z)
- Towards End-to-End Image Compression and Analysis with Transformers [99.50111380056043]
We propose an end-to-end image compression and analysis model with Transformers, targeting cloud-based image classification applications.
We aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer.
Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.
arXiv Detail & Related papers (2021-12-17T03:28:14Z)
- Uformer: A General U-Shaped Transformer for Image Restoration [47.60420806106756]
We build a hierarchical encoder-decoder network using the Transformer block for image restoration.
Experiments on several image restoration tasks demonstrate the superiority of Uformer.
arXiv Detail & Related papers (2021-06-06T12:33:22Z)
- Two-stage generative adversarial networks for document image binarization with color noise and background removal [7.639067237772286]
We propose a two-stage color document image enhancement and binarization method using generative adversarial neural networks.
In the first stage, four color-independent adversarial networks are trained to extract color foreground information from an input image.
In the second stage, two independent adversarial networks with global and local features are trained for image binarization of documents of variable size.
arXiv Detail & Related papers (2020-10-20T07:51:50Z)
- Two-stream Encoder-Decoder Network for Localizing Image Forgeries [4.982505311411925]
We propose a novel two-stream encoder-decoder network, which utilizes both high-level and low-level image features.
We have carried out experimental analysis on multiple standard forensics datasets to evaluate the performance of the proposed method.
arXiv Detail & Related papers (2020-09-27T15:49:17Z)