Related papers: A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement

A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement

URL: http://arxiv.org/abs/2312.03946v1
Date: Wed, 6 Dec 2023 23:01:11 GMT
Title: A Layer-Wise Tokens-to-Token Transformer Network for Improved Historical Document Image Enhancement
Authors: Risab Biswas, Swalpa Kumar Roy, Umapada Pal
Abstract summary: We propose textbfT2T-BinFormer which is a novel document binarization encoder-decoder architecture based on a Tokens-to-token vision transformer. Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods.
Score: 13.27528507177775
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Document image enhancement is a fundamental and important stage for attaining the best performance in any document analysis assignment because there are many degradation situations that could harm document images, making it more difficult to recognize and analyze them. In this paper, we propose \textbf{T2T-BinFormer} which is a novel document binarization encoder-decoder architecture based on a Tokens-to-token vision transformer. Each image is divided into a set of tokens with a defined length using the ViT model, which is then applied several times to model the global relationship between the tokens. However, the conventional tokenization of input data does not adequately reflect the crucial local structure between adjacent pixels of the input image, which results in low efficiency. Instead of using a simple ViT and hard splitting of images for the document image enhancement task, we employed a progressive tokenization technique to capture this local information from an image to achieve more effective results. Experiments on various DIBCO and H-DIBCO benchmarks demonstrate that the proposed model outperforms the existing CNN and ViT-based state-of-the-art methods. In this research, the primary area of examination is the application of the proposed architecture to the task of document binarization. The source code will be made available at https://github.com/RisabBiswas/T2T-BinFormer.

Related papers

High-Resolution Image Synthesis via Next-Token Prediction [13.131691198804127]
We introduce D-JEPA$cdot$T2I, an extension of D-JEPA incorporating flow matching loss to enable continuous resolution learning. For the first time, we achieve state-of-the-art textbfhigh-resolution image synthesis via next-token prediction.
arXiv Detail & Related papers (2024-11-22T09:08:58Z)
UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model. We show that UNIT significantly outperforms existing methods on document-related tasks. Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally. This neglects their different informativeness and leads to a significant increase in the number of image tokens. We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
DocBinFormer: A Two-Level Transformer Network for Effective Document Image Binarization [17.087982099845156]
Document binarization is a fundamental and crucial step for achieving the most optimal performance in any document analysis task. We propose DocBinFormer, a novel two-level vision transformer (TL-ViT) architecture based on vision transformers for effective document image binarization.
arXiv Detail & Related papers (2023-12-06T16:01:29Z)
Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts. Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities. We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed textbfHierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
DocMAE: Document Image Rectification via Self-supervised Representation Learning [144.44748607192147]
We present DocMAE, a novel self-supervised framework for document image rectification. We first mask random patches of the background-excluded document images and then reconstruct the missing pixels. With such a self-supervised learning approach, the network is encouraged to learn the intrinsic structure of deformed documents.
arXiv Detail & Related papers (2023-04-20T14:27:15Z)
Deep Unrestricted Document Image Rectification [110.61517455253308]
We present DocTr++, a novel unified framework for document image rectification. We upgrade the original architecture by adopting a hierarchical encoder-decoder structure for multi-scale representation extraction and parsing. We contribute a real-world test set and metrics applicable for evaluating the rectification quality.
arXiv Detail & Related papers (2023-04-18T08:00:54Z)
StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training [64.37272287179661]
StrucTexTv2 is an effective document image pre-training framework. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction.
arXiv Detail & Related papers (2023-03-01T07:32:51Z)
DocEnTr: An End-to-End Document Image Enhancement Transformer [13.108797370734893]
Document images can be affected by many degradation scenarios, which cause recognition and processing difficulties. We present a new encoder-decoder architecture based on vision transformers to enhance both machine-printed and handwritten document images.
arXiv Detail & Related papers (2022-01-25T11:45:35Z)
RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval. We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
Two-stage generative adversarial networks for document image binarization with color noise and background removal [7.639067237772286]
We propose a two-stage color document image enhancement and binarization method using generative adversarial neural networks. In the first stage, four color-independent adversarial networks are trained to extract color foreground information from an input image. In the second stage, two independent adversarial networks with global and local features are trained for image binarization of documents of variable size.
arXiv Detail & Related papers (2020-10-20T07:51:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.