HTR-VT: Handwritten Text Recognition with Vision Transformer
- URL: http://arxiv.org/abs/2409.08573v1
- Date: Fri, 13 Sep 2024 06:46:23 GMT
- Title: HTR-VT: Handwritten Text Recognition with Vision Transformer
- Authors: Yuting Li, Dexiong Chen, Tinglong Tang, Xi Shen
- Abstract summary: We explore the application of Vision Transformer (ViT) for handwritten text recognition.
Previous transformer-based models required external data or extensive pre-training on large datasets to excel.
We find that incorporating a Convolutional Neural Network (CNN) for feature extraction instead of the original patch embedding, and employing the Sharpness-Aware Minimization (SAM) optimizer, ensures that the model can converge towards flatter minima.
- Score: 7.997204893256558
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore the application of Vision Transformer (ViT) for handwritten text recognition. The limited availability of labeled data in this domain poses challenges for achieving high performance relying solely on ViT. Previous transformer-based models required external data or extensive pre-training on large datasets to excel. To address this limitation, we introduce a data-efficient ViT method that uses only the encoder of the standard transformer. We find that incorporating a Convolutional Neural Network (CNN) for feature extraction instead of the original patch embedding, and employing the Sharpness-Aware Minimization (SAM) optimizer so that the model can converge towards flatter minima, yield notable enhancements. Furthermore, our introduction of the span mask technique, which masks interconnected features in the feature map, acts as an effective regularizer. Empirically, our approach competes favorably with traditional CNN-based models on small datasets like IAM and READ2016. Additionally, it establishes a new benchmark on the LAM dataset, currently the largest dataset with 19,830 training text lines. The code is publicly available at: https://github.com/YutingLi0606/HTR-VT.
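As a rough illustration of two ideas named in the abstract, the Python sketch below shows span masking over a feature sequence and a single SAM update on a toy loss. It is not taken from the HTR-VT repository: the function names span_mask and sam_step, the zero-fill masking, and the values mask_ratio=0.4, span_len=4, lr=0.1, and rho=0.05 are all assumptions chosen for illustration.

```python
import numpy as np


def span_mask(features, mask_ratio=0.4, span_len=4, rng=None):
    """Zero out contiguous spans along the sequence axis of a (T, D) feature map.

    Assumption: masked positions are filled with zeros and the default
    mask_ratio / span_len are illustrative, not the paper's settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    seq_len = features.shape[0]
    n_spans = int(seq_len * mask_ratio / span_len)
    keep = np.ones(seq_len, dtype=bool)
    if n_spans > 0:
        # Pick distinct start indices; the spans themselves may still overlap.
        starts = rng.choice(seq_len - span_len + 1, size=n_spans, replace=False)
        for start in starts:
            keep[start:start + span_len] = False
    return features * keep[:, None], keep


def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM update: perturb w towards higher loss, then descend from w."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # perturbation of norm ~rho
    g_sharp = grad_fn(w + eps)                   # gradient at perturbed weights
    return w - lr * g_sharp


# Toy usage: a (32, 8) feature map and the quadratic loss ||w||^2.
feats = np.ones((32, 8))
masked, keep = span_mask(feats, rng=np.random.default_rng(0))
w_new = sam_step(np.array([1.0, -2.0]), grad_fn=lambda w: 2 * w)
```

Masking whole spans rather than isolated positions removes neighbouring features together, so the model cannot simply interpolate a hidden feature from its immediate neighbours. The SAM step evaluates the gradient at a slightly perturbed point, which is what biases training towards flatter minima.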
Related papers
- HSViT: Horizontally Scalable Vision Transformer [16.46308352393693]
Vision Transformer (ViT) requires pre-training on large-scale datasets to perform well.
This paper introduces a novel horizontally scalable vision transformer (HSViT) scheme.
HSViT achieves up to 10% higher top-1 accuracy than state-of-the-art schemes on small datasets.
arXiv Detail & Related papers (2024-04-08T04:53:29Z)
- TransAxx: Efficient Transformers with Approximate Computing [4.347898144642257]
Vision Transformer (ViT) models have been shown to be very competitive and have often become a popular alternative to Convolutional Neural Networks (CNNs).
We propose TransAxx, a framework based on the popular PyTorch library that enables fast inherent support for approximate arithmetic.
Our approach uses a Monte Carlo Tree Search (MCTS) algorithm to efficiently search the space of possible configurations.
arXiv Detail & Related papers (2024-02-12T10:16:05Z)
- VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z)
- Exploring Efficient Few-shot Adaptation for Vision Transformers [70.91692521825405]
We propose a novel efficient Transformer Tuning (eTT) method that facilitates finetuning ViTs in the Few-shot Learning tasks.
Key novelties come from the newly presented Attentive Prefix Tuning (APT) and Domain Residual Adapter (DRA).
We conduct extensive experiments to show the efficacy of our model.
arXiv Detail & Related papers (2023-01-06T08:42:05Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- VLDeformer: Learning Visual-Semantic Embeddings by Vision-Language Transformer Decomposing [7.890230091463883]
Vision-language transformers (VL transformers) have shown impressive accuracy in cross-modal retrieval.
We propose a novel Vision-language Transformer Decomposing (VLDeformer) to modify the VL transformer as an individual encoder for a single image or text.
arXiv Detail & Related papers (2021-10-20T09:00:51Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- T-VSE: Transformer-Based Visual Semantic Embedding [5.317624228510748]
We show that transformer-based cross-modal embeddings outperform word average and RNN-based embeddings by a large margin, when trained on a large dataset of e-commerce product image-title pairs.
arXiv Detail & Related papers (2020-05-17T23:40:33Z)