Pure Transformer with Integrated Experts for Scene Text Recognition
- URL: http://arxiv.org/abs/2211.04963v1
- Date: Wed, 9 Nov 2022 15:26:59 GMT
- Title: Pure Transformer with Integrated Experts for Scene Text Recognition
- Authors: Yew Lee Tan, Adams Wai-kin Kong, Jung-Jae Kim
- Abstract summary: Scene text recognition (STR) involves the task of reading text in cropped images of natural scenes.
In recent times, the transformer architecture has been widely adopted in STR because of its strong capability in capturing long-term dependencies.
This work proposes the use of a transformer-only model as a simple baseline which outperforms hybrid CNN-transformer models.
- Score: 11.089203218000854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text recognition (STR) involves the task of reading text in cropped
images of natural scenes. Conventional STR models employ a convolutional neural
network (CNN) followed by a recurrent neural network in an encoder-decoder framework.
In recent times, the transformer architecture has been widely adopted in STR because
of its strong capability in capturing the long-term dependencies that appear to be
prominent in scene text images. Many researchers utilize the transformer as part of a
hybrid CNN-transformer encoder, often followed by a transformer decoder. However,
such methods only make use of the long-term dependency mid-way through the encoding
process. Although the vision transformer (ViT) is able to capture such dependencies
at an early stage, it remains largely unexploited in STR. This work proposes the use of a
transformer-only model as a simple baseline which outperforms hybrid
CNN-transformer models. Furthermore, two key areas for improvement were
identified. Firstly, the first decoded character has the lowest prediction
accuracy. Secondly, images of different original aspect ratios react
differently to the patch resolutions, while ViT employs only one fixed patch
resolution. To explore these areas, Pure Transformer with Integrated Experts
(PTIE) is proposed. PTIE is a transformer model that can process multiple patch
resolutions and decode in both the original and reverse character orders. It is
examined on 7 commonly used benchmarks and compared with over 20
state-of-the-art methods. The experimental results show that the proposed
method outperforms them and obtains state-of-the-art results in most
benchmarks.
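As a rough illustration of the two ideas highlighted in the abstract, the sketch below embeds the same image at two patch resolutions, encodes the concatenated token sequence with a plain transformer encoder, and runs a transformer decoder over both the original and the reversed character order. Patch sizes, layer counts, and the way the two resolution streams are combined are illustrative assumptions rather than the authors' exact PTIE design.

```python
import torch
import torch.nn as nn


class MultiPatchEncoder(nn.Module):
    """ViT-style encoder that embeds one image at several patch resolutions."""

    def __init__(self, img_h=32, img_w=128, patch_sizes=((4, 8), (8, 4)), dim=256):
        super().__init__()
        # One convolutional patch embedding per patch resolution (illustrative choice).
        self.embeds = nn.ModuleList(
            [nn.Conv2d(3, dim, kernel_size=p, stride=p) for p in patch_sizes]
        )
        self.pos = nn.ParameterList(
            [
                nn.Parameter(0.02 * torch.randn(1, (img_h // p[0]) * (img_w // p[1]), dim))
                for p in patch_sizes
            ]
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, images):  # images: (B, 3, img_h, img_w)
        tokens = []
        for embed, pos in zip(self.embeds, self.pos):
            t = embed(images).flatten(2).transpose(1, 2) + pos  # (B, N_patches, dim)
            tokens.append(t)
        # Concatenate the token sequences of all patch resolutions (assumption).
        return self.encoder(torch.cat(tokens, dim=1))


class BidirectionalDecoder(nn.Module):
    """Transformer decoder run on both the original and the reversed character order."""

    def __init__(self, vocab_size=100, dim=256, max_len=25):
        super().__init__()
        self.char_embed = nn.Embedding(vocab_size, dim)
        self.pos = nn.Parameter(0.02 * torch.randn(1, max_len, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, memory, targets):  # memory: (B, N, dim), targets: (B, T) token ids
        logits = []
        for tgt in (targets, targets.flip(dims=[1])):  # left-to-right and right-to-left
            x = self.char_embed(tgt) + self.pos[:, : tgt.size(1)]
            # Causal mask so each position attends only to earlier characters.
            T = tgt.size(1)
            mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
            logits.append(self.head(self.decoder(x, memory, tgt_mask=mask)))
        return logits  # one prediction per reading direction


# Minimal usage example with random data.
encoder, decoder = MultiPatchEncoder(), BidirectionalDecoder()
memory = encoder(torch.randn(2, 3, 32, 128))
fwd_logits, bwd_logits = decoder(memory, torch.randint(0, 100, (2, 10)))
```

In this reading, the reversed pass gives the last character the benefit of being decoded first, which is one plausible way to address the weak first-character prediction noted in the abstract; the exact expert and weight-sharing scheme of PTIE is not reproduced here.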
Related papers
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions [1.1032962642000486]
This work builds on the Vision Transformer, combining it with a pyramid architecture and a split-transform-merge strategy to propose a group encoder; the resulting network is named the Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Fully Transformer Networks for Semantic Image Segmentation [26.037770622551882]
We explore a novel framework for semantic image segmentation: encoder-decoder based Fully Transformer Networks (FTN).
We propose a Pyramid Group Transformer (PGT) as the encoder for progressively learning hierarchical features, while reducing the computation complexity of the standard visual transformer (ViT).
Then, we propose a Feature Pyramid Transformer (FPT) to fuse semantic-level and spatial-level information from multiple levels of the PGT encoder for semantic image segmentation.
arXiv Detail & Related papers (2021-06-08T05:15:28Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
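The decoder simplification described above lends itself to a very small sketch; the function below is one illustrative reading of it (tensor shapes and the scaling factor are assumptions), weighting the values directly by the scaled query-key similarity instead of a softmax-normalised attention map.

```python
import torch


def query_key_attention(q, k, v):
    """Attention step without softmax: values weighted by raw query-key similarity.

    q, k, v: tensors of shape (batch, seq_len, dim). Scaling by sqrt(dim) is an
    illustrative assumption, not necessarily the paper's exact formulation.
    """
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # (batch, q_len, k_len)
    return scores @ v  # no softmax normalisation, only the similarity weighting
```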
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Swin-Unet: Unet-like Pure Transformer for Medical Image Segmentation [63.46694853953092]
Swin-Unet is an Unet-like pure Transformer for medical image segmentation.
Tokenized image patches are fed into the Transformer-based U-shaped Encoder-Decoder architecture.
arXiv Detail & Related papers (2021-05-12T09:30:26Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)