Visual-Semantic Transformer for Scene Text Recognition
- URL: http://arxiv.org/abs/2112.00948v1
- Date: Thu, 2 Dec 2021 02:59:56 GMT
- Title: Visual-Semantic Transformer for Scene Text Recognition
- Authors: Xin Tang and Yongquan Lai and Ying Liu and Yuanyuan Fu and Rui Fang
- Abstract summary: We propose to model semantic and visual information jointly with a Visual-Semantic Transformer (VST).
The VST first explicitly extracts primary semantic information from visual feature maps.
The semantic information is then joined with the visual feature maps to form a pseudo multi-domain sequence.
- Score: 5.323568551229187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling semantic information is helpful for scene text recognition. In this
work, we propose to model semantic and visual information jointly with a
Visual-Semantic Transformer (VST). The VST first explicitly extracts primary
semantic information from visual feature maps with a transformer module and a
primary visual-semantic alignment module. The semantic information is then
joined with the visual feature maps (viewed as a sequence) to form a pseudo
multi-domain sequence combining visual and semantic information, which is
subsequently fed into a transformer-based interaction module to enable
learning of interactions between visual and semantic features. In this way, the
visual features can be enhanced by the semantic information and vice versa.
The enhanced visual features are further decoded by a secondary
visual-semantic alignment module that shares weights with the primary one.
Finally, the decoded visual features and the enhanced semantic features are
jointly processed by a third transformer module to obtain the final text
prediction. Experiments on seven public benchmarks covering both regular and
irregular text recognition datasets verify the effectiveness of the proposed
model, which reaches the state of the art on four of the seven benchmarks.
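The pipeline described in the abstract can be made concrete with a short structural sketch. Below is a minimal PyTorch sketch, assuming a CNN backbone that already produces the visual feature maps; the module names (VisualSemanticAlignment, VST), the learned-query cross-attention used for alignment, and all dimensions, sequence lengths, and layer counts are illustrative assumptions rather than the paper's exact architecture.

```python
# Minimal structural sketch of the VST pipeline described in the abstract.
# Layer choices, dimensions, and the alignment mechanism are assumptions for
# illustration; the paper's exact architecture and hyper-parameters may differ.
import torch
import torch.nn as nn


class VisualSemanticAlignment(nn.Module):
    """Maps a visual sequence to a fixed-length semantic sequence.

    Implemented here as cross-attention with learned semantic queries (an
    assumption). One instance is reused twice to mimic the weight sharing
    between the primary and secondary alignment modules.
    """

    def __init__(self, dim: int, num_chars: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_chars, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_seq: torch.Tensor) -> torch.Tensor:
        # visual_seq: (B, L, D) -> semantic_seq: (B, T, D)
        q = self.queries.unsqueeze(0).expand(visual_seq.size(0), -1, -1)
        out, _ = self.attn(q, visual_seq, visual_seq)
        return out


class VST(nn.Module):
    def __init__(self, dim: int = 256, num_chars: int = 25, num_classes: int = 97):
        super().__init__()
        enc_layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        self.visual_transformer = nn.TransformerEncoder(enc_layer(), num_layers=2)
        self.alignment = VisualSemanticAlignment(dim, num_chars)  # shared weights
        self.interaction = nn.TransformerEncoder(enc_layer(), num_layers=2)
        self.fusion_transformer = nn.TransformerEncoder(enc_layer(), num_layers=2)
        self.classifier = nn.Linear(dim, num_classes)
        self.num_chars = num_chars

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) from a CNN backbone, viewed as a sequence (B, H*W, C)
        B, C, H, W = feat_map.shape
        visual_seq = feat_map.flatten(2).transpose(1, 2)

        # 1) Primary semantic extraction: transformer + primary alignment module.
        visual_seq = self.visual_transformer(visual_seq)
        semantic_seq = self.alignment(visual_seq)            # (B, T, D)

        # 2) Concatenate into a pseudo multi-domain sequence so the interaction
        #    module can exchange information between the two domains.
        joint = torch.cat([visual_seq, semantic_seq], dim=1)
        joint = self.interaction(joint)
        visual_enh, semantic_enh = joint[:, :H * W], joint[:, H * W:]

        # 3) Secondary alignment (weights shared with the primary one) decodes
        #    the enhanced visual features into a semantic-length sequence.
        decoded = self.alignment(visual_enh)

        # 4) Third transformer jointly processes decoded visual and enhanced
        #    semantic features to produce the final character predictions.
        fused = self.fusion_transformer(torch.cat([decoded, semantic_enh], dim=1))
        return self.classifier(fused[:, :self.num_chars])    # (B, T, num_classes)


if __name__ == "__main__":
    model = VST()
    logits = model(torch.randn(2, 256, 8, 32))   # dummy backbone features
    print(logits.shape)                          # torch.Size([2, 25, 97])
```

Reusing one alignment instance for both the primary and secondary stages reflects the weight sharing stated in the abstract; the 25-position semantic length and 97 output classes are placeholders.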
Related papers
- Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning [56.16593809016167]
We propose a novel Visual-Augmented Dynamic Semantic prototype method (termed VADS) to boost the generator to learn accurate semantic-visual mapping.
VADS consists of two modules: (1) Visual-aware Domain Knowledge Learning module (VDKL) learns the local bias and global prior of the visual features, which replace pure Gaussian noise to provide richer prior noise information; (2) Vision-Oriented Semantic Updation module (VOSU) updates the semantic prototype according to the visual representations of the samples.
arXiv Detail & Related papers (2024-04-23T07:39:09Z)
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT)
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, visual tokens with low semantic-visual correspondence are fused to discard semantically unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z)
- Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning [74.48337375174297]
Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain.
We deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between prototypes and visual features.
DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling the recast of the unmatched semantic-visual pair into the matched one.
arXiv Detail & Related papers (2023-03-27T15:21:43Z)
- VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning [19.73126931526359]
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling.
We first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities including (i) a global visual environment; (ii) local visual main agents; (iii) linguistic scene elements.
We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video.
arXiv Detail & Related papers (2022-11-28T07:39:20Z)
- TransZero++: Cross Attribute-Guided Transformer for Zero-Shot Learning [119.43299939907685]
Zero-shot learning (ZSL) tackles the novel class recognition problem by transferring semantic knowledge from seen classes to unseen ones.
Existing attention-based models tend to learn inferior region features from a single image because they rely solely on unidirectional attention.
We propose a cross attribute-guided Transformer network, termed TransZero++, to refine visual features and learn accurate attribute localization for semantic-augmented visual embedding representations.
arXiv Detail & Related papers (2021-12-16T05:49:51Z)
- Learning Semantic-Aligned Feature Representation for Text-based Person Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performances.
arXiv Detail & Related papers (2021-12-13T14:54:38Z)
- Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation [25.57530524167637]
Visual dialogue needs to answer a series of coherent questions on the basis of understanding the visual environment.
Visual grounding aims to explicitly locate related objects in the image guided by textual entities.
The multimodal incremental transformer encodes the multi-turn dialogue history together with the visual scene step by step, following the order of the dialogue, and then generates a contextually and visually coherent response.
arXiv Detail & Related papers (2021-09-17T11:39:29Z)
- Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model.
During the training phase, the modality transition network is optimised by the proposed modality loss.
Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)
- Semantic Disentangling Generalized Zero-Shot Learning [50.259058462272435]
Generalized Zero-Shot Learning (GZSL) aims to recognize images from both seen and unseen categories.
In this paper, we propose a novel feature disentangling approach based on an encoder-decoder architecture.
The proposed model aims to distill quality semantic-consistent representations that capture intrinsic features of seen images.
arXiv Detail & Related papers (2021-01-20T05:46:21Z)