Multi-modal Text Recognition Networks: Interactive Enhancements between
Visual and Semantic Features
- URL: http://arxiv.org/abs/2111.15263v1
- Date: Tue, 30 Nov 2021 10:22:11 GMT
- Title: Multi-modal Text Recognition Networks: Interactive Enhancements between
Visual and Semantic Features
- Authors: Byeonghu Na, Yoonsik Kim, Sungrae Park
- Abstract summary: This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN).
MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features.
Our experiments demonstrate that MATRN achieves state-of-the-art performances on seven benchmarks with large margins.
- Score: 11.48760300147023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linguistic knowledge has brought great benefits to scene text recognition by
providing semantics to refine character sequences. However, since linguistic
knowledge has been applied individually on the output sequence, previous
methods have not fully utilized the semantics to understand visual clues for
text recognition. This paper introduces a novel method, called Multi-modAl Text
Recognition Network (MATRN), that enables interactions between visual and
semantic features for better recognition performances. Specifically, MATRN
identifies visual and semantic feature pairs and encodes spatial information
into semantic features. Based on the spatial encoding, visual and semantic
features are enhanced by referring to related features in the other modality.
Furthermore, MATRN stimulates combining semantic features into visual features
by hiding visual clues related to the character in the training phase. Our
experiments demonstrate that MATRN achieves state-of-the-art performances on
seven benchmarks with large margins, while naive combinations of two modalities
show marginal improvements. Further ablative studies prove the effectiveness of
our proposed components. Our implementation will be publicly available.
Related papers
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationship among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition [47.11517266162346]
We propose a Prompt-driven Visual-Linguistic Representation Learning framework to better leverage the capabilities of the linguistic modality.
In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features.
arXiv Detail & Related papers (2024-01-31T14:39:11Z)
- CMFN: Cross-Modal Fusion Network for Irregular Scene Text Recognition [22.13675752628]
We propose a novel cross-modal fusion network (CMFN) for irregular scene text recognition.
CMFN consists of a position self-enhanced encoder, a visual recognition branch and an iterative semantic recognition branch.
Experiments demonstrate that the proposed CMFN algorithm achieves comparable performance to state-of-the-art algorithms.
arXiv Detail & Related papers (2024-01-18T15:05:57Z)
- CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification [39.262536758248245]
Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
arXiv Detail & Related papers (2024-01-11T10:20:13Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We build on contrastive learning-based vision-language pre-training approaches such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Learning Semantic-Aligned Feature Representation for Text-based Person Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performances.
arXiv Detail & Related papers (2021-12-13T14:54:38Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
GSRM is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.