Multi-modal Text Recognition Networks: Interactive Enhancements between
Visual and Semantic Features
- URL: http://arxiv.org/abs/2111.15263v1
- Date: Tue, 30 Nov 2021 10:22:11 GMT
- Title: Multi-modal Text Recognition Networks: Interactive Enhancements between
Visual and Semantic Features
- Authors: Byeonghu Na, Yoonsik Kim, Sungrae Park
- Abstract summary: This paper introduces a novel method, called Multi-modAl Text Recognition Network (MATRN).
MATRN identifies visual and semantic feature pairs and encodes spatial information into semantic features.
Our experiments demonstrate that MATRN achieves state-of-the-art performance on seven benchmarks by large margins.
- Score: 11.48760300147023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Linguistic knowledge has brought great benefits to scene text recognition by
providing semantics to refine character sequences. However, since linguistic
knowledge has been applied individually on the output sequence, previous
methods have not fully utilized the semantics to understand visual clues for
text recognition. This paper introduces a novel method, called Multi-modAl Text
Recognition Network (MATRN), that enables interactions between visual and
semantic features for better recognition performance. Specifically, MATRN
identifies visual and semantic feature pairs and encodes spatial information
into semantic features. Based on the spatial encoding, visual and semantic
features are enhanced by referring to related features in the other modality.
Furthermore, MATRN encourages the fusion of semantic features into visual
features by hiding visual clues related to the characters during training. Our
experiments demonstrate that MATRN achieves state-of-the-art performance on
seven benchmarks by large margins, whereas naive combinations of the two
modalities yield only marginal improvements. Further ablation studies confirm
the effectiveness of the proposed components. Our implementation will be
publicly available.
Related papers
- Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities.
Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics.
We propose an Embedding and Enriching Explicit Semantics framework to learn semantically rich cross-modality pedestrian representations.
arXiv Detail & Related papers (2024-12-11T14:27:30Z)
- Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment.
AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
- Dual Relation Mining Network for Zero-Shot Learning [48.89161627050706]
We propose a Dual Relation Mining Network (DRMN) to enable effective visual-semantic interactions and learn semantic relationship among attributes for knowledge transfer.
Specifically, we introduce a Dual Attention Block (DAB) for visual-semantic relationship mining, which enriches visual information by multi-level feature fusion.
For semantic relationship modeling, we utilize a Semantic Interaction Transformer (SIT) to enhance the generalization of attribute representations among images.
arXiv Detail & Related papers (2024-05-06T16:31:19Z)
- CLIP-Driven Semantic Discovery Network for Visible-Infrared Person Re-Identification [39.262536758248245]
Cross-modality identity matching poses significant challenges in VIReID.
We propose a CLIP-Driven Semantic Discovery Network (CSDN) that consists of Modality-specific Prompt Learner, Semantic Information Integration, and High-level Semantic Embedding.
arXiv Detail & Related papers (2024-01-11T10:20:13Z)
- Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens [76.40196364163663]
We revisit learning-based vision-language pre-training approaches such as CLIP.
We show that our method can learn more comprehensive representations and capture meaningful cross-modal correspondence.
arXiv Detail & Related papers (2023-03-27T00:58:39Z)
- Learning Semantic-Aligned Feature Representation for Text-based Person Search [8.56017285139081]
We propose a semantic-aligned embedding method for text-based person search.
The feature alignment across modalities is achieved by automatically learning the semantic-aligned visual features and textual features.
Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-13T14:54:38Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN significantly improves the speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Towards Accurate Scene Text Recognition with Semantic Reasoning Networks [52.86058031919856]
We propose a novel end-to-end trainable framework named semantic reasoning network (SRN) for accurate scene text recognition.
A global semantic reasoning module (GSRM) is introduced to capture global semantic context through multi-way parallel transmission.
Results on 7 public benchmarks, including regular text, irregular text and non-Latin long text, verify the effectiveness and robustness of the proposed method.
arXiv Detail & Related papers (2020-03-27T09:19:25Z)
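The SRN entry above describes a global semantic reasoning module that captures semantic context through multi-way parallel transmission. Below is a minimal, hedged sketch of that parallel (non-autoregressive) reasoning idea: approximate character predictions are embedded and refined jointly by a transformer encoder rather than decoded step by step. The class name ParallelSemanticReasoner, the dimensions, and the output format are assumptions for illustration, not the SRN authors' code.

```python
# Minimal sketch (assumed, not SRN's released code) of parallel semantic reasoning:
# embed approximate character predictions and refine them jointly in one pass.
import torch
import torch.nn as nn


class ParallelSemanticReasoner(nn.Module):
    def __init__(self, num_classes=97, dim=512, heads=8, layers=4, max_len=25):
        super().__init__()
        self.embed = nn.Embedding(num_classes, dim)
        self.pos = nn.Parameter(torch.zeros(1, max_len, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.cls = nn.Linear(dim, num_classes)

    def forward(self, char_logits):
        # Take the current (approximate) predictions for all positions at once...
        approx_chars = char_logits.argmax(-1)              # (B, T)
        sem = self.embed(approx_chars) + self.pos          # (B, T, D)
        # ...and let every position attend to every other position in parallel,
        # instead of transmitting semantic context one step at a time.
        sem = self.encoder(sem)
        return self.cls(sem), sem                          # refined logits + semantic features


# Usage with dummy visual-branch logits for a 25-character sequence.
visual_logits = torch.randn(2, 25, 97)
refined_logits, semantic_feats = ParallelSemanticReasoner()(visual_logits)
print(refined_logits.shape, semantic_feats.shape)  # (2, 25, 97) and (2, 25, 512)
```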