Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition
- URL: http://arxiv.org/abs/2111.12351v1
- Date: Wed, 24 Nov 2021 09:14:23 GMT
- Title: Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition
- Authors: Changxu Cheng, Bohan Li, Qi Zheng, Yongpan Wang, Wenyu Liu
- Abstract summary: We propose a novel Visual-Semantic Decoupling Network (VSDN) to address the problem.
Our VSDN contains a Visual Decoder (VD) and a Semantic Decoder (SD) to learn purer visual and semantic feature representations, respectively.
Our method achieves state-of-the-art or competitive results on the standard benchmarks.
- Score: 32.012689511969604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic information has proven effective in scene text recognition.
Most existing methods couple visual and semantic information in an
attention-based decoder. As a result, the learning of semantic features is
prone to bias toward the limited vocabulary of the training set, a problem
known as vocabulary reliance. In this paper, we propose a novel Visual-Semantic
Decoupling Network (VSDN) to address the problem. Our VSDN contains a Visual
Decoder (VD) and a Semantic Decoder (SD) to learn purer visual and semantic
feature representations, respectively. In addition, a Semantic Encoder (SE) is
designed to match the SD; the two can be pre-trained jointly on an additional,
inexpensive large vocabulary via a simple word-correction task. The resulting
semantic feature is therefore less biased and more precise, guiding the visual
feature alignment and enriching the final character representation. Experiments
show that our method achieves state-of-the-art or competitive results on the
standard benchmarks and outperforms the popular baseline by a large margin when
the training set has a small vocabulary.
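Below is a minimal sketch of the decoupling idea as we read it from the abstract: a Visual Decoder attends over encoder features, a Semantic Decoder models character semantics on top, and the two streams are fused per character step. All module choices, names, and shapes here are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (an assumption, not the authors' code) of decoupled
# visual/semantic decoding: a Visual Decoder (VD) attends over encoder
# features, a Semantic Decoder (SD) stand-in models character semantics on
# top, and the two streams are fused per character step.
import torch
import torch.nn as nn

class VSDNSketch(nn.Module):
    def __init__(self, feat_dim=256, num_classes=37, max_len=25):
        super().__init__()
        # VD: learned per-step queries attend over 2D visual features.
        self.visual_query = nn.Embedding(max_len, feat_dim)
        self.visual_attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                                 batch_first=True)
        # SD stand-in: a GRU over the per-step features models semantics.
        self.semantic_rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Fuse the two "purer" streams into the final character feature.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, visual_feats):  # visual_feats: (B, H*W, D)
        b = visual_feats.size(0)
        q = self.visual_query.weight.unsqueeze(0).expand(b, -1, -1)
        vis, _ = self.visual_attn(q, visual_feats, visual_feats)  # (B, T, D)
        sem, _ = self.semantic_rnn(vis)                           # (B, T, D)
        fused = torch.tanh(self.fuse(torch.cat([vis, sem], dim=-1)))
        return self.classifier(fused)                # (B, T, num_classes)

logits = VSDNSketch()(torch.randn(2, 64, 256))
print(logits.shape)  # torch.Size([2, 25, 37])
```

In the paper's design the SD is additionally paired with the pre-trainable SE; here the fusion simply concatenates the two streams, which is one plausible reading of "enrich the final character representation".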
Related papers
- FILS: Self-Supervised Video Feature Prediction In Semantic Language Space [11.641926922266347]
This paper demonstrates a self-supervised approach for learning semantic video representations.
We present FILS, a novel self-supervised video Feature prediction In semantic Language Space.
arXiv Detail & Related papers (2024-06-05T16:44:06Z)
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT).
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, visual tokens with low semantic-visual correspondence are fused to discard semantic-unrelated visual information for visual enhancement (sketched after this entry).
arXiv Detail & Related papers (2024-04-11T12:59:38Z)
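The token-fusion step in the ZSLViT summary above can be pictured roughly as follows. This is a hedged reconstruction from the one-line summary alone: the scoring rule, the top-k cutoff, and mean-pooling the discarded tokens are all assumptions.

```python
# Hedged sketch of "fuse low-correspondence tokens": visual tokens weakly
# aligned with a semantic embedding are merged into one summary token. The
# alignment score, top-k keep rule, and mean fusion are assumptions.
import torch

def fuse_low_correspondence_tokens(tokens, semantic, keep=16):
    # tokens: (B, N, D) visual tokens; semantic: (B, D) semantic embedding
    scores = torch.einsum('bnd,bd->bn', tokens, semantic)   # alignment score
    idx = scores.topk(keep, dim=1).indices                  # top-k kept
    keep_mask = torch.zeros_like(scores, dtype=torch.bool)
    keep_mask.scatter_(1, idx, True)
    kept = tokens[keep_mask].view(tokens.size(0), keep, -1)  # (B, keep, D)
    # Merge the discarded (semantic-unrelated) tokens into one token.
    rest = tokens.masked_fill(keep_mask.unsqueeze(-1), 0.0)
    n_rest = (~keep_mask).sum(1, keepdim=True).clamp(min=1)  # (B, 1)
    fused = rest.sum(1) / n_rest                             # (B, D)
    return torch.cat([kept, fused.unsqueeze(1)], dim=1)      # (B, keep+1, D)

out = fuse_low_correspondence_tokens(torch.randn(2, 196, 64),
                                     torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 17, 64])
```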
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework that exploits fine-grained information for zero-shot vision-language learning.
Our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Scene Text Recognition with Image-Text Matching-guided Dictionary [17.073688809336456]
We propose a new dictionary language model leveraging the Scene Image-Text Matching(SITM) network.
Inspired by ITC, the SITM network combines the visual features and the text features of all candidates to identify the candidate with the minimum distance in the feature space.
Our lexicon method achieves better results (93.8% accuracy) than the ordinary method (92.1% accuracy) on six mainstream benchmarks (the selection rule is sketched after this entry).
arXiv Detail & Related papers (2023-05-08T07:47:49Z)
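A rough sketch of the minimum-distance candidate selection described in the SITM summary above. The encoders are stand-ins (assumptions); only the selection rule itself follows the summary.

```python
# Sketch of minimum-distance candidate selection: given an image embedding
# and embeddings of candidate dictionary words, pick the candidate closest
# in feature space. How the embeddings are produced is left out here.
import torch

def pick_candidate(image_feat, cand_feats, candidates):
    # image_feat: (D,); cand_feats: (K, D); candidates: list of K strings
    d = torch.cdist(image_feat.unsqueeze(0), cand_feats).squeeze(0)  # (K,)
    return candidates[int(d.argmin())]

cands = ["house", "horse", "mouse"]
print(pick_candidate(torch.randn(64), torch.randn(3, 64), cands))
```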
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Visual-Semantic Contrastive Alignment for Few-Shot Image Classification [1.109560166867076]
Few-Shot learning aims to train a model that can adapt to unseen visual classes with only a few labeled examples.
We introduce a contrastive alignment mechanism for visual and semantic feature vectors to learn much more generalized visual concepts.
Our method simply adds an auxiliary contrastive learning objective that captures the contextual knowledge of a visual category (sketched after this entry).
arXiv Detail & Related papers (2022-10-20T03:59:40Z)
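One common form such an auxiliary contrastive alignment objective could take is sketched below; the InfoNCE formulation and the temperature value are assumptions, not details from the paper.

```python
# Sketch of an auxiliary visual-semantic contrastive objective: pull each
# visual feature toward its class's semantic vector and away from other
# classes' vectors. The InfoNCE form and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual, semantic, labels, tau=0.07):
    # visual: (B, D) image features; semantic: (C, D) per-class text features
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = v @ s.t() / tau                 # (B, C) cosine similarities
    return F.cross_entropy(logits, labels)   # positive = own class vector

loss = contrastive_alignment_loss(torch.randn(8, 32), torch.randn(5, 32),
                                  torch.randint(0, 5, (8,)))
print(loss.item())
```

As an auxiliary objective it would be added to the usual classification loss, e.g. `total = ce + lam * aux`; the weighting is likewise an assumption.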
- MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning [28.330268557106912]
The key challenge of zero-shot learning (ZSL) is how to infer the latent semantic knowledge between visual and attribute features on seen classes.
We propose a Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features.
arXiv Detail & Related papers (2022-03-07T05:27:08Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN improves speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation [56.830395467247016]
We propose a model of semantic memory for WSD in a meta-learning setting.
Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork.
We show that our model advances the state of the art in few-shot WSD and supports effective learning in extremely data-scarce scenarios.
arXiv Detail & Related papers (2021-06-05T20:40:01Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a sketch follows this entry).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
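The dual-encoder contrastive alignment summarized above is typically a symmetric cross-entropy over a batch similarity matrix, sketched below; the encoders are omitted and the temperature value is an assumption.

```python
# Sketch of dual-encoder contrastive alignment: matched image/text pairs sit
# on the diagonal of the batch similarity matrix, and a symmetric
# cross-entropy pulls them together while pushing mismatched pairs apart.
import torch
import torch.nn.functional as F

def dual_encoder_contrastive(img_emb, txt_emb, tau=0.05):
    # img_emb, txt_emb: (B, D) embeddings of B aligned image-text pairs
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau                        # (B, B) similarities
    targets = torch.arange(sim.size(0))              # diagonal = positives
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))

print(dual_encoder_contrastive(torch.randn(4, 128),
                               torch.randn(4, 128)).item())
```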
This list is automatically generated from the titles and abstracts of the papers on this site.