Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition
- URL: http://arxiv.org/abs/2111.12351v1
- Date: Wed, 24 Nov 2021 09:14:23 GMT
- Title: Decoupling Visual-Semantic Feature Learning for Robust Scene Text Recognition
- Authors: Changxu Cheng, Bohan Li, Qi Zheng, Yongpan Wang, Wenyu Liu
- Abstract summary: We propose a novel Visual-Semantic Decoupling Network (VSDN) to address the problem.
Our VSDN contains a Visual Decoder (VD) and a Semantic Decoder (SD) to learn purer visual and semantic feature representations, respectively.
Our method achieves state-of-the-art or competitive results on the standard benchmarks.
- Score: 32.012689511969604
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic information has proven effective in scene text recognition.
Most existing methods couple visual and semantic information in an
attention-based decoder. As a result, the learning of semantic features is
prone to bias toward the limited vocabulary of the training set, a problem
known as vocabulary reliance. In this paper, we propose a novel Visual-Semantic
Decoupling Network (VSDN) to address the problem. Our VSDN contains a Visual
Decoder (VD) and a Semantic Decoder (SD) to learn purer visual and semantic
feature representations, respectively. In addition, a Semantic Encoder (SE) is
designed to match the SD; the two can be pre-trained jointly on an additional,
inexpensive large vocabulary via a simple word-correction task. The resulting
semantic feature is therefore less biased and more precise, guiding the visual
feature alignment and enriching the final character representation. Experiments
show that our method achieves state-of-the-art or competitive results on the
standard benchmarks and outperforms the popular baseline by a large margin when
the training set has a small vocabulary.
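Below is a minimal sketch of the decoupling idea as we read it from the abstract: a Visual Decoder attends over encoder features, a Semantic Decoder models character semantics on top, and the two streams are fused per character step. All module choices, names, and shapes here are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (an assumption, not the authors' code) of decoupled
# visual/semantic decoding: a Visual Decoder (VD) attends over encoder
# features, a Semantic Decoder (SD) stand-in models character semantics on
# top, and the two streams are fused per character step.
import torch
import torch.nn as nn

class VSDNSketch(nn.Module):
    def __init__(self, feat_dim=256, num_classes=37, max_len=25):
        super().__init__()
        # VD: learned per-step queries attend over 2D visual features.
        self.visual_query = nn.Embedding(max_len, feat_dim)
        self.visual_attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                                 batch_first=True)
        # SD stand-in: a GRU over the per-step features models semantics.
        self.semantic_rnn = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Fuse the two "purer" streams into the final character feature.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, visual_feats):  # visual_feats: (B, H*W, D)
        b = visual_feats.size(0)
        q = self.visual_query.weight.unsqueeze(0).expand(b, -1, -1)
        vis, _ = self.visual_attn(q, visual_feats, visual_feats)  # (B, T, D)
        sem, _ = self.semantic_rnn(vis)                           # (B, T, D)
        fused = torch.tanh(self.fuse(torch.cat([vis, sem], dim=-1)))
        return self.classifier(fused)                # (B, T, num_classes)

logits = VSDNSketch()(torch.randn(2, 64, 256))
print(logits.shape)  # torch.Size([2, 25, 37])
```

In the paper's design the SD is additionally paired with the pre-trainable SE; here the fusion simply concatenates the two streams, which is one plausible reading of "enrich the final character representation".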
Related papers
- FILS: Self-Supervised Video Feature Prediction In Semantic Language Space [11.641926922266347]
This paper demonstrates a self-supervised approach for learning semantic video representations.
We present FILS, a novel self-supervised video Feature prediction In semantic Language Space.
arXiv Detail & Related papers (2024-06-05T16:44:06Z)
- Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT).
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, visual tokens with low semantic-visual correspondence are fused to discard semantic-unrelated visual information for visual enhancement (sketched after this entry).
arXiv Detail & Related papers (2024-04-11T12:59:38Z)
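The token-fusion step in the ZSLViT summary above can be pictured roughly as follows. This is a hedged reconstruction from the one-line summary alone: the scoring rule, the top-k cutoff, and mean-pooling the discarded tokens are all assumptions.

```python
# Hedged sketch of "fuse low-correspondence tokens": visual tokens weakly
# aligned with a semantic embedding are merged into one summary token. The
# alignment score, top-k keep rule, and mean fusion are assumptions.
import torch

def fuse_low_correspondence_tokens(tokens, semantic, keep=16):
    # tokens: (B, N, D) visual tokens; semantic: (B, D) semantic embedding
    scores = torch.einsum('bnd,bd->bn', tokens, semantic)   # alignment score
    idx = scores.topk(keep, dim=1).indices                  # top-k kept
    keep_mask = torch.zeros_like(scores, dtype=torch.bool)
    keep_mask.scatter_(1, idx, True)
    kept = tokens[keep_mask].view(tokens.size(0), keep, -1)  # (B, keep, D)
    # Merge the discarded (semantic-unrelated) tokens into one token.
    rest = tokens.masked_fill(keep_mask.unsqueeze(-1), 0.0)
    n_rest = (~keep_mask).sum(1, keepdim=True).clamp(min=1)  # (B, 1)
    fused = rest.sum(1) / n_rest                             # (B, D)
    return torch.cat([kept, fused.unsqueeze(1)], dim=1)      # (B, keep+1, D)

out = fuse_low_correspondence_tokens(torch.randn(2, 196, 64),
                                     torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 17, 64])
```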
- UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [84.83494254263138]
We propose a unified framework that exploits fine-grained information for zero-shot vision-language learning.
Our framework outperforms previous zero-shot methods on VQA and achieves substantial improvements on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z)
- Scene Text Recognition with Image-Text Matching-guided Dictionary [17.073688809336456]
We propose a new dictionary language model leveraging the Scene Image-Text Matching(SITM) network.
Inspired by ITC, the SITM network combines the visual features and the text features of all candidates to identify the candidate with the minimum distance in the feature space.
Our lexicon method achieves better results (93.8% accuracy) than the ordinary method (92.1% accuracy) on six mainstream benchmarks (the selection rule is sketched after this entry).
arXiv Detail & Related papers (2023-05-08T07:47:49Z)
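A rough sketch of the minimum-distance candidate selection described in the SITM summary above. The encoders are stand-ins (assumptions); only the selection rule itself follows the summary.

```python
# Sketch of minimum-distance candidate selection: given an image embedding
# and embeddings of candidate dictionary words, pick the candidate closest
# in feature space. How the embeddings are produced is left out here.
import torch

def pick_candidate(image_feat, cand_feats, candidates):
    # image_feat: (D,); cand_feats: (K, D); candidates: list of K strings
    d = torch.cdist(image_feat.unsqueeze(0), cand_feats).squeeze(0)  # (K,)
    return candidates[int(d.argmin())]

cands = ["house", "horse", "mouse"]
print(pick_candidate(torch.randn(64), torch.randn(3, 64), cands))
```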
- Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z)
- Visual-Semantic Contrastive Alignment for Few-Shot Image Classification [1.109560166867076]
Few-Shot learning aims to train a model that can adapt to unseen visual classes with only a few labeled examples.
We introduce a contrastive alignment mechanism for visual and semantic feature vectors to learn much more generalized visual concepts.
Our method simply adds an auxiliary contrastive learning objective that captures the contextual knowledge of a visual category (sketched after this entry).
arXiv Detail & Related papers (2022-10-20T03:59:40Z)
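One common form such an auxiliary contrastive alignment objective could take is sketched below; the InfoNCE formulation and the temperature value are assumptions, not details from the paper.

```python
# Sketch of an auxiliary visual-semantic contrastive objective: pull each
# visual feature toward its class's semantic vector and away from other
# classes' vectors. The InfoNCE form and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(visual, semantic, labels, tau=0.07):
    # visual: (B, D) image features; semantic: (C, D) per-class text features
    v = F.normalize(visual, dim=-1)
    s = F.normalize(semantic, dim=-1)
    logits = v @ s.t() / tau                 # (B, C) cosine similarities
    return F.cross_entropy(logits, labels)   # positive = own class vector

loss = contrastive_alignment_loss(torch.randn(8, 32), torch.randn(5, 32),
                                  torch.randint(0, 5, (8,)))
print(loss.item())
```

As an auxiliary objective it would be added to the usual classification loss, e.g. `total = ce + lam * aux`; the weighting is likewise an assumption.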
- MSDN: Mutually Semantic Distillation Network for Zero-Shot Learning [28.330268557106912]
The key challenge of zero-shot learning (ZSL) is how to infer the latent semantic knowledge between visual and attribute features on seen classes.
We propose a Mutually Semantic Distillation Network (MSDN), which progressively distills the intrinsic semantic representations between visual and attribute features.
arXiv Detail & Related papers (2022-03-07T05:27:08Z)
- From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network [70.47504933083218]
We propose a Visual Language Modeling Network (VisionLAN), which views the visual and linguistic information as a union.
VisionLAN improves speed by 39% and adaptively considers the linguistic information to enhance the visual features for accurate recognition.
arXiv Detail & Related papers (2021-08-22T07:56:24Z)
- Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation [56.830395467247016]
We propose a model of semantic memory for WSD in a meta-learning setting.
Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork.
We show that our model advances the state of the art in few-shot WSD and supports effective learning in extremely data-scarce scenarios.
arXiv Detail & Related papers (2021-06-05T20:40:01Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a sketch follows this entry).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
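The dual-encoder contrastive alignment summarized above is typically a symmetric cross-entropy over a batch similarity matrix, sketched below; the encoders are omitted and the temperature value is an assumption.

```python
# Sketch of dual-encoder contrastive alignment: matched image/text pairs sit
# on the diagonal of the batch similarity matrix, and a symmetric
# cross-entropy pulls them together while pushing mismatched pairs apart.
import torch
import torch.nn.functional as F

def dual_encoder_contrastive(img_emb, txt_emb, tau=0.05):
    # img_emb, txt_emb: (B, D) embeddings of B aligned image-text pairs
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t() / tau                        # (B, B) similarities
    targets = torch.arange(sim.size(0))              # diagonal = positives
    return 0.5 * (F.cross_entropy(sim, targets) +
                  F.cross_entropy(sim.t(), targets))

print(dual_encoder_contrastive(torch.randn(4, 128),
                               torch.randn(4, 128)).item())
```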
This list is automatically generated from the titles and abstracts of the papers on this site.