Scene Text Recognition with Image-Text Matching-guided Dictionary
- URL: http://arxiv.org/abs/2305.04524v1
- Date: Mon, 8 May 2023 07:47:49 GMT
- Title: Scene Text Recognition with Image-Text Matching-guided Dictionary
- Authors: Jiajun Wei, Hongjian Zhan, Xiao Tu, Yue Lu, and Umapada Pal
- Abstract summary: We propose a new dictionary language model leveraging the Scene Image-Text Matching (SITM) network.
Inspired by ITC, the SITM network combines the visual features and the text features of all candidates to identify the candidate with the minimum distance in the feature space.
Our lexicon method achieves better results (93.8% accuracy) than the ordinary method (92.1% accuracy) on six mainstream benchmarks.
- Score: 17.073688809336456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Employing a dictionary can efficiently rectify the deviation between the
visual prediction and the ground truth in scene text recognition methods.
However, because the dictionary is independent of the visual features, it may
incorrectly rectify accurate visual predictions. In this paper, we propose a new
dictionary language model leveraging the Scene Image-Text Matching (SITM)
network, which avoids the drawbacks of an explicit dictionary language model,
namely 1) its independence from the visual features and 2) noisy candidate
choices. The SITM network accomplishes this by using Image-Text
Contrastive (ITC) Learning to match an image with its corresponding text among
candidates in the inference stage. ITC is widely used in vision-language
learning to pull the positive image-text pair closer in feature space. Inspired
by ITC, the SITM network combines the visual features and the text features of
all candidates to identify the candidate with the minimum distance in the
feature space. Our lexicon method achieves better results (93.8\% accuracy) than
the ordinary method (92.1\% accuracy) on six mainstream benchmarks.
Additionally, we integrate our method with ABINet and establish new
state-of-the-art results on several benchmarks.
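The abstract describes selecting, at inference time, the dictionary candidate whose text embedding lies closest to the image embedding in a shared feature space. The snippet below is a minimal, illustrative sketch of that contrastive-matching idea, not the authors' SITM implementation; the random tensors stand in for the outputs of hypothetical visual and text encoders.

```python
# Illustrative sketch of ITC-style candidate selection (not the paper's SITM code).
# The feature tensors here are placeholders for real encoder outputs.
import torch
import torch.nn.functional as F


def pick_candidate(image_feat: torch.Tensor,
                   candidate_feats: torch.Tensor,
                   candidates: list[str]) -> str:
    """Return the candidate whose embedding is closest to the image embedding.

    image_feat:      (d,)   visual embedding of the scene-text image
    candidate_feats: (n, d) text embeddings of the n dictionary candidates
    """
    img = F.normalize(image_feat, dim=-1)        # unit-length image vector
    txt = F.normalize(candidate_feats, dim=-1)   # unit-length text vectors
    cosine_sim = txt @ img                       # (n,) similarity to the image
    distance = 1.0 - cosine_sim                  # smaller distance = better match
    return candidates[int(torch.argmin(distance))]


# Toy usage with random features standing in for real encoder outputs.
if __name__ == "__main__":
    torch.manual_seed(0)
    cands = ["house", "horse", "hoube"]
    img_f = torch.randn(256)
    cand_f = torch.randn(len(cands), 256)
    print(pick_candidate(img_f, cand_f, cands))
```

Because the text candidates are scored against the visual features rather than in isolation, a candidate that contradicts an already-accurate visual prediction receives a larger distance and is less likely to be chosen.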
Related papers
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z) - SILC: Improving Vision Language Pretraining with Self-Distillation [113.50400246862056]
We introduce SILC, a novel framework for vision language pretraining.
SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation.
We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation.
arXiv Detail & Related papers (2023-10-20T08:44:47Z) - Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves the F-score by +2.5% and +4.8% when its weights are transferred to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Text-based Person Search in Full Images via Semantic-Driven Proposal
Generation [42.25611020956918]
We propose a new end-to-end learning framework which jointly optimizes the pedestrian detection, identification, and visual-semantic feature embedding tasks.
To take full advantage of the query text, the semantic features are leveraged to instruct the Region Proposal Network to pay more attention to the text-described proposals.
arXiv Detail & Related papers (2021-09-27T11:42:40Z) - Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z) - Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language
Representation Learning [31.895442072646254]
"See Out of tHe bOx" takes a whole image as input and learns vision-language representation in an end-to-end manner.
Soho achieves absolute gains of 2.0% R@1 score on MSCOCO text retrieval 5k test split, 1.5% accuracy on NLVR$2$ test-P split, 6.7% accuracy on SNLI-VE test split.
arXiv Detail & Related papers (2021-04-07T14:07:20Z) - Scaling Up Visual and Vision-Language Representation Learning With Noisy
Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z) - Catching Out-of-Context Misinformation with Self-supervised Learning [2.435006380732194]
We propose a new method that automatically detects out-of-context image and text pairs.
Our core idea is a self-supervised training strategy where we only need images with matching captions from different sources.
Our method achieves 82% out-of-context detection accuracy.
arXiv Detail & Related papers (2021-01-15T19:00:42Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions when faced with semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss; a generic sketch of such a loss appears after this list.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)