Knowledge Mining with Scene Text for Fine-Grained Recognition
- URL: http://arxiv.org/abs/2203.14215v1
- Date: Sun, 27 Mar 2022 05:54:00 GMT
- Title: Knowledge Mining with Scene Text for Fine-Grained Recognition
- Authors: Hao Wang, Junchao Liao, Tianheng Cheng, Zewen Gao, Hao Liu, Bo Ren,
Xiang Bai, Wenyu Liu
- Abstract summary: We propose an end-to-end trainable network that mines implicit contextual knowledge behind scene text image.
We employ KnowBert to retrieve relevant knowledge for semantic representation and combine it with image features for fine-grained classification.
Our method outperforms the state-of-the-art by 3.72% mAP and 5.39% mAP on the Con-Text and Drink Bottle benchmarks, respectively.
- Score: 53.74297368412834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the semantics of scene text has been proven to be essential in
fine-grained image classification. However, the existing methods mainly exploit
the literal meaning of scene text for fine-grained recognition, which might be
irrelevant when it is not significantly related to objects/scenes. We propose
an end-to-end trainable network that mines the implicit contextual knowledge behind
scene text images and enhances their semantics and correlation to fine-tune the
image representation. Unlike the existing methods, our model integrates three
modalities: visual feature extraction, text semantics extraction, and
correlating background knowledge to fine-grained image classification.
Specifically, we employ KnowBert to retrieve relevant knowledge for semantic
representation and combine it with image features for fine-grained
classification. Experiments on two benchmark datasets, Con-Text and Drink
Bottle, show that our method outperforms the state-of-the-art by 3.72\% mAP and
5.39\% mAP, respectively. To further validate the effectiveness of the proposed
method, we create a new dataset on crowd activity recognition for the
evaluation. The source code and new dataset of this work are available at
https://github.com/lanfeng4659/KnowledgeMiningWithSceneText.
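As a rough illustration of the fusion step the abstract describes — combining a knowledge-enhanced text embedding (e.g. from KnowBert) with a visual feature for fine-grained classification — a minimal late-fusion sketch might look like the following. All dimensions, the concatenation-based fusion, and the linear classifier are assumptions for illustration; the paper's actual network is more elaborate.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_classify(visual_feat, knowledge_feat, W, b):
    """Concatenate a visual feature with a knowledge-enhanced text
    embedding, then apply a linear classifier with a softmax."""
    fused = np.concatenate([visual_feat, knowledge_feat])
    logits = W @ fused + b
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy sizes: a 2048-d visual feature, a 768-d knowledge embedding,
# and 28 fine-grained classes (all hypothetical numbers).
visual = rng.standard_normal(2048)
knowledge = rng.standard_normal(768)
W = rng.standard_normal((28, 2048 + 768)) * 0.01
b = np.zeros(28)

probs = fuse_and_classify(visual, knowledge, W, b)
```

The returned vector is a probability distribution over the fine-grained classes; in the real model the classifier and fusion would be trained end to end.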
Related papers
- Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing [47.421888361871254]
Scene text images contain not only style information (font, background) but also content information (character, texture)
Previous representation learning methods use tightly coupled features for all tasks, resulting in sub-optimal performance.
We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability.
arXiv Detail & Related papers (2024-05-07T15:00:11Z) - Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
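The summary above describes replacing one-hot targets with text distributions derived from a large corpus. A minimal sketch of that idea — soft labels from smoothed corpus frequencies — is shown below; the vocabulary, smoothing scheme, and example words are assumptions, not the paper's actual construction.

```python
from collections import Counter

def corpus_label_distribution(corpus_words, vocab, smoothing=1.0):
    """Build a soft target distribution over a vocabulary from corpus
    frequencies, as a stand-in for the one-hot targets used by
    auto-regressive recognizers. Laplace smoothing keeps every entry
    non-zero."""
    counts = Counter(w for w in corpus_words if w in vocab)
    total = sum(counts.values()) + smoothing * len(vocab)
    return {w: (counts.get(w, 0) + smoothing) / total for w in vocab}

# Hypothetical mini-corpus of scene-text words.
dist = corpus_label_distribution(
    ["stop", "exit", "stop", "sale"],
    vocab=["stop", "exit", "sale", "open"])
```

Frequent corpus words receive more probability mass than unseen ones, so the training target reflects linguistic priors rather than a single hard label.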
arXiv Detail & Related papers (2024-02-27T01:57:09Z) - TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features with no need of additional data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments substantiate the average 5.2% improvement of our framework over existing alternatives.
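Since the parsed objects and attributes serve as classification targets, the alignment objective sketched above can be read as a multi-label classification over a tag vocabulary. A hedged illustration follows; the tag vocabulary, the parsing step, and the plain binary cross-entropy form are assumptions, not TagAlign's exact loss.

```python
import math

def multi_tag_bce(probs, target_tags, vocab):
    """Multi-label binary cross-entropy over a tag vocabulary.
    Tags parsed from the image's caption get label 1, all other
    vocabulary entries get label 0."""
    loss = 0.0
    for tag, p in zip(vocab, probs):
        y = 1.0 if tag in target_tags else 0.0
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(vocab)

# Hypothetical tag vocabulary and caption-derived targets.
vocab = ["dog", "red", "car", "grass"]
loss = multi_tag_bce([0.9, 0.1, 0.2, 0.8], {"dog", "grass"}, vocab)
```

Predictions that agree with the caption-derived tags drive the loss toward zero, supplying a dense supervision signal alongside the usual image-text contrastive objective.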
arXiv Detail & Related papers (2023-12-21T18:59:06Z) - Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z) - iCAR: Bridging Image Classification and Image-text Alignment for Visual
Recognition [33.2800417526215]
Image classification, which classifies images by pre-defined categories, has been the dominant approach to visual representation learning over the last decade.
Visual learning through image-text alignment, however, has emerged to show promising performance, especially for zero-shot recognition.
We propose a deep fusion method with three adaptations that effectively bridge two learning tasks.
arXiv Detail & Related papers (2022-04-22T15:27:21Z) - Fine-Grained Visual Entailment [51.66881737644983]
We propose an extension of this task, where the goal is to predict the logical relationship of fine-grained knowledge elements within a piece of text to an image.
Unlike prior work, our method is inherently explainable and makes logical predictions at different levels of granularity.
We evaluate our method on a new dataset of manually annotated knowledge elements and show that our method achieves 68.18% accuracy at this challenging task.
arXiv Detail & Related papers (2022-03-29T16:09:38Z) - Is An Image Worth Five Sentences? A New Look into Semantics for
Image-Text Matching [10.992151305603267]
We propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
We incorporate a novel strategy that uses an image captioning metric, CIDEr, to define a Semantic Adaptive Margin (SAM) to be optimized in a standard triplet loss.
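One way to read the Semantic Adaptive Margin idea is a triplet loss whose margin shrinks when the "negative" item is semantically close to the query. The sketch below uses a generic caption-similarity score in place of CIDEr and a linear scaling that is purely an assumption; the paper's exact formulation may differ.

```python
def sam_triplet_loss(sim_pos, sim_neg, caption_similarity,
                     base_margin=0.2, scale=0.2):
    """Triplet loss with a semantically adaptive margin: a negative
    whose caption is close to the image (high metric score, e.g.
    CIDEr) incurs a smaller margin; an unrelated one, a larger margin.
    The linear form `base + scale * (1 - sim)` is illustrative."""
    margin = base_margin + scale * (1.0 - caption_similarity)
    return max(0.0, sim_neg - sim_pos + margin)

# A negative with an unrelated caption is pushed away harder than one
# whose caption nearly describes the image.
loss_far  = sam_triplet_loss(0.8, 0.5, caption_similarity=0.0)
loss_near = sam_triplet_loss(0.8, 0.5, caption_similarity=0.9)
```

This makes the penalty proportional to how wrong a retrieval actually is, rather than treating all non-annotated items as equally irrelevant.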
arXiv Detail & Related papers (2021-10-06T09:54:28Z) - Context-Aware Image Inpainting with Learned Semantic Priors [100.99543516733341]
We introduce pretext tasks that are semantically meaningful to estimating the missing contents.
We propose a context-aware image inpainting model, which adaptively integrates global semantics and local features.
arXiv Detail & Related papers (2021-06-14T08:09:43Z) - Text-Guided Neural Image Inpainting [20.551488941041256]
Inpainting task requires filling the corrupted image with contents coherent with the context.
The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text.
We propose a novel inpainting model named Text-Guided Dual Attention Inpainting Network (TDANet)
arXiv Detail & Related papers (2020-04-07T09:04:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.