Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image
Classification and Retrieval
- URL: http://arxiv.org/abs/2009.09809v1
- Date: Mon, 21 Sep 2020 12:31:42 GMT
- Title: Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image
Classification and Retrieval
- Authors: Andres Mafla, Sounak Dey, Ali Furkan Biten, Lluis Gomez and
Dimosthenis Karatzas
- Abstract summary: This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
- Score: 8.317191999275536
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene text instances found in natural images carry explicit semantic
information that can provide important cues to solve a wide array of computer
vision problems. In this paper, we focus on leveraging multi-modal content in
the form of visual and textual cues to tackle the task of fine-grained image
classification and retrieval. First, we obtain the text instances from images
by employing a text reading system. Then, we combine textual features with
salient image regions to exploit the complementary information carried by the
two sources. Specifically, we employ a Graph Convolutional Network to perform
multi-modal reasoning and obtain relationship-enhanced features by learning a
common semantic space between salient objects and text found in an image. By
obtaining an enhanced set of visual and textual features, the proposed model
greatly outperforms the previous state-of-the-art in two different tasks,
fine-grained classification and image retrieval in the Con-Text and Drink
Bottle datasets.
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning data, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z) - Unleashing the Imagination of Text: A Novel Framework for Text-to-image
Person Retrieval via Exploring the Power of Words [0.951828574518325]
We propose a novel framework to explore the power of words in sentences.
The framework employs the pre-trained full CLIP model as a dual encoder for the images and texts.
We introduce a cross-modal triplet loss tailored for handling hard samples, enhancing the model's ability to distinguish minor differences.
arXiv Detail & Related papers (2023-07-18T08:23:46Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Language Matters: A Weakly Supervised Pre-training Approach for Scene
Text Detection and Spotting [69.77701325270047]
This paper presents a weakly supervised pre-training method that can acquire effective scene text representations.
Our network consists of an image encoder and a character-aware text encoder that extract visual and textual features.
Experiments show that our pre-trained model improves F-score by +2.5% and +4.8% while transferring its weights to other text detection and spotting networks.
arXiv Detail & Related papers (2022-03-08T08:10:45Z) - Matching Visual Features to Hierarchical Semantic Topics for Image
Paragraph Captioning [50.08729005865331]
This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework.
To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network.
To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
arXiv Detail & Related papers (2021-05-10T06:55:39Z) - Learning Multimodal Affinities for Textual Editing in Images [18.7418059568887]
We devise a generic unsupervised technique to learn multimodal affinities between textual entities in a document-image.
We then use these learned affinities to automatically cluster the textual entities in the image into different semantic groups.
We show that our technique can operate on highly varying images spanning a wide range of documents and demonstrate its applicability for various editing operations.
arXiv Detail & Related papers (2021-03-18T10:09:57Z) - VICTR: Visual Information Captured Text Representation for Text-to-Image
Multimodal Tasks [5.840117063192334]
We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input.
We train the extracted objects, attributes, and relations in the scene graph and the corresponding geometric relation information using Graph Convolutional Networks.
The text representation is aggregated with word-level and sentence-level embedding to generate both visual contextual word and sentence representation.
arXiv Detail & Related papers (2020-10-07T05:25:30Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Fine-grained Image Classification and Retrieval by Combining Visual and
Locally Pooled Textual Features [8.317191999275536]
In particular, the mere presence of text provides strong guiding content that should be employed to tackle a diversity of computer vision tasks.
In this paper, we address the problem of fine-grained classification and image retrieval by leveraging textual information along with visual cues to comprehend the existing intrinsic relation between the two modalities.
arXiv Detail & Related papers (2020-01-14T12:06:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.