Scene Graph Based Fusion Network For Image-Text Retrieval
- URL: http://arxiv.org/abs/2303.11090v1
- Date: Mon, 20 Mar 2023 13:22:56 GMT
- Title: Scene Graph Based Fusion Network For Image-Text Retrieval
- Authors: Guoliang Wang, Yanlei Shang and Yong Chen
- Abstract summary: A critical challenge in image-text retrieval is learning accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) that enhances image/text features through intra- and cross-modal fusion.
Our SGFN outperforms several SOTA image-text retrieval methods.
- Score: 2.962083552798791
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A critical challenge in image-text retrieval is how to learn accurate
correspondences between images and texts. Most existing methods mainly focus on
coarse-grained correspondences based on co-occurrences of semantic objects,
while failing to distinguish fine-grained local correspondences. In this
paper, we propose a novel Scene Graph based Fusion Network (dubbed SGFN), which
enhances image/text features through intra- and cross-modal fusion for
image-text retrieval. Specifically, we design an intra-modal hierarchical
attention fusion to incorporate semantic contexts, such as objects, attributes,
and relationships, into image/text feature vectors via scene graphs, and a
cross-modal attention fusion to combine the contextual semantics and local
fusion via contextual vectors. Extensive experiments on the public datasets
Flickr30K and MSCOCO show that our SGFN outperforms several SOTA
image-text retrieval methods.
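To make the two fusion stages concrete, here is a minimal PyTorch-style sketch. It is a hypothetical simplification rather than the authors' implementation: it assumes pre-extracted region/word features and scene-graph node embeddings of a common dimension, and it uses plain scaled dot-product attention for both the intra-modal and cross-modal steps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntraModalFusion(nn.Module):
    """Toy intra-modal attention fusion: each region/word feature attends over
    scene-graph context nodes (objects, attributes, relationships)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, feats, context):
        # feats: (n, d) local features; context: (m, d) scene-graph node embeddings
        attn = F.softmax(self.q(feats) @ self.k(context).T / feats.size(-1) ** 0.5, dim=-1)
        return feats + attn @ self.v(context)  # residual context enhancement

def cross_modal_fusion(img_feats, txt_feats, temperature=0.1):
    """Toy cross-modal attention fusion: image features attend over word features."""
    sim = F.normalize(img_feats, dim=-1) @ F.normalize(txt_feats, dim=-1).T
    attn = F.softmax(sim / temperature, dim=-1)
    return img_feats + attn @ txt_feats  # contextual vectors folded back in

img = torch.randn(36, 256)  # 36 detected regions
ctx = torch.randn(20, 256)  # scene-graph nodes for the image
txt = torch.randn(12, 256)  # 12 word embeddings
fused = cross_modal_fusion(IntraModalFusion(256)(img, ctx), txt)
print(fused.shape)  # torch.Size([36, 256])
```

A retrieval score would then typically be the cosine similarity between pooled fused image features and pooled fused text features.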
Related papers
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding [91.30065932213758]
Large Multimodal Models (LMMs) have sparked a surge in research aimed at harnessing their remarkable reasoning abilities.
We propose TextCoT, a novel Chain-of-Thought framework for text-rich image understanding.
Our method is free of extra training, offering immediate plug-and-play functionality.
arXiv Detail & Related papers (2024-04-15T13:54:35Z)
- Image Fusion via Vision-Language Model [91.36809431547128]
We introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM)
FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions.
These descriptions are fused within the textual domain and guide the visual information fusion.
FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion.
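Read as a pipeline, the description reduces to three steps: caption each source image, merge the captions in the textual domain, and condition the pixel-level fusion on the merged text. The Python stub below is purely illustrative; `describe`, `embed_text`, and `fuse_pixels` are hypothetical stand-ins for a vision-language model call, a text encoder, and a fusion network, not FILM's actual API.

```python
from typing import Callable

def film_style_fusion(img_a, img_b,
                      describe: Callable,     # stand-in for a VLM/ChatGPT call
                      embed_text: Callable,   # stand-in for a text encoder
                      fuse_pixels: Callable): # stand-in for a fusion network
    # 1. Generate a semantic description for each source image.
    desc_a, desc_b = describe(img_a), describe(img_b)
    # 2. Fuse in the textual domain (here: naive concatenation).
    merged = desc_a + " " + desc_b
    # 3. Let the merged text guide the visual fusion.
    return fuse_pixels(img_a, img_b, guidance=embed_text(merged))

# toy run with trivial stand-ins
out = film_style_fusion("ir.png", "vis.png",
                        describe=lambda p: f"description of {p}",
                        embed_text=lambda s: [float(len(s))],
                        fuse_pixels=lambda a, b, guidance: (a, b, guidance))
print(out)
```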
arXiv Detail & Related papers (2024-02-03T18:36:39Z)
- TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion [38.61215361212626]
We propose a text-guided fusion paradigm for advanced image fusion.
We release a text-annotated image fusion dataset IVT.
Our approach consistently outperforms traditional appearance-based fusion methods.
arXiv Detail & Related papers (2023-12-21T09:25:10Z)
- Text-guided Image Restoration and Semantic Enhancement for Text-to-Image Person Retrieval [11.798006331912056]
The goal of Text-to-Image Person Retrieval (TIPR) is to retrieve specific person images according to the given textual descriptions.
We propose a novel TIPR framework to build fine-grained interactions and alignment between person images and the corresponding texts.
arXiv Detail & Related papers (2023-07-18T08:23:46Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
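The three similarity levels can be pictured as global, local, and relation-level scores combined into one matching score. The sketch below guesses at the general shape (max-over-matches pooling plus simple averaging) and is not HGAN's actual similarity functions; the feature dictionary layout is an assumption.

```python
import torch
import torch.nn.functional as F

def hierarchical_score(img, txt):
    """Toy three-level matching score. `img`/`txt` are dicts of hypothetical
    'global' (d,), 'local' (n, d), and 'relation' (r, d) features."""
    s_global = F.cosine_similarity(img["global"], txt["global"], dim=-1)
    # local level: best-matching word for each region, averaged over regions
    s_local = (F.normalize(img["local"], dim=-1) @
               F.normalize(txt["local"], dim=-1).T).max(dim=-1).values.mean()
    # relation level: same pooling over relationship features
    s_rel = (F.normalize(img["relation"], dim=-1) @
             F.normalize(txt["relation"], dim=-1).T).max(dim=-1).values.mean()
    return (s_global + s_local + s_rel) / 3  # equal-weight fusion of the levels

img = {"global": torch.randn(256), "local": torch.randn(36, 256),
       "relation": torch.randn(10, 256)}
txt = {"global": torch.randn(256), "local": torch.randn(12, 256),
       "relation": torch.randn(5, 256)}
print(hierarchical_score(img, txt))
```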
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Scene Text Retrieval via Joint Text Detection and Similarity Learning [68.24531728554892]
Scene text retrieval aims to localize and retrieve all text instances in an image gallery that are identical or similar to a given query text.
We address this problem by directly learning a cross-modal similarity between a query text and each text instance from natural images.
In this way, scene text retrieval can be simply performed by ranking the detected text instances with the learned similarity.
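That ranking step is easy to sketch: given per-instance embeddings for the detected scene text and an embedding of the query, score each gallery image by its best-matching instance. The function below is an illustrative assumption (cosine similarity plus max-pooling over instances), not the paper's exact scoring.

```python
import numpy as np

def rank_gallery(query_emb, instance_embs, image_ids):
    """Rank gallery images by their best-matching detected text instance.
    query_emb: (d,); instance_embs: (n, d); image_ids: one image id per instance."""
    q = query_emb / np.linalg.norm(query_emb)
    x = instance_embs / np.linalg.norm(instance_embs, axis=1, keepdims=True)
    sims = x @ q  # cosine stands in for the learned cross-modal similarity
    best = {}
    for img_id, s in zip(image_ids, sims):
        best[img_id] = max(best.get(img_id, float("-inf")), float(s))
    return sorted(best, key=best.get, reverse=True)

ranked = rank_gallery(np.random.randn(128),
                      np.random.randn(500, 128),
                      np.random.randint(0, 100, size=500))
print(ranked[:5])  # ids of the top-5 gallery images
```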
arXiv Detail & Related papers (2021-04-04T07:18:38Z)
- StacMR: Scene-Text Aware Cross-Modal Retrieval [19.54677614738065]
Cross-modal retrieval models have benefited from an increasingly rich understanding of visual scenes.
Current models overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval.
We propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
arXiv Detail & Related papers (2020-12-08T10:04:25Z)
- Multi-Modal Reasoning Graph for Scene-Text Based Fine-Grained Image Classification and Retrieval [8.317191999275536]
This paper focuses on leveraging multi-modal content in the form of visual and textual cues to tackle the task of fine-grained image classification and retrieval.
We employ a Graph Convolutional Network to perform multi-modal reasoning and obtain relationship-enhanced features by learning a common semantic space between salient objects and text found in an image.
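A single graph-convolution step over such a mixed object-and-text graph might look like the toy layer below; the adjacency construction, dimensions, and normalization are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: X' = ReLU(D^-1 (A + I) X W), where graph nodes
    mix salient-object features and scene-text embeddings in a common space."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim, bias=False)

    def forward(self, x, adj):
        adj = adj + torch.eye(adj.size(0))            # add self-loops
        deg_inv = 1.0 / adj.sum(dim=1, keepdim=True)  # row normalization
        return torch.relu(deg_inv * (adj @ self.w(x)))

objects = torch.randn(10, 300)            # salient-object features (projected)
words = torch.randn(4, 300)               # embeddings of text found in the image
x = torch.cat([objects, words], dim=0)    # one node set for both modalities
adj = (torch.rand(14, 14) > 0.5).float()  # toy connectivity
print(GCNLayer(300)(x, adj).shape)        # torch.Size([14, 300])
```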
arXiv Detail & Related papers (2020-09-21T12:31:42Z)
- DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis [80.54273334640285]
We propose a novel one-stage text-to-image backbone that directly synthesizes high-resolution images without entanglements between different generators.
We also propose a novel Target-Aware Discriminator composed of Matching-Aware Gradient Penalty and One-Way Output.
Compared with current state-of-the-art methods, our proposed DF-GAN is simpler yet more effective at synthesizing realistic, text-matching images.
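The Matching-Aware Gradient Penalty can be sketched as a gradient penalty evaluated at real images paired with their matching text, which smooths the discriminator's decision surface around true (image, text) pairs. The discriminator interface and the hyperparameters below are illustrative assumptions, not DF-GAN's exact settings.

```python
import torch

def matching_aware_gp(discriminator, real_imgs, match_text, k=2.0, p=6.0):
    """Penalize the discriminator's gradient norm at (real image, matching text)
    pairs; k and p are assumed hyperparameters."""
    real_imgs = real_imgs.detach().requires_grad_(True)
    match_text = match_text.detach().requires_grad_(True)
    out = discriminator(real_imgs, match_text)
    grads = torch.autograd.grad(out.sum(), (real_imgs, match_text),
                                create_graph=True)
    grad_norm = torch.cat([g.flatten(1) for g in grads], dim=1).norm(2, dim=1)
    return k * grad_norm.pow(p).mean()

# toy check with a differentiable stand-in "discriminator"
d = lambda img, txt: img.flatten(1).sum(1) + txt.sum(1)
print(matching_aware_gp(d, torch.randn(8, 3, 16, 16), torch.randn(8, 64)))
```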
arXiv Detail & Related papers (2020-08-13T12:51:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and accepts no responsibility for any consequences arising from its use.