Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching
- URL: http://arxiv.org/abs/2406.18579v1
- Date: Wed, 5 Jun 2024 13:10:55 GMT
- Title: Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching
- Authors: Xuri Ge, Fuhai Chen, Songpei Xu, Fuxiang Tao, Jie Wang, Joemon M. Jose,
- Abstract summary: Image-text matching (ITM) is a fundamental problem in computer vision.
We propose a Hybrid-modal feature the Interaction with multiple Enhancements (termed textitHire) for image-text matching.
In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects.
- Score: 7.7559623054251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text matching (ITM) is a fundamental problem in computer vision. The key issue lies in jointly learning the visual and textual representation to estimate their similarity accurately. Most existing methods focus on feature enhancement within modality or feature interaction across modalities, which, however, neglects the contextual information of the object representation based on the inter-object relationships that match the corresponding sentences with rich contextual semantics. In this paper, we propose a Hybrid-modal Interaction with multiple Relational Enhancements (termed \textit{Hire}) for image-text matching, which correlates the intra- and inter-modal semantics between objects and words with implicit and explicit relationship modelling. In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects with salient spatial and semantic relational connectivities, guided by the explicit relationships of the objects' spatial positions and their scene graph. We use implicit relationship modelling for potential relationship interactions before explicit modelling to improve the fault tolerance of explicit relationship detection. Then the visual and textual semantic representations are refined jointly via inter-modal interactive attention and cross-modal alignment. To correlate the context of objects with the textual context, we further refine the visual semantic representation via cross-level object-sentence and word-image-based interactive attention. Extensive experiments validate that the proposed hybrid-modal interaction with implicit and explicit modelling is more beneficial for image-text matching. And the proposed \textit{Hire} obtains new state-of-the-art results on MS-COCO and Flickr30K benchmarks.
Related papers
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph
Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z) - Dual Relation Alignment for Composed Image Retrieval [24.812654620141778]
We argue for the existence of two types of relations in composed image retrieval.
The explicit relation pertains to the reference image & complementary text-target image.
We propose a new framework for composed image retrieval, termed dual relation alignment.
arXiv Detail & Related papers (2023-09-05T12:16:14Z) - Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z) - Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image
Person Retrieval [29.884153827619915]
We present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework.
It learns relations between local visual-textual tokens and enhances global image-text matching.
The proposed method achieves new state-of-the-art results on all three public datasets.
arXiv Detail & Related papers (2023-03-22T12:11:59Z) - Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z) - Cross-modal Semantic Enhanced Interaction for Image-Sentence Retrieval [8.855547063009828]
We propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI for image-sentence retrieval.
We first design the intra- and inter-modal spatial and semantic graphs based reasoning to enhance the semantic representations of objects.
To correlate the context of objects with the textual context, we further refine the visual semantic representation via the cross-level object-sentence and word-image based interactive attention.
arXiv Detail & Related papers (2022-10-17T10:01:16Z) - BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid
Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task by a novel underlinetextbfBottom-up crunderlinetextbfOss-modal underlinetextbfSemantic compounderlinetextbfSition (textbfBOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z) - Cross-Modal Coherence for Text-to-Image Retrieval [35.82045187976062]
We train a Cross-Modal Coherence Modelfor text-to-image retrieval task.
Our analysis shows that models trained with image-text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models.
Our findings provide insights into the ways that different modalities communicate and the role of coherence relations in capturing commonsense inferences in text and imagery.
arXiv Detail & Related papers (2021-09-22T21:31:27Z) - Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z) - ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refines the object-object connections globally.
Experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNN)
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.