Dual Relation Alignment for Composed Image Retrieval
- URL: http://arxiv.org/abs/2309.02169v3
- Date: Wed, 31 Jan 2024 06:18:14 GMT
- Title: Dual Relation Alignment for Composed Image Retrieval
- Authors: Xintong Jiang, Yaxiong Wang, Yujiao Wu, Meng Wang, Xueming Qian
- Abstract summary: We argue for the existence of two types of relations in composed image retrieval.
The explicit relation pertains to the (reference image & complementary text)-target image pair.
We propose a new framework for composed image retrieval, termed dual relation alignment.
- Score: 24.812654620141778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composed image retrieval, a task involving the search for a target image
using a reference image and a complementary text as the query, has witnessed
significant advancements owing to the progress made in cross-modal modeling.
Unlike the general image-text retrieval problem with only one alignment
relation, i.e., image-text, we argue for the existence of two types of
relations in composed image retrieval. The explicit relation pertains to the
(reference image & complementary text)-target image pair, which is commonly
exploited by existing methods. Besides this intuitive relation, our
observations in practice have uncovered another implicit yet crucial relation,
i.e., the (reference image & target image)-complementary text pair, since we
found that the
complementary text can be inferred by studying the relation between the target
image and the reference image. Regrettably, existing methods largely focus on
leveraging the explicit relation to learn their networks, while overlooking the
implicit relation. In response to this weakness, we propose a new framework for
composed image retrieval, termed dual relation alignment, which integrates both
explicit and implicit relations to fully exploit the correlations among the
triplets. Specifically, we design a vision compositor to fuse the reference
image and the target image first; the resulting representation then serves two
roles: (1) a counterpart for semantic alignment with the complementary text and
(2) a compensation for the complementary text to boost the explicit relation
modeling, thereby implanting the implicit relation into the alignment learning.
Our method is evaluated on two popular datasets, CIRR and FashionIQ, through
extensive experiments. The results confirm the effectiveness of our
dual-relation learning in substantially enhancing composed image retrieval
performance.
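To make the two roles of the fused visual representation concrete, below is a minimal training-time sketch in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the compositor architecture, the additive query fusion, and the temperature value are all hypothetical. It only shows how the implicit alignment (reference & target image vs. complementary text) and the explicit alignment (composed query vs. target image) could be trained jointly with contrastive losses.

```python
# Hypothetical sketch of dual relation alignment; module design and fusion
# choices are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisionCompositor(nn.Module):
    """Fuses reference-image and target-image features (assumed MLP design)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, ref_feat: torch.Tensor, tgt_feat: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([ref_feat, tgt_feat], dim=-1))


def dual_relation_losses(ref_feat, tgt_feat, txt_feat, compositor, tau: float = 0.07):
    """Illustrative contrastive losses for the explicit and implicit relations."""
    labels = torch.arange(ref_feat.size(0), device=ref_feat.device)

    # Implicit relation: (reference image + target image) <-> complementary text.
    vis_comp = F.normalize(compositor(ref_feat, tgt_feat), dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    loss_implicit = F.cross_entropy(vis_comp @ txt.t() / tau, labels)

    # Explicit relation: (reference image + complementary text) <-> target image,
    # with the fused visual representation compensating the text query
    # (simple additive fusion here, purely as an assumption).
    ref = F.normalize(ref_feat, dim=-1)
    tgt = F.normalize(tgt_feat, dim=-1)
    query = F.normalize(ref + txt + vis_comp, dim=-1)
    loss_explicit = F.cross_entropy(query @ tgt.t() / tau, labels)

    return loss_explicit + loss_implicit
```

At retrieval time the target image is unknown, so only the explicit query path would be used; in this reading, the implicit branch acts purely as an auxiliary training signal.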
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Hire: Hybrid-modal Interaction with Multiple Relational Enhancements for Image-Text Matching [7.7559623054251]
Image-text matching (ITM) is a fundamental problem in computer vision.
We propose a Hybrid-modal feature Interaction with multiple Relational Enhancements (termed Hire) for image-text matching.
In particular, the explicit intra-modal spatial-semantic graph-based reasoning network is designed to improve the contextual representation of visual objects.
arXiv Detail & Related papers (2024-06-05T13:10:55Z) - CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval [15.45550686770835]
Composed Image Retrieval (CIR) involves searching for target images based on an image-text pair query.
We argue that CIR triplets contain additional associations beyond this primary relation.
In our paper, we identify two new relations within triplets, treating each triplet as a graph node.
arXiv Detail & Related papers (2024-05-29T14:52:10Z)
- Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training [33.78990448307792]
Image-text retrieval is a central problem for understanding the semantic relationship between vision and language.
Previous works either simply learn coarse-grained representations of the overall image and text, or elaborately establish the correspondence between image regions or pixels and text words.
In this work, we address image-text retrieval from a novel perspective by combining coarse- and fine-grained representation learning into a unified framework.
arXiv Detail & Related papers (2023-06-15T00:19:13Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up Cross-modal Semantic Composition (BOSS) with Hybrid Counterfactual Training framework.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
- Transformer-based Dual Relation Graph for Multi-label Image Recognition [56.12543717723385]
We propose a novel Transformer-based Dual Relation learning framework.
We explore two aspects of correlation, i.e., structural relation graph and semantic relation graph.
Our approach achieves new state-of-the-art on two popular multi-label recognition benchmarks.
arXiv Detail & Related papers (2021-10-10T07:14:52Z)
- Cross-Modal Coherence for Text-to-Image Retrieval [35.82045187976062]
We train a Cross-Modal Coherence Model for the text-to-image retrieval task.
Our analysis shows that models trained with image-text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models.
Our findings provide insights into the ways that different modalities communicate and the role of coherence relations in capturing commonsense inferences in text and imagery.
arXiv Detail & Related papers (2021-09-22T21:31:27Z)
- Tensor Composition Net for Visual Relationship Prediction [115.14829858763399]
We present a novel Tensor Composition Net (TCN) to predict visual relationships in images.
The key idea of our TCN is to exploit the low rank property of the visual relationship tensor.
We show our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.
arXiv Detail & Related papers (2020-12-10T06:27:20Z)
- Consensus-Aware Visual-Semantic Embedding for Image-Text Matching [69.34076386926984]
Image-text matching plays a central role in bridging vision and language.
Most existing approaches only rely on the image-text instance pair to learn their representations.
We propose a Consensus-aware Visual-Semantic Embedding model to incorporate the consensus information.
arXiv Detail & Related papers (2020-07-17T10:22:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.