Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval
- URL: http://arxiv.org/abs/2304.03391v1
- Date: Thu, 6 Apr 2023 21:45:46 GMT
- Title: Exposing and Mitigating Spurious Correlations for Cross-Modal Retrieval
- Authors: Jae Myung Kim, A. Sophia Koepke, Cordelia Schmid, Zeynep Akata
- Abstract summary: Cross-modal retrieval methods are the preferred tool to search databases for the text that best matches a query image and vice versa.
Image-text retrieval models commonly learn to memorize spurious correlations in the training data, such as frequent object co-occurrence.
We introduce ODmAP@k, an object decorrelation metric that measures a model's robustness to spurious correlations in the training data.
- Score: 89.30660533051514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal retrieval methods are the preferred tool to search databases for
the text that best matches a query image and vice versa. However, image-text
retrieval models commonly learn to memorize spurious correlations in the
training data, such as frequent object co-occurrence, instead of looking at the
actual underlying reasons for the prediction in the image. For image-text
retrieval, this manifests in retrieved sentences that mention objects that are
not present in the query image. In this work, we introduce ODmAP@k, an object
decorrelation metric that measures a model's robustness to spurious
correlations in the training data. We use automatic image and text
manipulations to control the presence of such object correlations in designated
test data. Additionally, our data synthesis technique is used to tackle model
biases due to spurious correlations of semantically unrelated objects in the
training data. We apply our proposed pipeline, which involves the finetuning of
image-text retrieval frameworks on carefully designed synthetic data, to three
state-of-the-art models for image-text retrieval. This results in significant
improvements for all three models, both in terms of the standard retrieval
performance and in terms of our object decorrelation metric. The code is
available at https://github.com/ExplainableML/Spurious_CM_Retrieval.
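The abstract names ODmAP@k but does not define it. As a rough point of reference only, the sketch below computes a standard mAP@k over ranked retrieval results. It is not the paper's metric: the function names, the integer item ids, and the choice to normalise by min(number of relevant items, k) are all assumptions made for illustration.

```python
from typing import Sequence, Set


def average_precision_at_k(ranked_ids: Sequence[int], relevant_ids: Set[int], k: int) -> float:
    """AP@k for one query.

    Sums precision@i over the ranks i <= k that hold a relevant item,
    then divides by min(len(relevant_ids), k). This normalisation is one
    common convention and is an assumption here, not the paper's definition.
    """
    if k <= 0 or not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(len(relevant_ids), k)


def mean_average_precision_at_k(
    rankings: Sequence[Sequence[int]],
    relevants: Sequence[Set[int]],
    k: int,
) -> float:
    """Mean of AP@k over all queries; returns 0.0 when there are no queries."""
    scores = [average_precision_at_k(r, rel, k) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0
```

For image-to-text retrieval, each entry of `rankings` would hold the caption ids returned for one query image, and the matching entry of `relevants` the ids of its ground-truth captions.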
Related papers
- Nearest Neighbor Normalization Improves Multimodal Retrieval [30.076028359751614]
We present Nearest Neighbor Normalization (NNN), a method for correcting errors in trained contrastive image-text retrieval models with no additional training.
NNN requires a reference database, but does not require any training on this database, and can even increase the retrieval accuracy of a model after finetuning.
arXiv Detail & Related papers (2024-10-31T16:44:10Z)
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Spuriousness-Aware Meta-Learning for Learning Robust Classifiers [26.544938760265136]
Spurious correlations are brittle associations between certain attributes of inputs and target variables.
Deep image classifiers often leverage them for predictions, leading to poor generalization on the data where the correlations do not hold.
Mitigating the impact of spurious correlations is crucial towards robust model generalization, but it often requires annotations of the spurious correlations in data.
arXiv Detail & Related papers (2024-06-15T21:41:25Z)
- Adapting Dual-encoder Vision-language Models for Paraphrased Retrieval [55.90407811819347]
We consider the task of paraphrased text-to-image retrieval where a model aims to return similar results given a pair of paraphrased queries.
We train a dual-encoder model starting from a language model pretrained on a large text corpus.
Compared to public dual-encoder models such as CLIP and OpenCLIP, the model trained with our best adaptation strategy achieves a significantly higher ranking similarity for paraphrased queries.
arXiv Detail & Related papers (2024-05-06T06:30:17Z)
- Relation Rectification in Diffusion Model [64.84686527988809]
We introduce a novel task termed Relation Rectification, aiming to refine the model to accurately represent a given relationship it initially fails to generate.
We propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN).
The lightweight HGCN adjusts the text embeddings generated by the text encoder, ensuring the accurate reflection of the textual relation in the embedding space.
arXiv Detail & Related papers (2024-03-29T15:54:36Z)
- Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously-unseen objects from monocular images.
We follow a retrieval-based strategy and prevent the network from learning object-specific features.
Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN), which processes images and sentences symmetrically using recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)