Learning with Multi-modal Gradient Attention for Explainable Composed
Image Retrieval
- URL: http://arxiv.org/abs/2308.16649v1
- Date: Thu, 31 Aug 2023 11:46:27 GMT
- Title: Learning with Multi-modal Gradient Attention for Explainable Composed
Image Retrieval
- Authors: Prateksha Udhayanan, Srikrishna Karanam, and Balaji Vasan Srinivasan
- Abstract summary: We propose a new gradient-attention-based learning objective that explicitly forces the model to focus on the local regions of interest being modified in each retrieval step.
We show how multi-modal gradient attention (MMGrad) can be incorporated into an end-to-end model training strategy with a new learning objective that explicitly forces these MMGrad attention maps to highlight the correct local regions corresponding to the modifier text.
- Score: 15.24270990274781
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of composed image retrieval, where the input
query consists of an image and a modification text describing the desired
changes, and the goal is to retrieve images that reflect those changes.
Current state-of-the-art techniques use global features for retrieval;
because of the global nature of these features, they incorrectly localize
the regions of interest to be modified, especially on real-world,
in-the-wild images. Since modifier texts usually correspond to
specific local changes in an image, it is critical that models learn local
features to be able to both localize and retrieve better. To this end, our key
novelty is a new gradient-attention-based learning objective that explicitly
forces the model to focus on the local regions of interest being modified in
each retrieval step. We achieve this by first proposing a new visual
attention computation technique, which we call multi-modal gradient attention
(MMGrad) and which is explicitly conditioned on the modifier text. We next
demonstrate how MMGrad can be incorporated into an end-to-end model training
strategy with a new learning objective that explicitly forces these MMGrad
attention maps to highlight the correct local regions corresponding to the
modifier text. By training retrieval models with this new loss function, we
show improved grounding by means of better visual attention maps, leading to
better explainability of the models as well as competitive quantitative
retrieval performance on standard benchmark datasets.
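
The abstract does not give MMGrad's exact formulation, but the description (a gradient-based attention map conditioned on the modifier text, plus a loss that forces the map onto the region the text refers to) suggests a Grad-CAM-style construction. The following is a minimal PyTorch sketch under that assumption; the similarity score, the channel weighting, and the mask-based supervision target are illustrative stand-ins, not the paper's actual method.

```python
# Hypothetical sketch of multi-modal gradient attention (MMGrad) and an
# attention-alignment loss; details are assumptions, not the paper's method.
import torch
import torch.nn.functional as F

def mmgrad_attention(feature_map: torch.Tensor, score: torch.Tensor) -> torch.Tensor:
    """Gradient-attention map over conv features, conditioned on the text.

    feature_map: (B, C, H, W) image-encoder activations (requires_grad=True).
    score:       (B,) per-example similarity between the composed
                 (image + modifier text) query embedding and the target
                 embedding; differentiating this scalar w.r.t. the features
                 is what makes the map text-conditioned.
    """
    # create_graph=True keeps the map differentiable, so a loss on it can
    # backpropagate into the encoders during end-to-end training.
    grads, = torch.autograd.grad(score.sum(), feature_map, create_graph=True)
    weights = grads.mean(dim=(2, 3), keepdim=True)      # Grad-CAM-style channel weights
    attn = F.relu((weights * feature_map).sum(dim=1))   # (B, H, W)
    return attn / (attn.amax(dim=(1, 2), keepdim=True) + 1e-8)  # normalize to [0, 1]

def attention_alignment_loss(attn: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
    """Penalize attention mass outside the region the modifier text refers to.

    region_mask: (B, H, W) binary map marking the 'correct' local region; a
    placeholder for whatever supervision the paper actually derives.
    """
    return (attn * (1.0 - region_mask)).sum(dim=(1, 2)).mean()
```

In training, a term like `attention_alignment_loss` would presumably be added, with some weight, to the standard retrieval objective (e.g., a triplet or batch-based contrastive loss), so that the model is optimized jointly for retrieval accuracy and for grounding its attention in the regions named by the modifier text.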
Related papers
- Context-Based Visual-Language Place Recognition [4.737519767218666]
A popular approach to visual place recognition (VPR) relies on low-level visual features.
We introduce a novel VPR approach that remains robust to scene changes and does not require additional training.
Our method constructs semantic image descriptors by extracting pixel-level embeddings using a zero-shot, language-driven semantic segmentation model.
(arXiv 2024-10-25)
- Question-Answer Cross Language Image Matching for Weakly Supervised Semantic Segmentation [37.15828464616587]
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
(arXiv 2024-01-18)
- LIME: Localized Image Editing via Attention Regularization in Diffusion Models [74.3811832586391]
This paper introduces LIME, a method for localized image editing in diffusion models that does not require user-specified regions of interest (RoI) or additional text input.
Our method employs features from pre-trained methods and a simple clustering technique to obtain precise semantic segmentation maps.
We propose a novel cross-attention regularization technique that penalizes unrelated cross-attention scores in the RoI during the denoising steps, ensuring localized edits.
(arXiv 2023-12-14)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
(arXiv 2022-08-30)
- Situational Perception Guided Image Matting [16.1897179939677]
We propose a Situational Perception Guided Image Matting (SPG-IM) method that mitigates subjective bias of matting annotations.
SPG-IM better associates inter-object and object-to-environment saliency, and compensates for the subjective nature of image matting.
(arXiv 2022-04-20)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
(arXiv 2022-04-04)
- Region-level Active Learning for Cluttered Scenes [60.93811392293329]
We introduce a new strategy that subsumes previous Image-level and Object-level approaches into a generalized, Region-level approach.
We show that this approach significantly decreases labeling effort and improves rare object search on realistic data with inherent class-imbalance and cluttered scenes.
(arXiv 2021-08-20)
- Cross-Descriptor Visual Localization and Mapping [81.16435356103133]
Visual localization and mapping is the key technology underlying the majority of Mixed Reality and robotics systems.
We present three novel scenarios for localization and mapping which require the continuous update of feature representations.
Our data-driven approach is agnostic to the feature descriptor type, has low computational requirements, and scales linearly with the number of description algorithms.
(arXiv 2020-12-02)
- Look here! A parametric learning based approach to redirect visual attention [49.609412873346386]
We introduce an automatic method to make an image region more attention-capturing via subtle image edits.
Our model predicts a distinct set of global parametric transformations to be applied to the foreground and background image regions.
Our edits enable inference at interactive rates on any image size, and easily generalize to videos.
(arXiv 2020-08-12)
- Unifying Deep Local and Global Features for Image Search [9.614694312155798]
We unify global and local image features into a single deep model, enabling accurate retrieval with efficient feature extraction.
Our model achieves state-of-the-art image retrieval on the Revisited Oxford and Paris datasets, and state-of-the-art single-model instance-level recognition on the Google Landmarks dataset v2.
(arXiv 2020-01-14)