Decompose Semantic Shifts for Composed Image Retrieval
- URL: http://arxiv.org/abs/2309.09531v1
- Date: Mon, 18 Sep 2023 07:21:30 GMT
- Title: Decompose Semantic Shifts for Composed Image Retrieval
- Authors: Xingyu Yang, Daqing Liu, Heng Zhang, Yong Luo, Chaoyue Wang, Jing Zhang
- Abstract summary: Composed image retrieval is an image retrieval task in which the user provides a reference image as a starting point and a text specifying how to shift from that starting point to the desired target image.
We propose a Semantic Shift network (SSN) that explicitly decomposes the semantic shifts into two steps: from the reference image to the visual prototype and from the visual prototype to the target image.
The proposed SSN achieves significant improvements of 5.42% and 1.37% on the CIRR and FashionIQ datasets, respectively, establishing new state-of-the-art performance.
- Score: 38.262678009072154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Composed image retrieval is a type of image retrieval task where the user
provides a reference image as a starting point and a text specifying how to
shift from that starting point to the desired target image. However, most
existing methods focus on the composition learning of text and reference images
and oversimplify the text as a description, neglecting the inherent structure
of the text and the user's shifting intention. As a result, these methods
typically take shortcuts that disregard the visual cues of the reference images.
To address this issue, we reconsider the text as instructions and propose a
Semantic Shift network (SSN) that explicitly decomposes the semantic shifts
into two steps: from the reference image to the visual prototype and from the
visual prototype to the target image. Specifically, SSN explicitly decomposes
the instructions into two components: degradation and upgradation, where the
degradation is used to picture the visual prototype from the reference image,
while the upgradation is used to enrich the visual prototype into the final
representations to retrieve the desired target image. The experimental results
show that the proposed SSN achieves significant improvements of 5.42% and
1.37% on the CIRR and FashionIQ datasets, respectively, establishing new
state-of-the-art performance. Code will be made publicly available.
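To make the two-step decomposition concrete, below is a minimal PyTorch sketch of the idea, assuming cross-attention as the composition mechanism. The module names, the subtract-then-add composition, and the mean pooling are illustrative guesses, not the paper's actual architecture.

```python
import torch.nn as nn

class SemanticShiftSketch(nn.Module):
    """Hypothetical two-step shift: degrade to a visual prototype, then upgrade."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Step 1 (degradation): attend to the "what to remove" instruction tokens.
        self.degrade = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Step 2 (upgradation): attend to the "what to add" instruction tokens.
        self.upgrade = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, ref_tokens, degrade_text, upgrade_text):
        # ref_tokens:   (B, N, D) patch features of the reference image
        # degrade_text: (B, T1, D) tokens describing content to drop
        # upgrade_text: (B, T2, D) tokens describing content to add
        removed, _ = self.degrade(ref_tokens, degrade_text, degrade_text)
        prototype = self.norm1(ref_tokens - removed)   # visual prototype
        added, _ = self.upgrade(prototype, upgrade_text, upgrade_text)
        target = self.norm2(prototype + added)         # target representation
        return target.mean(dim=1)                      # pooled retrieval query
```

In a full system, the pooled query would be scored against candidate image embeddings (e.g., by cosine similarity) and trained with a contrastive loss.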
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed method on two public referring remote sensing image segmentation datasets, RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z)
- Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval [53.89454443114146]
We study the zero-shot Composed Image Retrieval (ZS-CIR) task, which is to retrieve the target image given a reference image and a description without training on the triplet datasets.
Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space.
We propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs).
KEDs implicitly models the attributes of the reference images by incorporating a database.
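The pseudo-word-token recipe of the prior works mentioned above (not KEDs' full knowledge-enhanced dual-stream design) can be sketched as follows; the projector shape and the prompt template are assumptions.

```python
import torch
import torch.nn as nn

class PseudoTokenProjector(nn.Module):
    """Map a CLIP-style image feature to one pseudo word token (illustrative)."""

    def __init__(self, img_dim: int = 768, txt_dim: int = 512):
        super().__init__()
        # Small MLP from the image feature space into the text token space.
        self.proj = nn.Sequential(
            nn.Linear(img_dim, txt_dim), nn.GELU(), nn.Linear(txt_dim, txt_dim)
        )

    def forward(self, image_feat: torch.Tensor) -> torch.Tensor:
        # image_feat: (B, img_dim) -> pseudo token: (B, 1, txt_dim)
        return self.proj(image_feat).unsqueeze(1)

# The pseudo token stands in for the placeholder "*" in a prompt such as
# "a photo of * that <relative caption>"; the prompt's text embedding is
# then matched against candidate image embeddings.
```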
arXiv Detail & Related papers (2024-03-24T04:23:56Z)
- Benchmarking Robustness of Text-Image Composed Retrieval [46.98557472744255]
Text-image composed retrieval aims to retrieve the target image through the composed query.
It has recently attracted attention due to its ability to leverage both information-rich images and concise language.
However, the robustness of these approaches against real-world corruptions, as well as the depth of their text understanding, has not yet been studied.
arXiv Detail & Related papers (2023-11-24T20:16:38Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
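A rough sketch of this idea, assuming the public Hugging Face BLIP-2 checkpoint and an illustrative prompt template; the paper's exact prompt-generation procedure may differ.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Assumed public checkpoint, used here only as a generic captioner.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def sentence_level_prompt(reference: Image.Image, relative_caption: str) -> str:
    # Caption the reference image, then splice in the modification text.
    inputs = processor(images=reference, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=30)
    caption = processor.batch_decode(ids, skip_special_tokens=True)[0].strip()
    return f"{caption}, but {relative_caption}"  # hypothetical template
```

The resulting sentence can then be fed to an off-the-shelf text-to-image retriever.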
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, text embedding is utilized to query the visual feature and localize the corresponding target.
Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual feature.
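A minimal sketch of the two branches, assuming standard multi-head cross-attention; using the attention map as a coarse localization signal and a linear head for word reconstruction are illustrative choices, not DMMI's actual decoders.

```python
import torch.nn as nn

class DualDecoderSketch(nn.Module):
    """Illustrative text-to-image and image-to-text branches."""

    def __init__(self, dim: int = 256, heads: int = 8, vocab: int = 30522):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.word_head = nn.Linear(dim, vocab)  # predicts the erased words

    def forward(self, vis_tokens, txt_tokens, erased_txt):
        # vis_tokens: (B, N, D); txt_tokens, erased_txt: (B, T, D)
        # Text-to-image: text queries attend visual tokens; the attention
        # map over the N visual tokens coarsely localizes the target.
        grounded, attn = self.t2i(txt_tokens, vis_tokens, vis_tokens)
        # Image-to-text: erased text queries attend visual tokens to
        # reconstruct the removed entity phrase.
        recovered, _ = self.i2t(erased_txt, vis_tokens, vis_tokens)
        return grounded, attn, self.word_head(recovered)
```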
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
- Bi-directional Training for Composed Image Retrieval via Text Prompt Learning [46.60334745348141]
Composed image retrieval searches for a target image based on a multi-modal user query comprised of a reference image and modification text.
We propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures.
Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model.
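The reversed-query idea can be written as a symmetric contrastive objective; the InfoNCE form and temperature below are assumptions, and the paper's learned text prompt for reversing the modification text is not reproduced.

```python
import torch
import torch.nn.functional as F

def bidirectional_loss(q_fwd, tgt_emb, q_bwd, ref_emb, tau: float = 0.07):
    """Symmetric InfoNCE sketch for bi-directional CIR training.

    q_fwd: (B, D) query from (reference image + text); should match the target.
    q_bwd: (B, D) query from (target image + reversed text); should match the reference.
    """
    q_fwd, tgt_emb = F.normalize(q_fwd, dim=-1), F.normalize(tgt_emb, dim=-1)
    q_bwd, ref_emb = F.normalize(q_bwd, dim=-1), F.normalize(ref_emb, dim=-1)
    labels = torch.arange(q_fwd.size(0), device=q_fwd.device)
    loss_fwd = F.cross_entropy(q_fwd @ tgt_emb.t() / tau, labels)
    loss_bwd = F.cross_entropy(q_bwd @ ref_emb.t() / tau, labels)
    return loss_fwd + loss_bwd
```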
arXiv Detail & Related papers (2023-03-29T11:37:41Z)
- Semantic-Preserving Augmentation for Robust Image-Text Retrieval [27.2916415148638]
RVSE consists of novel image-based and text-based augmentation techniques called semantic-preserving augmentation for image (SPAugI) and text (SPAugT).
Since SPAugI and SPAugT change the original data in a way that preserves its semantic information, the feature extractors are forced to generate semantic-aware embedding vectors.
From extensive experiments using benchmark datasets, we show that RVSE outperforms conventional retrieval schemes in terms of image-text retrieval performance.
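As a rough illustration, semantic-preserving augmentation perturbs appearance and surface form while keeping the described content intact; the transforms and synonym table below are illustrative stand-ins, not the paper's SPAugI/SPAugT recipes.

```python
import random
import torchvision.transforms as T

# Image side: photometric jitter keeps the depicted objects unchanged, so
# the paired caption still describes the augmented image. (A horizontal
# flip would only be safe for captions without left/right relations.)
spaug_image = T.Compose([
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
])

# Text side: swap words for hand-picked synonyms so the meaning is kept.
SYNONYMS = {"photo": "picture", "man": "person", "small": "little"}

def spaug_text(caption: str, p: float = 0.3) -> str:
    words = caption.split()
    return " ".join(
        SYNONYMS.get(w, w) if random.random() < p else w for w in words
    )
```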
arXiv Detail & Related papers (2023-03-10T03:50:44Z)
- DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z)
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) a natural-language instruction describing the desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.