Bi-directional Training for Composed Image Retrieval via Text Prompt Learning
- URL: http://arxiv.org/abs/2303.16604v2
- Date: Sun, 5 Nov 2023 11:47:05 GMT
- Title: Bi-directional Training for Composed Image Retrieval via Text Prompt Learning
- Authors: Zheyuan Liu, Weixuan Sun, Yicong Hong, Damien Teney, Stephen Gould
- Abstract summary: Composed image retrieval searches for a target image based on a multi-modal user query comprising a reference image and modification text.
We propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures.
Experiments on two standard datasets show that our novel approach achieves improved performance over a baseline BLIP-based model.
- Score: 46.60334745348141
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Composed image retrieval searches for a target image based on a multi-modal
user query comprising a reference image and modification text describing the
desired changes. Existing approaches to solving this challenging task learn a
mapping from the (reference image, modification text)-pair to an image
embedding that is then matched against a large image corpus. One area that has
not yet been explored is the reverse direction, which asks the question, what
reference image when modified as described by the text would produce the given
target image? In this work we propose a bi-directional training scheme that
leverages such reversed queries and can be applied to existing composed image
retrieval architectures with minimal changes, improving the model's
performance. To encode the bi-directional query, we prepend a learnable token to
the modification text that designates the direction of the query and then
finetune the parameters of the text embedding module. We make no other changes
to the network architecture. Experiments on two standard datasets show that our
approach outperforms a baseline BLIP-based model that is itself already
competitive. Our code is released at
https://github.com/Cuberick-Orion/Bi-Blip4CIR.
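To make the scheme concrete, below is a minimal PyTorch sketch of the direction-token idea described in the abstract: one learnable token per query direction is prepended to the modification-text embeddings before they enter the finetuned text embedding module. All class and variable names here are illustrative assumptions; the released code at the repository above is authoritative.

```python
import torch
import torch.nn as nn

class BiDirectionalQueryEncoder(nn.Module):
    """Sketch: prepend a learnable direction token to the modification text."""
    def __init__(self, text_encoder: nn.Module, embed_dim: int = 768):
        super().__init__()
        self.text_encoder = text_encoder  # pretrained text embedding module (finetuned)
        # One learnable token per query direction (an assumption; the paper may
        # mark only the reversed direction).
        self.forward_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)
        self.reverse_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    def forward(self, token_embeds: torch.Tensor, reversed_query: bool) -> torch.Tensor:
        # token_embeds: (batch, seq_len, embed_dim) embeddings of the modification text
        token = self.reverse_token if reversed_query else self.forward_token
        token = token.expand(token_embeds.size(0), -1, -1)
        return self.text_encoder(torch.cat([token, token_embeds], dim=1))
```

Under this scheme, each (reference, text, target) triplet yields two training examples: a forward query that should retrieve the target, and a reversed query that should retrieve the reference.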
Related papers
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained vision-language (V-L) models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
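One plausible reading of that pipeline, sketched with the Hugging Face transformers BLIP-2 API: caption the reference image, then fuse the caption with the relative caption into a single sentence-level prompt. The fusion template below is an assumption for illustration; the paper's actual prompt construction may differ.

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def sentence_level_prompt(image_path: str, relative_caption: str) -> str:
    # Caption the reference image with BLIP-2, then splice in the relative caption.
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    ids = model.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(ids[0], skip_special_tokens=True).strip()
    return f"{caption}, but {relative_caption}"  # e.g. "a red dress, but with long sleeves"
```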
- Beyond One-to-One: Rethinking the Referring Image Segmentation [117.53010476628029]
Referring image segmentation aims to segment the target object referred to by a natural language expression.
We propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches.
In the text-to-image decoder, the text embedding queries the visual features to localize the corresponding target.
Meanwhile, the image-to-text decoder reconstructs the erased entity phrase conditioned on the visual features.
arXiv Detail & Related papers (2023-08-26T11:39:22Z)
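The two branches can be pictured with a short PyTorch sketch; the dimensions, pooling, and attention heads below are assumptions, and the actual DMMI decoders are more elaborate.

```python
import torch
import torch.nn as nn

class DualDecoderSketch(nn.Module):
    """Illustrative two-branch decoder in the spirit of DMMI (not the exact design)."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.t2i_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.recon_head = nn.Linear(dim, dim)

    def forward(self, text_feats, image_feats):
        # Text-to-image branch: text tokens query the visual features.
        tgt, _ = self.t2i_attn(text_feats, image_feats, image_feats)
        # Per-location mask logits from similarity with the pooled text query.
        mask_logits = torch.einsum("bqd,bnd->bn", tgt.mean(1, keepdim=True), image_feats)
        # Image-to-text branch: reconstruct the erased entity phrase from visual context.
        recon, _ = self.i2t_attn(text_feats, image_feats, image_feats)
        return mask_logits, self.recon_head(recon)
```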
- Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
arXiv Detail & Related papers (2023-08-08T15:43:59Z)
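A compact sketch of the two-stream layout with alignment at several depths follows; the per-level dimensions, projections, and mean pooling are assumptions, and both encoders are assumed to return one pooled feature per selected layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAlignmentSketch(nn.Module):
    """Two-stream Transformers aligned at multiple levels, in the spirit of HAT."""
    def __init__(self, img_encoder, txt_encoder, dims=(256, 512, 768), shared=256):
        super().__init__()
        self.img_encoder, self.txt_encoder = img_encoder, txt_encoder
        self.img_proj = nn.ModuleList(nn.Linear(d, shared) for d in dims)
        self.txt_proj = nn.ModuleList(nn.Linear(d, shared) for d in dims)

    def forward(self, image, text):
        img_levels = self.img_encoder(image)  # list of (B, d_i) pooled features per level
        txt_levels = self.txt_encoder(text)
        sims = [F.cosine_similarity(pi(i), pt(t), dim=-1)
                for pi, pt, i, t in zip(self.img_proj, self.txt_proj, img_levels, txt_levels)]
        return torch.stack(sims).mean(0)  # paired image-text similarity averaged over levels
```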
- BOSS: Bottom-up Cross-modal Semantic Composition with Hybrid Counterfactual Training for Robust Content-based Image Retrieval [61.803481264081036]
Content-Based Image Retrieval (CIR) aims to search for a target image by concurrently comprehending the composition of an example image and a complementary text.
We tackle this task with a novel Bottom-up crOss-modal Semantic compoSition (BOSS) framework with Hybrid Counterfactual Training.
arXiv Detail & Related papers (2022-07-09T07:14:44Z)
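The counterfactual-training ingredient can be illustrated generically: compose each reference image with a mismatched (shuffled) text to form counterfactual queries that act as extra negatives in a contrastive loss. This sketches the general idea only, not BOSS's specific hybrid scheme; `compose` is a hypothetical composition network.

```python
import torch
import torch.nn.functional as F

def counterfactual_infonce(compose, img_feats, txt_feats, tgt_feats, tau=0.07):
    """InfoNCE with counterfactual queries as extra negatives (illustrative only)."""
    q = compose(img_feats, txt_feats)                    # factual composed queries (B, D)
    cf = compose(img_feats, txt_feats.roll(1, dims=0))   # counterfactual queries (B, D)
    logits = q @ tgt_feats.t() / tau                     # similarity to in-batch targets
    cf_logits = (cf * tgt_feats).sum(-1, keepdim=True) / tau  # counterfactual vs. own target
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(torch.cat([logits, cf_logits], dim=1), labels)
```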
- ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of the two query elements into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z)
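Those two views translate into a two-term scoring rule. A bare-bones sketch follows; the paper additionally applies text-guided attention inside each term, and the weights here are assumptions.

```python
import torch.nn.functional as F

def artemis_style_score(ref_img, text, tgt_img, alpha=1.0, beta=1.0):
    """Explicit matching (text vs. target) plus implicit similarity (reference vs. target)."""
    em = F.cosine_similarity(text, tgt_img, dim=-1)     # text-to-image matching term
    im = F.cosine_similarity(ref_img, tgt_img, dim=-1)  # image-to-image similarity term
    return alpha * em + beta * im
```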
- RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
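A rough sketch of residual composition combined with a single graph-convolution step; the three-node graph and the adjacency matrix are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ResidualCompositionSketch(nn.Module):
    """Residual text-image composition plus one GCN layer, loosely following RTIC."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.residual = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gcn_weight = nn.Linear(dim, dim)  # one graph-convolution layer

    def forward(self, img_feat, txt_feat, adj):
        # Residual branch: predict the *change* and add it to the image feature.
        composed = img_feat + self.residual(torch.cat([img_feat, txt_feat], dim=-1))
        # Graph convolution over a small node set: image, text, composed features.
        nodes = torch.stack([img_feat, txt_feat, composed], dim=1)  # (B, 3, D)
        return torch.relu(self.gcn_weight(adj @ nodes))[:, -1]      # return the composed node
```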
- Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal, including (1) a reference image and (2) a natural language instruction that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)