Zero-shot Composed Text-Image Retrieval
- URL: http://arxiv.org/abs/2306.07272v2
- Date: Wed, 6 Mar 2024 07:16:06 GMT
- Title: Zero-shot Composed Text-Image Retrieval
- Authors: Yikun Liu and Jiangchao Yao and Ya Zhang and Yanfeng Wang and Weidi
Xie
- Abstract summary: We consider the problem of composed image retrieval (CIR)
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
- Score: 72.43790281036584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the problem of composed image retrieval (CIR), it
aims to train a model that can fuse multi-modal information, e.g., text and
images, to accurately retrieve images that match the query, extending the
user's expression ability. We make the following contributions: (i) we initiate
a scalable pipeline to automatically construct datasets for training CIR model,
by simply exploiting a large-scale dataset of image-text pairs, e.g., a subset
of LAION-5B; (ii) we introduce a transformer-based adaptive aggregation model,
TransAgg, which employs a simple yet efficient fusion mechanism, to adaptively
combine information from diverse modalities; (iii) we conduct extensive
ablation studies to investigate the usefulness of our proposed data
construction procedure, and the effectiveness of core components in TransAgg;
(iv) when evaluating on the publicly available benckmarks under the zero-shot
scenario, i.e., training on the automatically constructed datasets, then
directly conduct inference on target downstream datasets, e.g., CIRR and
FashionIQ, our proposed approach either performs on par with or significantly
outperforms the existing state-of-the-art (SOTA) models. Project page:
https://code-kunkun.github.io/ZS-CIR/
Related papers
- Spherical Linear Interpolation and Text-Anchoring for Zero-shot Composed Image Retrieval [43.47770490199544]
Composed Image Retrieval (CIR) is a complex task that retrieves images using a query, which is configured with an image and a caption.
We introduce a novel ZS-CIR method that uses Spherical Linear Interpolation (Slerp) to directly merge image and text representations.
We also introduce Text-Anchored-Tuning (TAT), a method that fine-tunes the image encoder while keeping the text encoder fixed.
arXiv Detail & Related papers (2024-05-01T15:19:54Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets of the reference image, text, target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval [92.13664084464514]
The task of composed image retrieval (CIR) aims to retrieve images based on the query image and the text describing the users' intent.
Existing methods have made great progress with the advanced large vision-language (VL) model in CIR task, however, they generally suffer from two main issues: lack of labeled triplets for model training and difficulty of deployment on resource-restricted environments.
We propose Image2Sentence based Asymmetric zero-shot composed image retrieval (ISA), which takes advantage of the VL model and only relies on unlabeled images for composition learning.
arXiv Detail & Related papers (2024-03-03T07:58:03Z) - Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - Data Roaming and Quality Assessment for Composed Image Retrieval [25.452015862927766]
Composed Image Retrieval (CoIR) involves queries that combine image and text modalities, allowing users to express their intent more effectively.
We introduce the Large Scale Composed Image Retrieval (LaSCo) dataset, a new CoIR dataset which is ten times larger than existing ones.
We also introduce a new CoIR baseline, the Cross-Attention driven Shift (CASE)
arXiv Detail & Related papers (2023-03-16T16:02:24Z) - MOGAN: Morphologic-structure-aware Generative Learning from a Single
Image [59.59698650663925]
Recently proposed generative models complete training based on only one image.
We introduce a MOrphologic-structure-aware Generative Adversarial Network named MOGAN that produces random samples with diverse appearances.
Our approach focuses on internal features including the maintenance of rational structures and variation on appearance.
arXiv Detail & Related papers (2021-03-04T12:45:23Z) - MosAIc: Finding Artistic Connections across Culture with Conditional
Image Retrieval [27.549695661396274]
We introduce Conditional Image Retrieval (CIR) which combines visual similarity search with user supplied filters or "conditions"
CIR allows one to find pairs of similar images that span distinct subsets of the image corpus.
We show that our CIR data-structures can identify "blind spots" in Generative Adversarial Networks (GAN) where they fail to properly model the true data distribution.
arXiv Detail & Related papers (2020-07-14T16:50:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.