Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval
- URL: http://arxiv.org/abs/2507.05970v2
- Date: Wed, 06 Aug 2025 07:41:51 GMT
- Title: Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval
- Authors: Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su
- Abstract summary: Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. We propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS).
- Score: 19.520776313567737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
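The abstract only outlines the triplet-generation pipeline (LLM-generated prompts driving a text-to-image model, followed by filtering). Below is a minimal, illustrative sketch of such a loop, assuming a Stable Diffusion backbone via the `diffusers` library; the prompt generator and quality filter are hypothetical placeholders for the LLM prompting and filtering stages described in the paper, not the authors' released code.

```python
# Hypothetical sketch of a CIRHS-style triplet-generation loop.
# `generate_prompt_pair` and `passes_filter` are illustrative stand-ins.
import torch
from diffusers import StableDiffusionPipeline

def generate_prompt_pair(subject: str) -> tuple[str, str, str]:
    """Stand-in for the LLM step: a reference prompt, a target prompt that
    shares identical elements, and the modification text linking them."""
    ref = f"a photo of a {subject} on a wooden table"
    tgt = f"a photo of a {subject} on a wooden table at night"
    mod = "change the scene to nighttime"
    return ref, tgt, mod

def passes_filter(ref_img, tgt_img) -> bool:
    """Stand-in for the filtering stage (e.g., an image/text consistency
    check); always accepts here."""
    return True

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

triplets = []
for subject in ["red bicycle", "ceramic mug", "golden retriever"]:
    ref_prompt, tgt_prompt, modification = generate_prompt_pair(subject)
    # Reusing the same random seed for both generations encourages the
    # shared elements of the image pair to stay visually consistent.
    gen = torch.Generator(device="cuda").manual_seed(42)
    ref_img = pipe(ref_prompt, generator=gen).images[0]
    gen = torch.Generator(device="cuda").manual_seed(42)
    tgt_img = pipe(tgt_prompt, generator=gen).images[0]
    if passes_filter(ref_img, tgt_img):
        triplets.append(
            {"reference": ref_img, "text": modification, "target": tgt_img}
        )
```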
Related papers
- Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval [52.709090256954276]
Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query. We propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR.
arXiv Detail & Related papers (2025-05-26T13:17:50Z) - Scaling Prompt Instructed Zero Shot Composed Image Retrieval with Image-Only Data [39.17652541259225]
Composed Image Retrieval (CIR) is the task of retrieving images matching a reference image augmented with a text. We introduce an embedding reformulation architecture that effectively combines image and text modalities. Our model, named InstructCIR, outperforms state-of-the-art methods in zero-shot composed image retrieval on CIRR and FashionIQ datasets.
arXiv Detail & Related papers (2025-04-01T14:03:46Z) - CoLLM: A Large Language Model for Composed Image Retrieval [76.29725148964368]
Composed Image Retrieval (CIR) is a complex task that aims to retrieve images based on a multimodal query. We present CoLLM, a one-stop framework that generates triplets on-the-fly from image-caption pairs. We leverage Large Language Models (LLMs) to generate joint embeddings of reference images and modification texts.
arXiv Detail & Related papers (2025-03-25T17:59:50Z) - Triplet Synthesis For Enhancing Composed Image Retrieval via Counterfactual Image Generation [38.091197064787565]
Composed Image Retrieval (CIR) provides an effective way to manage and access large-scale visual data. This paper proposes a novel triplet synthesis method by leveraging counterfactual image generation.
arXiv Detail & Related papers (2025-01-22T07:18:46Z) - Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval [50.72924579220149]
Composed Image Retrieval (CIR) is a task that retrieves images similar to a query, based on a provided textual modification.
Current techniques rely on supervised learning for CIR models using labeled triplets consisting of a reference image, a modification text, and a target image.
We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data.
arXiv Detail & Related papers (2024-04-23T21:00:22Z) - Vision-by-Language for Training-Free Compositional Image Retrieval [78.60509831598745]
Compositional Image Retrieval (CIR) aims to retrieve the relevant target image from a database.
Recent research sidesteps the need for annotated training data by using large-scale vision-language models (VLMs).
We propose to tackle CIR in a training-free manner via Vision-by-Language (CIReVL).
arXiv Detail & Related papers (2023-10-13T17:59:38Z) - CoVR-2: Automatic Data Construction for Composed Video Retrieval [59.854331104466254]
Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together.
We propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs.
We also expand the scope of the task to include composed video retrieval (CoVR).
arXiv Detail & Related papers (2023-08-28T17:55:33Z) - Zero-shot Composed Text-Image Retrieval [72.43790281036584]
We consider the problem of composed image retrieval (CIR).
It aims to train a model that can fuse multi-modal information, e.g., text and images, to accurately retrieve images that match the query, extending the user's expression ability.
arXiv Detail & Related papers (2023-06-12T17:56:01Z)