FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval
- URL: http://arxiv.org/abs/2507.07135v1
- Date: Tue, 08 Jul 2025 23:02:10 GMT
- Title: FACap: A Large-scale Fashion Dataset for Fine-grained Composed Image Retrieval
- Authors: François Gardères, Shizhe Chen, Camille-Sovanneary Gauthier, Jean Ponce
- Abstract summary: FACap is a large-scale, automatically constructed fashion-domain CIR dataset. FashionBLIP-2 fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The composed image retrieval (CIR) task is to retrieve target images given a reference image and a modification text. Recent methods for CIR leverage large pretrained vision-language models (VLMs) and achieve good performance on general-domain concepts like color and texture. However, they still struggle with application domains like fashion, because the rich and diverse vocabulary used in fashion requires specific fine-grained vision and language understanding. An additional difficulty is the lack of large-scale fashion datasets with detailed and relevant annotations, due to the expensive cost of manual annotation by specialists. To address these challenges, we introduce FACap, a large-scale, automatically constructed fashion-domain CIR dataset. It leverages web-sourced fashion images and a two-stage annotation pipeline powered by a VLM and a large language model (LLM) to generate accurate and detailed modification texts. Then, we propose a new CIR model FashionBLIP-2, which fine-tunes the general-domain BLIP-2 model on FACap with lightweight adapters and multi-head query-candidate matching to better account for fine-grained fashion-specific information. FashionBLIP-2 is evaluated with and without additional fine-tuning on the Fashion IQ benchmark and the enhanced evaluation dataset enhFashionIQ, leveraging our pipeline to obtain higher-quality annotations. Experimental results show that the combination of FashionBLIP-2 and pretraining with FACap significantly improves the model's performance in fashion CIR, especially for retrieval with fine-grained modification texts, demonstrating the value of our dataset and approach in highly demanding environments such as e-commerce websites. Code is available at https://fgxaos.github.io/facap-paper-website/.
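The CIR task described above — rank candidate images given a reference image plus a modification text — can be illustrated with a minimal late-fusion baseline. This is a generic sketch, not FashionBLIP-2 itself (which uses learned adapters and multi-head query-candidate matching); the additive fusion and cosine ranking here are illustrative assumptions.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cir_rank(ref_img_emb, mod_text_emb, candidate_embs):
    """Rank candidate images for composed image retrieval.

    Generic late-fusion baseline: the query is the normalized sum of the
    reference-image and modification-text embeddings; candidates are
    ranked by cosine similarity to that query.
    """
    query = l2norm(ref_img_emb + mod_text_emb)
    sims = l2norm(candidate_embs) @ query   # one score per candidate
    return np.argsort(-sims), sims          # best-first ordering

# Toy example with random 8-d embeddings and 5 candidate images.
rng = np.random.default_rng(0)
ref, mod = rng.normal(size=8), rng.normal(size=8)
cands = rng.normal(size=(5, 8))
order, sims = cir_rank(ref, mod, cands)
```

In practice the embeddings would come from a VLM's image and text encoders, and the fusion would be learned rather than a fixed sum.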
Related papers
- Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval [23.472806734625774]
We propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR) to achieve precise image-text matching. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning.
arXiv Detail & Related papers (2025-08-06T02:44:08Z) - Trust the Model: Compact VLMs as In-Context Judges for Image-Text Data Quality [5.750869893508341]
Vision-language models (VLMs) extend conventional large language models by integrating visual data, enabling richer multimodal reasoning. We introduce a streamlined data filtration framework that employs a compact VLM fine-tuned on a high-quality annotated image-caption dataset. This model effectively evaluates and filters potential training samples based on caption and image quality and alignment.
arXiv Detail & Related papers (2025-07-27T07:20:25Z) - good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval [10.156187875858995]
Composed image retrieval (CIR) enables users to search for images using a reference image combined with textual modifications. We introduce good4cir, a structured pipeline leveraging vision-language models to generate high-quality synthetic annotations. Results demonstrate improved retrieval accuracy for CIR models trained on our pipeline-generated datasets.
arXiv Detail & Related papers (2025-03-22T22:33:56Z) - ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval [83.01358520910533]
We introduce a new framework that can boost the performance of large-scale pre-trained vision-language models. The approach, Enhanced Language-Image Pre-training (ELIP), uses the text query, via a simple mapping network, to predict a set of visual prompts. ELIP can easily be applied to the commonly used CLIP, SigLIP, and BLIP-2 networks.
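The idea of mapping a text query to a set of visual prompts can be sketched as follows. The layer sizes, prompt count, and single-layer mapping network here are assumptions for illustration only; the actual ELIP architecture is specified in its paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_text, d_vis, n_prompts = 16, 12, 4  # hypothetical dimensions

# Hypothetical weights of a one-layer mapping network from text-embedding
# space to a flat block of n_prompts visual prompt vectors.
W = rng.normal(scale=0.1, size=(d_text, n_prompts * d_vis))

def text_to_visual_prompts(text_emb):
    # Map one text-query embedding to a set of visual prompt tokens.
    return np.tanh(text_emb @ W).reshape(n_prompts, d_vis)

def prepend_prompts(patch_tokens, prompts):
    # Query-conditioned prompts are prepended to the image patch tokens
    # before the sequence enters the vision encoder.
    return np.concatenate([prompts, patch_tokens], axis=0)

text_emb = rng.normal(size=d_text)
patches = rng.normal(size=(49, d_vis))  # e.g. a 7x7 patch grid
tokens = prepend_prompts(patches, text_to_visual_prompts(text_emb))
```

The vision encoder then attends over the prompt tokens alongside the patch tokens, making the image representation query-aware.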
arXiv Detail & Related papers (2025-02-21T18:59:57Z) - Dressing the Imagination: A Dataset for AI-Powered Translation of Text into Fashion Outfits and A Novel KAN Adapter for Enhanced Feature Adaptation [2.3010373219231495]
We present FLORA, the first comprehensive dataset containing 4,330 curated pairs of fashion outfits and corresponding textual descriptions. As a second contribution, we introduce KAN Adapters, which leverage Kolmogorov-Arnold Networks (KAN) as adaptive modules. To foster further research and collaboration, we will open-source both FLORA and our implementation code.
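The KAN building block behind such adapters places a learnable univariate function on each input-output edge, rather than a scalar weight. The toy layer below uses cubic polynomials in place of the splines of full KANs, purely to illustrate the mechanism; the paper's actual adapter design may differ.

```python
import numpy as np

class TinyKANLayer:
    """Toy Kolmogorov-Arnold layer: each input-output edge carries its own
    learnable univariate function (here a cubic polynomial, standing in for
    the learnable splines used in full KANs)."""

    def __init__(self, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        # coeffs[k, i, j]: coefficient of x_i**k on the edge i -> j.
        self.coeffs = rng.normal(scale=0.1, size=(4, d_in, d_out))

    def __call__(self, x):
        # Evaluate every edge function and sum the contributions per output.
        powers = np.stack([x**k for k in range(4)])      # (4, d_in)
        return np.einsum('ki,kij->j', powers, self.coeffs)

layer = TinyKANLayer(d_in=6, d_out=3)
y = layer(np.ones(6))
```

As an adapter, such a layer would be inserted with a small bottleneck dimension and trained while the backbone stays frozen.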
arXiv Detail & Related papers (2024-11-21T07:27:45Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
ARMADA extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
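Triplet-based pre-training tasks like those above typically feed a margin loss over (anchor, positive, negative) embeddings. The sketch below shows only that generic loss on cosine similarities, not FaD-VLP's exact objective or how its weakly-supervised triplets are constructed.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss on cosine similarities: push the
    positive at least `margin` closer to the anchor than the negative."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))

a = np.array([1.0, 0.0])
# Easy triplet: positive near the anchor, negative opposite -> zero loss.
loss_easy = triplet_margin_loss(a, np.array([1.0, 0.1]), np.array([-1.0, 0.0]))
# Hard triplet: negative aligned with the anchor -> positive loss.
loss_hard = triplet_margin_loss(a, np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

In a fashion pre-training setup, the anchor and positive might be an image and its matching description, with a mismatched description as the negative.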
arXiv Detail & Related papers (2022-10-26T21:01:19Z) - Fashionformer: A simple, Effective and Unified Baseline for Human Fashion Segmentation and Recognition [80.74495836502919]
In this work, we focus on joint human fashion segmentation and attribute recognition.
We introduce the object query for segmentation and the attribute query for attribute prediction.
For the attribute stream, we design a novel Multi-Layer Rendering module to explore more fine-grained features.
arXiv Detail & Related papers (2022-04-10T11:11:10Z) - Multimodal Quasi-AutoRegression: Forecasting the visual popularity of new fashion products [18.753508811614644]
Trend detection in fashion is a challenging task due to the fast pace of change in the fashion industry.
We propose MuQAR, a multi-modal multi-layer perceptron processing categorical and visual features extracted by computer vision networks.
A comparative study on the VISUELLE dataset shows that MuQAR competes with and surpasses the domain's current state of the art by 2.88% in terms of WAPE and 3.04% in terms of MAE.
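WAPE and MAE, the two forecasting metrics cited above, are standard and easy to state in code. The toy demand series below is invented for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error: average absolute deviation per item.
    return np.mean(np.abs(y_true - y_pred))

def wape(y_true, y_pred):
    # Weighted absolute percentage error: total absolute error
    # normalized by total actual volume.
    return np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true))

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])
err_mae = mae(y_true, y_pred)    # (2 + 2 + 3) / 3
err_wape = wape(y_true, y_pred)  # (2 + 2 + 3) / 60
```

WAPE is often preferred over per-item percentage errors in retail forecasting because it is robust to items with near-zero sales.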
arXiv Detail & Related papers (2022-04-08T11:53:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.