LRVS-Fashion: Extending Visual Search with Referring Instructions
- URL: http://arxiv.org/abs/2306.02928v3
- Date: Wed, 15 May 2024 12:17:48 GMT
- Title: LRVS-Fashion: Extending Visual Search with Referring Instructions
- Authors: Simon Lepage, Jérémie Mary, David Picard
- Abstract summary: We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity.
We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs.
- Score: 13.590668564555195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a new challenge for image similarity search in the context of fashion, addressing the inherent ambiguity in this domain stemming from complex images. We present Referred Visual Search (RVS), a task allowing users to define more precisely the desired similarity, following recent interest in the industry. We release a new large public dataset, LRVS-Fashion, consisting of 272k fashion products with 842k images extracted from fashion catalogs, designed explicitly for this task. Unlike traditional visual search methods in the industry, we demonstrate that superior performance can be achieved by bypassing explicit object detection and adopting weakly-supervised conditional contrastive learning on image tuples. Our method is lightweight and demonstrates robustness, reaching a Recall@1 superior to strong detection-based baselines against 2M distractors. The dataset is available at https://huggingface.co/datasets/Slep/LAION-RVS-Fashion.
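For readers who want to inspect the data, a minimal sketch of loading LRVS-Fashion from the Hugging Face Hub with the `datasets` library is shown below; the split name and the inspected fields are assumptions made for illustration, so check the dataset card at https://huggingface.co/datasets/Slep/LAION-RVS-Fashion for the actual schema.

    # Minimal sketch (not from the paper): load LRVS-Fashion with Hugging Face `datasets`.
    # Requires `pip install datasets`; the split name below is an assumption.
    from datasets import load_dataset

    lrvs = load_dataset("Slep/LAION-RVS-Fashion", split="train")

    print(lrvs.num_rows)      # size of the chosen split
    print(lrvs.column_names)  # discover the actual fields (images, product ids, ...)
    print(lrvs[0])            # inspect one record

The abstract's central claim is that explicit object detection can be bypassed in favor of weakly-supervised conditional contrastive learning on image tuples. The toy loss below only illustrates what a conditional contrastive objective can look like (a query embedding combined with a conditioning embedding, trained with an InfoNCE loss against in-batch negatives); it is not the authors' implementation, and combining the condition by simple addition is an assumption of this sketch.

    import torch
    import torch.nn.functional as F

    def conditional_info_nce(query_emb, cond_emb, target_emb, temperature=0.07):
        # Combine the query image embedding with the conditioning embedding.
        # Simple addition is an assumption made for this sketch only.
        q = F.normalize(query_emb + cond_emb, dim=-1)
        t = F.normalize(target_emb, dim=-1)
        logits = q @ t.T / temperature                     # (B, B) similarity matrix
        labels = torch.arange(q.size(0), device=q.device)  # diagonal entries are the positives
        return F.cross_entropy(logits, labels)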
Related papers
- Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval [6.493562178111347]
We propose a cross-modal image-text retrieval framework based on object-aware query perturbation.
In our proposed method, object-aware cross-modal image-text retrieval is possible while keeping the rich expressive power and retrieval performance of existing V&L models without additional fine-tuning.
arXiv Detail & Related papers (2024-07-17T06:42:14Z) - Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects.
In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Learning (VIL).
A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z) - Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study on the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z) - ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of the two elements of the query (a reference image and a modifying text) into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z) - Contextual Similarity Aggregation with Self-attention for Visual Re-ranking [96.55393026011811]
We propose a visual re-ranking method by contextual similarity aggregation with self-attention.
We conduct comprehensive experiments on four benchmark datasets to demonstrate the generality and effectiveness of our proposed visual re-ranking method.
arXiv Detail & Related papers (2021-10-26T06:20:31Z) - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object referred to by a given natural language expression from an image.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z) - Region-level Active Learning for Cluttered Scenes [60.93811392293329]
We introduce a new strategy that subsumes previous Image-level and Object-level approaches into a generalized, Region-level approach.
We show that this approach significantly decreases labeling effort and improves rare object search on realistic data with inherent class-imbalance and cluttered scenes.
arXiv Detail & Related papers (2021-08-20T14:02:38Z) - Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models [41.7254780975984]
We extend the task of composed image retrieval, where an input query consists of an image and a short textual description of how to modify the image.
We propose CIRPLANT, a transformer based model that leverages rich pre-trained vision-and-language (V&L) knowledge for modifying visual features conditioned on natural language.
We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on the existing narrow datasets, such as fashion.
arXiv Detail & Related papers (2021-08-09T13:25:06Z) - Cross-Modal Retrieval Augmentation for Multi-Modal Classification [61.5253261560224]
We explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering.
First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement on image-caption retrieval.
Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines.
arXiv Detail & Related papers (2021-04-16T13:27:45Z) - Modality-Agnostic Attention Fusion for visual search with text feedback [5.650501970986438]
Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two visual search datasets.
We introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs.
To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words avoiding "attending" to the image region they refer to.
arXiv Detail & Related papers (2020-06-30T22:55:02Z) - A Strong Baseline for Fashion Retrieval with Person Re-Identification Models [0.0]
Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image.
We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results.
We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results.
arXiv Detail & Related papers (2020-03-09T12:50:15Z)