Training and challenging models for text-guided fashion image retrieval
- URL: http://arxiv.org/abs/2204.11004v1
- Date: Sat, 23 Apr 2022 06:24:23 GMT
- Title: Training and challenging models for text-guided fashion image retrieval
- Authors: Eric Dodds, Jack Culpepper, Gaurav Srivastava
- Abstract summary: We introduce a new evaluation dataset, Challenging Fashion Queries (CFQ).
CFQ complements existing benchmarks by including relative captions with positive and negative labels of caption accuracy and conditional image similarity.
We demonstrate the importance of multimodal pretraining for the task and show that domain-specific weak supervision based on attribute labels can augment generic large-scale pretraining.
- Score: 1.4266272677701561
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieving relevant images from a catalog based on a query image together
with a modifying caption is a challenging multimodal task that can particularly
benefit domains like apparel shopping, where fine details and subtle variations
may be best expressed through natural language. We introduce a new evaluation
dataset, Challenging Fashion Queries (CFQ), as well as a modeling approach that
achieves state-of-the-art performance on the existing Fashion IQ (FIQ) dataset.
CFQ complements existing benchmarks by including relative captions with
positive and negative labels of caption accuracy and conditional image
similarity, where others provided only positive labels with a combined meaning.
We demonstrate the importance of multimodal pretraining for the task and show
that domain-specific weak supervision based on attribute labels can augment
generic large-scale pretraining. While previous modality fusion mechanisms lose
the benefits of multimodal pretraining, we introduce a residual attention
fusion mechanism that improves performance. We release CFQ and our code to the
research community.
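The abstract does not spell out the fusion mechanism, so the following is a minimal, illustrative PyTorch sketch (not the authors' released implementation) of how a residual-attention-style fusion could combine a pretrained image embedding with caption token features and then rank a catalog by cosine similarity; the module names, dimensions, and pooling choices are assumptions made for illustration only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttentionFusion(nn.Module):
    """Hypothetical sketch: fuse a pretrained image embedding with caption
    token features via cross-attention, added as a residual so the pretrained
    image representation is preserved when the caption contributes little."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_emb: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # image_emb: (B, D) pooled image embedding from a pretrained encoder
        # text_tokens: (B, T, D) caption token features from a pretrained encoder
        query = image_emb.unsqueeze(1)                      # (B, 1, D)
        attended, _ = self.attn(query, text_tokens, text_tokens)
        fused = self.norm(image_emb + attended.squeeze(1))  # residual connection
        return F.normalize(fused, dim=-1)

def rank_catalog(query_emb: torch.Tensor, catalog_embs: torch.Tensor) -> torch.Tensor:
    """Rank catalog images by cosine similarity to the fused query embedding."""
    catalog_embs = F.normalize(catalog_embs, dim=-1)
    return (query_emb @ catalog_embs.T).argsort(dim=-1, descending=True)

# Toy usage with random features standing in for pretrained encoder outputs.
fusion = ResidualAttentionFusion()
image_emb = torch.randn(2, 512)         # query image embeddings
text_tokens = torch.randn(2, 16, 512)   # modifying-caption token features
catalog = torch.randn(100, 512)         # candidate catalog embeddings
ranking = rank_catalog(fusion(image_emb, text_tokens), catalog)
print(ranking.shape)  # torch.Size([2, 100])

The residual connection lets the fused query fall back to the pretrained image representation when the caption adds little, which is in the spirit of the abstract's claim that earlier fusion mechanisms lose the benefits of multimodal pretraining.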
Related papers
- Improved Few-Shot Image Classification Through Multiple-Choice Questions [1.4432605069307167]
We propose a simple method to boost VQA performance for image classification using only a handful of labeled examples and a multiple-choice question.
We demonstrate this method outperforms both pure visual encoders and zero-shot VQA baselines to achieve impressive performance on common few-shot tasks.
arXiv Detail & Related papers (2024-07-23T03:09:42Z)
- ATTIQA: Generalizable Image Quality Feature Extractor using Attribute-aware Pretraining [25.680035174334886]
In no-reference image quality assessment (NR-IQA), the challenge of limited dataset sizes hampers the development of robust and generalizable models.
We propose a novel pretraining framework that constructs a generalizable representation for IQA by selectively extracting quality-related knowledge.
Our approach achieves state-of-the-art performance on multiple IQA datasets and exhibits remarkable generalization capabilities.
arXiv Detail & Related papers (2024-06-03T06:03:57Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models [28.194638379354252]
We introduce a Depicted image Quality Assessment method (DepictQA), overcoming the constraints of traditional score-based methods.
DepictQA allows for detailed, language-based, human-like evaluation of image quality by leveraging Multi-modal Large Language Models.
These results showcase the research potential of multi-modal IQA methods.
arXiv Detail & Related papers (2023-12-14T14:10:02Z)
- Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text.
A Transformer-based visual-text aggregation module is further designed to incorporate complementary cross-modal and temporal information.
Experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
- Self-Supervised Open-Ended Classification with Small Visual Language Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z)
- Controllable Image Generation via Collage Representations [31.456445433105415]
"Mixing and matching scenes" (M&Ms) is an approach that consists of an adversarially trained generative image model conditioned on appearance features and spatial positions of the different elements in a collage.
We show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity.
arXiv Detail & Related papers (2023-04-26T17:58:39Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- Cross-Modal Retrieval Augmentation for Multi-Modal Classification [61.5253261560224]
We explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering.
First, we train a novel alignment model for embedding images and captions in the same space, which achieves substantial improvement on image-caption retrieval.
Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines.
arXiv Detail & Related papers (2021-04-16T13:27:45Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)