Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback
- URL: http://arxiv.org/abs/2106.04128v1
- Date: Tue, 8 Jun 2021 06:34:25 GMT
- Title: Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback
- Authors: Yifei Yuan and Wai Lam
- Abstract summary: We study the task of conversational fashion image retrieval via multiturn natural language feedback.
We propose a novel framework that can effectively handle conversational fashion image retrieval with multiturn natural language feedback texts.
- Score: 36.623221002330226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the task of conversational fashion image retrieval via multiturn
natural language feedback. Most previous studies are based on single-turn settings.
Existing models for multiturn conversational fashion image retrieval have limitations,
such as relying on traditional architectures, which leads to ineffective performance.
We propose a novel framework that can effectively handle conversational fashion image
retrieval with multiturn natural language feedback texts. One characteristic of the
framework is that it searches for candidate images by jointly exploiting the encoded
reference image, the feedback text, and the conversation history. Furthermore, image
fashion attribute information is leveraged via a mutual attention strategy. Since no
existing fashion dataset is suitable for the multiturn setting of our task, we derive
a large-scale multiturn fashion dataset through additional manual annotation on an
existing single-turn dataset. Experiments show that our proposed model significantly
outperforms existing state-of-the-art methods.
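To make the described retrieval flow concrete, the following is a minimal PyTorch-style sketch of a multiturn scorer that fuses reference-image regions with attribute embeddings via mutual (bidirectional cross) attention, rolls the per-turn states into a conversation-history encoding, and scores candidate images by cosine similarity. All module names, dimensions, and fusion details are illustrative assumptions, not the authors' exact architecture.

# Hypothetical sketch of a multiturn retrieval scorer with mutual attention
# over fashion attributes; names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualAttention(nn.Module):
    """Cross-attend image region features and attribute embeddings in both directions."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.img_to_attr = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attr_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_feats, attr_feats):
        # Image regions attend to attributes, and attributes attend to image regions.
        img_ctx, _ = self.img_to_attr(img_feats, attr_feats, attr_feats)
        attr_ctx, _ = self.attr_to_img(attr_feats, img_feats, img_feats)
        # Pool both views into one fused vector per example.
        return torch.cat([img_ctx.mean(dim=1), attr_ctx.mean(dim=1)], dim=-1)

class MultiturnRetrievalScorer(nn.Module):
    """Score candidates given the reference image, feedback text, and dialogue history."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mutual_attn = MutualAttention(dim)
        # GRU over per-turn (fused image/attribute, feedback) states encodes the history.
        self.history_rnn = nn.GRU(3 * dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, ref_img_feats, attr_feats, text_feats, candidate_feats):
        # ref_img_feats:   (B, T, R, D) region features of the reference image per turn
        # attr_feats:      (B, T, A, D) attribute embeddings per turn
        # text_feats:      (B, T, D)    encoded natural language feedback per turn
        # candidate_feats: (B, N, D)    global features of candidate images
        B, T = text_feats.shape[:2]
        turn_states = []
        for t in range(T):
            fused = self.mutual_attn(ref_img_feats[:, t], attr_feats[:, t])      # (B, 2D)
            turn_states.append(torch.cat([fused, text_feats[:, t]], dim=-1))     # (B, 3D)
        _, h = self.history_rnn(torch.stack(turn_states, dim=1))                 # (1, B, D)
        query = F.normalize(self.proj(h.squeeze(0)), dim=-1)                     # (B, D)
        cands = F.normalize(candidate_feats, dim=-1)                             # (B, N, D)
        return torch.einsum("bd,bnd->bn", query, cands)                          # cosine scores

In such a setup the query could be recomputed after every turn, so retrieval sharpens as feedback accumulates across the conversation.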
Related papers
- ENCLIP: Ensembling and Clustering-Based Contrastive Language-Image Pretraining for Fashion Multimodal Search with Limited Data and Low-Quality Images [1.534667887016089]
This paper presents ENCLIP, an innovative approach for enhancing the performance of the Contrastive Language-Image Pretraining (CLIP) model.
It focuses on addressing the challenges posed by limited data availability and low-quality images.
arXiv Detail & Related papers (2024-11-25T05:15:38Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
- A Multi-Modal Context Reasoning Approach for Conditional Inference on Joint Textual and Visual Clues [23.743431157431893]
Conditional inference on joint textual and visual clues is a multi-modal reasoning task.
We propose a Multi-modal Context Reasoning approach, named ModCR.
We conduct extensive experiments on two corresponding datasets, and the results show significantly improved performance.
arXiv Detail & Related papers (2023-05-08T08:05:40Z)
- Grounding Language Models to Images for Multimodal Inputs and Outputs [89.30027812161686]
We propose an efficient method to ground pretrained text-only language models to the visual domain.
We process arbitrarily interleaved image-and-text data, and generate text interleaved with retrieved images.
arXiv Detail & Related papers (2023-01-31T18:33:44Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality (CLIP image representations and the scaling of language models) do not consistently improve multimodal self-rationalization on tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Modality-Agnostic Attention Fusion for visual search with text feedback [5.650501970986438]
Our Modality-Agnostic Attention Fusion (MAAF) model combines image and text features and outperforms existing approaches on two visual search datasets.
We introduce two new challenging benchmarks adapted from Birds-to-Words and Spot-the-Diff, which provide new settings with rich language inputs.
To better understand our model, we conduct detailed ablations on Fashion IQ and provide visualizations of the surprising phenomenon of words avoiding "attending" to the image region they refer to.
arXiv Detail & Related papers (2020-06-30T22:55:02Z)
- FashionBERT: Text and Image Matching with Adaptive Loss for Cross-modal Retrieval [31.822218310945036]
FashionBERT learns high-level representations of texts and images.
FashionBERT achieves significant performance improvements over the baseline and state-of-the-art approaches.
arXiv Detail & Related papers (2020-05-20T00:41:00Z)