FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified
Retrieval and Captioning
- URL: http://arxiv.org/abs/2210.15028v1
- Date: Wed, 26 Oct 2022 21:01:19 GMT
- Title: FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified
Retrieval and Captioning
- Authors: Suvir Mirchandani, Licheng Yu, Mengjiao Wang, Animesh Sinha, Wenwen
Jiang, Tao Xiang, Ning Zhang
- Abstract summary: Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
- Score: 66.38951790650887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal tasks in the fashion domain have significant potential for
e-commerce, but involve challenging vision-and-language learning problems -
e.g., retrieving a fashion item given a reference image plus text feedback from
a user. Prior works on multimodal fashion tasks have either been limited by the
data in individual benchmarks, or have leveraged generic vision-and-language
pre-training but have not taken advantage of the characteristics of fashion
data. Additionally, these works have mainly been restricted to multimodal
understanding tasks. To address these gaps, we make two key contributions.
First, we propose a novel fashion-specific pre-training framework based on
weakly-supervised triplets constructed from fashion image-text pairs. We show
the triplet-based tasks are an effective addition to standard multimodal
pre-training tasks. Second, we propose a flexible decoder-based model
architecture capable of both fashion retrieval and captioning tasks. Together,
our model design and pre-training approach are competitive on a diverse set of
fashion tasks, including cross-modal retrieval, image retrieval with text
feedback, image captioning, relative image captioning, and multimodal
categorization.
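The composed-retrieval setting described in the abstract (retrieving a target fashion item from a reference image plus text feedback) can be pictured as a triplet-style contrastive objective. The snippet below is a minimal, hypothetical PyTorch sketch with made-up module and dimension names; it illustrates the general idea of fusing a reference-image embedding with a text-feedback embedding and contrasting the result against target-image embeddings, and is not the authors' implementation.

```python
# Hypothetical sketch: a triplet-based pre-training objective where a reference
# image fused with text feedback should retrieve the target image.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedRetrievalHead(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Fuses a reference-image embedding with a text-feedback embedding.
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ref_img_emb, text_emb, tgt_img_emb, temperature: float = 0.07):
        # Composed query: reference image + text feedback.
        query = self.fuse(torch.cat([ref_img_emb, text_emb], dim=-1))
        query = F.normalize(query, dim=-1)
        target = F.normalize(tgt_img_emb, dim=-1)
        # InfoNCE over the batch: each composed query should match its own target.
        logits = query @ target.t() / temperature
        labels = torch.arange(query.size(0), device=query.device)
        return F.cross_entropy(logits, labels)

# Usage with placeholder embeddings (in practice these come from image/text encoders).
head = ComposedRetrievalHead(dim=512)
ref, txt, tgt = (torch.randn(8, 512) for _ in range(3))
loss = head(ref, txt, tgt)
```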
Related papers
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
Leopard is a vision-language model for handling vision-language tasks involving multiple text-rich images.
First, we curated about one million high-quality multimodal instruction-tuning examples, tailored to text-rich, multi-image scenarios.
Second, we developed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
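The adaptive encoding idea can be illustrated as a token-budgeting step: a fixed visual token budget is split across the input images in proportion to their resolution. This is a hypothetical sketch with made-up function and parameter names, not Leopard's actual module.

```python
# Hypothetical sketch of adaptive visual-sequence-length allocation: split a
# fixed visual token budget across multiple images in proportion to their
# resolution, so text-rich, high-resolution pages keep more tokens.
from typing import List, Tuple

def allocate_visual_tokens(image_sizes: List[Tuple[int, int]],
                           total_budget: int,
                           min_per_image: int = 16) -> List[int]:
    areas = [w * h for (w, h) in image_sizes]
    total_area = sum(areas)
    # Proportional allocation with a floor so every image keeps some tokens.
    alloc = [max(min_per_image, int(total_budget * a / total_area)) for a in areas]
    # Trim if the floors pushed the total over budget.
    while sum(alloc) > total_budget:
        i = alloc.index(max(alloc))
        alloc[i] -= 1
    return alloc

# Example: three pages of different resolutions sharing a 1,024-token budget.
print(allocate_visual_tokens([(1024, 1024), (512, 512), (2048, 1024)], 1024))
```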
arXiv Detail & Related papers (2024-10-02T16:55:01Z)
- VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval [10.603148564713518]
We present a new embedding model VISTA for universal multi-modal retrieval.
We introduce a flexible architecture which extends a powerful text encoder with the image understanding capability.
Second, we develop two data generation strategies that provide high-quality composed image-text data to facilitate training of the embedding model.
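A minimal sketch of the "visualized text" idea, assuming a PyTorch setup: projected image patch features are prepended to the text token embeddings and encoded together by a single text-style transformer. All names and shapes below are illustrative assumptions, not VISTA's real API.

```python
# Hypothetical sketch: extend a text encoder with image understanding by
# prepending projected image patch features to the text token sequence.
import torch
import torch.nn as nn

class VisualizedTextEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, img_feat_dim=768, n_layers=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.img_proj = nn.Linear(img_feat_dim, dim)   # maps image patches into text space
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, token_ids, image_patches):
        text = self.tok_emb(token_ids)                 # (B, T, dim)
        img = self.img_proj(image_patches)             # (B, P, dim)
        seq = torch.cat([img, text], dim=1)            # image tokens first, then text
        hidden = self.encoder(seq)
        return hidden.mean(dim=1)                      # one multimodal embedding per input

model = VisualizedTextEncoder()
emb = model(torch.randint(0, 30522, (2, 16)), torch.randn(2, 49, 768))
```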
arXiv Detail & Related papers (2024-06-06T17:37:47Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
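The soft-masking step can be sketched as follows: word-conditional attention over region features identifies the regions relevant to a word, and those regions are softly down-weighted to produce harder, more diverse features for image-text matching. The snippet is a hypothetical illustration; the function name and masking rule are assumptions.

```python
# Hypothetical sketch of word-conditional soft-masking of image regions.
import torch
import torch.nn.functional as F

def soft_mask_regions(word_emb, region_feats, mask_strength=0.5):
    """word_emb: (B, D); region_feats: (B, R, D). Names and shapes are illustrative."""
    # Word-conditional attention over regions (scaled dot-product).
    scores = torch.einsum('bd,brd->br', word_emb, region_feats) / word_emb.size(-1) ** 0.5
    attn = F.softmax(scores, dim=-1)                       # (B, R)
    # Regions the word attends to most are suppressed the most.
    keep = 1.0 - mask_strength * attn
    return region_feats * keep.unsqueeze(-1)

masked = soft_mask_regions(torch.randn(4, 256), torch.randn(4, 36, 256))
```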
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [63.84720380390935]
Existing medical vision-and-language pre-training models fall into two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used.
We propose an effective yet straightforward scheme named PTUnifier to unify the two types.
We first unify the input format by introducing visual and textual prompts, which serve as a feature bank that stores the most representative images/texts.
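One way to picture this unification, as a rough and hypothetical sketch: when one modality is missing, learnable prompts retrieved from a pool stand in for it, so the same fusion encoder always receives both an image-side and a text-side sequence. The names and the retrieval rule below are assumptions, not PTUnifier's actual design details.

```python
# Hypothetical sketch of prompt-based input unification with a prompt "feature bank".
import torch
import torch.nn as nn

class PromptUnifier(nn.Module):
    def __init__(self, dim=256, pool_size=16, n_prompts=4):
        super().__init__()
        self.visual_pool = nn.Parameter(torch.randn(pool_size, n_prompts, dim))
        self.textual_pool = nn.Parameter(torch.randn(pool_size, n_prompts, dim))

    def forward(self, image_feats=None, text_feats=None):
        # Pick prompts whose (mean-pooled) pool entries best match the available modality.
        if image_feats is None:
            key = text_feats.mean(dim=1)                              # (B, D)
            sims = key @ self.visual_pool.mean(dim=1).t()             # (B, pool_size)
            image_feats = self.visual_pool[sims.argmax(dim=-1)]       # (B, n_prompts, D)
        if text_feats is None:
            key = image_feats.mean(dim=1)
            sims = key @ self.textual_pool.mean(dim=1).t()
            text_feats = self.textual_pool[sims.argmax(dim=-1)]
        return torch.cat([image_feats, text_feats], dim=1)            # unified input sequence

unifier = PromptUnifier()
seq = unifier(image_feats=torch.randn(2, 49, 256))                    # text side filled by prompts
```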
arXiv Detail & Related papers (2023-02-17T15:43:42Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
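The multi-view objective can be sketched as summing a standard InfoNCE loss over every pair of views, with the object-tag sequence acting as an extra textual view. The snippet below is a generic illustration, not ERNIE-ViL 2.0's exact formulation.

```python
# Hypothetical sketch of multi-view contrastive learning over image, caption,
# and object-tag embeddings.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    # Symmetric: a -> b and b -> a.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def multi_view_loss(img_emb, cap_emb, tag_emb):
    views = [img_emb, cap_emb, tag_emb]
    loss = 0.0
    for i in range(len(views)):
        for j in range(i + 1, len(views)):
            loss = loss + info_nce(views[i], views[j])   # every intra/inter-view pair
    return loss

loss = multi_view_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
```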
arXiv Detail & Related papers (2022-09-30T07:20:07Z)
- Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks [87.6494641931349]
We introduce a general-purpose multimodal foundation model BEiT-3.
It achieves state-of-the-art transfer performance on both vision and vision-language tasks.
arXiv Detail & Related papers (2022-08-22T16:55:04Z)
- On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
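The pseudo-labeled keyword prediction task can be sketched as multi-label classification: salient words from the paired caption act as pseudo labels and are predicted from the fused single-stream representation. The snippet is a hypothetical illustration with made-up names.

```python
# Hypothetical sketch of pseudo-labeled keyword prediction as a multi-label head.
import torch
import torch.nn as nn

class KeywordPredictionHead(nn.Module):
    def __init__(self, dim=256, keyword_vocab=1000):
        super().__init__()
        self.classifier = nn.Linear(dim, keyword_vocab)
        self.loss = nn.BCEWithLogitsLoss()

    def forward(self, fused_cls_emb, keyword_targets):
        # keyword_targets: multi-hot (B, keyword_vocab) built from caption words.
        return self.loss(self.classifier(fused_cls_emb), keyword_targets)

head = KeywordPredictionHead()
loss = head(torch.randn(4, 256), torch.randint(0, 2, (4, 1000)).float())
```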
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Conversational Fashion Image Retrieval via Multiturn Natural Language Feedback [36.623221002330226]
We study the task of conversational fashion image retrieval via multiturn natural language feedback.
We propose a novel framework that can effectively handle conversational fashion image retrieval with multiturn natural language feedback texts.
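A minimal sketch of the multiturn loop: the query embedding is refined turn by turn with the user's feedback and then matched against candidate images. The gated-sum update and all names are illustrative assumptions, not the proposed framework itself.

```python
# Hypothetical sketch of multiturn composed retrieval with natural-language feedback.
import torch
import torch.nn.functional as F

def multiturn_retrieve(start_img_emb, feedback_embs, candidate_embs, alpha=0.5):
    """start_img_emb: (D,), feedback_embs: list of (D,), candidate_embs: (N, D)."""
    query = start_img_emb
    for fb in feedback_embs:
        # Fold each turn's feedback into the running query (simple gated sum).
        query = F.normalize((1 - alpha) * query + alpha * fb, dim=-1)
    scores = F.normalize(candidate_embs, dim=-1) @ query
    return scores.argsort(descending=True)                 # ranked candidate indices

ranking = multiturn_retrieve(torch.randn(256),
                             [torch.randn(256) for _ in range(3)],
                             torch.randn(100, 256))
```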
arXiv Detail & Related papers (2021-06-08T06:34:25Z)