FashionViL: Fashion-Focused Vision-and-Language Representation Learning
- URL: http://arxiv.org/abs/2207.08150v1
- Date: Sun, 17 Jul 2022 12:06:27 GMT
- Title: FashionViL: Fashion-Focused Vision-and-Language Representation Learning
- Authors: Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, Tao Xiang
- Abstract summary: We propose a novel fashion-focused Vision-and-Language (V+L) representation learning framework, dubbed FashionViL.
It contains two novel fashion-specific pre-training tasks designed specifically to exploit two intrinsic attributes of fashion V+L data.
Extensive experiments show that our FashionViL achieves a new state of the art across five downstream tasks.
- Score: 129.49630356651454
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale Vision-and-Language (V+L) pre-training for representation
learning has proven to be effective in boosting various downstream V+L tasks.
However, when it comes to the fashion domain, existing V+L methods are
inadequate as they overlook the unique characteristics of both the fashion V+L
data and downstream tasks. In this work, we propose a novel fashion-focused V+L
representation learning framework, dubbed FashionViL. It contains two novel
fashion-specific pre-training tasks designed specifically to exploit two
intrinsic attributes of fashion V+L data. First, in contrast to other domains
where a V+L data point contains only a single image-text pair, a fashion data
point often contains multiple images. We thus propose a Multi-View Contrastive
Learning task that pulls the visual representation of one image closer to the
compositional multimodal representation of another image+text. Second, fashion
text (e.g., product description) often contains rich fine-grained concepts
(attributes/noun phrases). To exploit this, a Pseudo-Attributes Classification
task is introduced to encourage the learned unimodal (visual/textual)
representations of the same concept to be adjacent. Further, fashion V+L tasks
uniquely include ones that do not conform to the common one-stream or
two-stream architectures (e.g., text-guided image retrieval). We thus propose a
versatile V+L model architecture built around a modality-agnostic
Transformer so that it can be flexibly adapted to any downstream task.
Extensive experiments show that our FashionViL achieves a new state of the art
across five downstream tasks. Code is available at
https://github.com/BrandonHanx/mmf.
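For a concrete picture of the Multi-View Contrastive Learning task described in the abstract, the snippet below is a minimal sketch of an InfoNCE-style loss between the visual embedding of one product image and the fused embedding of another image of the same product plus its description. The function name, tensor shapes, and temperature are illustrative assumptions, not the paper's actual implementation (see the linked mmf repository for that).

```python
# Minimal sketch of a multi-view contrastive (InfoNCE-style) loss, assuming the
# visual embedding of one image view and the fused embedding of another image
# view + text are already computed. Names and shapes are illustrative only.
import torch
import torch.nn.functional as F


def multi_view_contrastive_loss(visual_emb: torch.Tensor,
                                fused_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """visual_emb, fused_emb: (batch, dim) embeddings of matched products."""
    # L2-normalise so the dot product becomes a cosine similarity.
    v = F.normalize(visual_emb, dim=-1)
    m = F.normalize(fused_emb, dim=-1)

    # Pairwise similarities; the diagonal holds the positive (same-product) pairs.
    logits = v @ m.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)

    # Symmetric InfoNCE: visual-to-multimodal and multimodal-to-visual.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    visual = torch.randn(8, 256)   # e.g. features of image view 1
    fused = torch.randn(8, 256)    # e.g. fused features of image view 2 + text
    print(multi_view_contrastive_loss(visual, fused))
```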
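Similarly, a minimal sketch of the Pseudo-Attributes Classification idea: attributes are mined from product descriptions (here via naive keyword matching against a toy vocabulary, purely for illustration) and used as multi-label targets for both the visual and the textual embedding, so samples sharing a concept are pushed towards the same region of the embedding space. The vocabulary, helper names, and shared classifier head are assumptions for the sketch.

```python
# Minimal sketch of pseudo-attribute classification, assuming attributes are
# mined from product descriptions and used as multi-label targets for both the
# visual and the textual embeddings. Vocabulary and matching are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

ATTRIBUTE_VOCAB = ["denim", "floral", "long-sleeve", "v-neck", "slim-fit"]


def pseudo_labels(description: str) -> torch.Tensor:
    """Multi-hot vector marking which vocabulary attributes appear in the text."""
    text = description.lower()
    return torch.tensor([float(attr in text) for attr in ATTRIBUTE_VOCAB])


class PseudoAttributeHead(nn.Module):
    """Shared classifier applied to both visual and textual embeddings."""
    def __init__(self, dim: int, num_attrs: int):
        super().__init__()
        self.fc = nn.Linear(dim, num_attrs)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.fc(emb)  # logits over the pseudo-attribute vocabulary


if __name__ == "__main__":
    head = PseudoAttributeHead(dim=256, num_attrs=len(ATTRIBUTE_VOCAB))
    visual_emb = torch.randn(2, 256)   # unimodal visual embeddings
    text_emb = torch.randn(2, 256)     # unimodal textual embeddings
    targets = torch.stack([
        pseudo_labels("blue denim jacket with long-sleeve cut"),
        pseudo_labels("floral v-neck summer dress"),
    ])
    # Multi-label BCE on both modalities encourages same-attribute samples
    # to land near each other in the shared embedding space.
    loss = (F.binary_cross_entropy_with_logits(head(visual_emb), targets) +
            F.binary_cross_entropy_with_logits(head(text_emb), targets))
    print(loss)
```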
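Finally, a minimal sketch of how a single modality-agnostic Transformer can be reused either as a unimodal encoder (two-stream style) or as an early-fusion encoder (one-stream style, e.g. for text-guided image retrieval). Class and method names are illustrative; the actual FashionViL architecture is defined in the linked repository.

```python
# Minimal sketch of a modality-agnostic Transformer reused in different modes,
# assuming token embeddings per modality come from lightweight embedders.
import torch
import torch.nn as nn


class ModalityAgnosticEncoder(nn.Module):
    """One shared Transformer that encodes a single modality or a fused pair."""
    def __init__(self, dim: int = 256, heads: int = 4, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, *token_seqs: torch.Tensor) -> torch.Tensor:
        # Concatenating sequences gives early fusion (one-stream mode);
        # passing one sequence gives a unimodal encoder (two-stream mode).
        tokens = torch.cat(token_seqs, dim=1)
        return self.encoder(tokens)


if __name__ == "__main__":
    enc = ModalityAgnosticEncoder()
    img_tokens = torch.randn(2, 49, 256)   # e.g. image patch features
    txt_tokens = torch.randn(2, 16, 256)   # e.g. word features

    unimodal_image = enc(img_tokens)        # image-only encoding
    fused = enc(img_tokens, txt_tokens)     # early-fused encoding, e.g. for
                                            # text-guided image retrieval
    print(unimodal_image.shape, fused.shape)
```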
Related papers
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for learning representations of an increasing number of modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
- DeViL: Decoding Vision features into Language [53.88202366696955]
Post-hoc explanation methods have often been criticised for abstracting away the decision-making process of deep neural networks.
In this work, we would like to provide natural language descriptions for what different layers of a vision backbone have learned.
We train a transformer network to translate individual image features of any vision layer into a prompt that a separate off-the-shelf language model decodes into natural language.
arXiv Detail & Related papers (2023-09-04T13:59:55Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
- FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z)
- ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training [40.05046655477684]
ERNIE-ViL 2.0 is a Multi-View Contrastive learning framework to build intra-modal and inter-modal correlations between diverse views simultaneously.
We construct sequences of object tags as a special textual view to narrow the cross-modal semantic gap on noisy image-text pairs.
ERNIE-ViL 2.0 achieves competitive results on English cross-modal retrieval.
arXiv Detail & Related papers (2022-09-30T07:20:07Z)