Masked Vision-Language Transformer in Fashion
- URL: http://arxiv.org/abs/2210.15110v1
- Date: Thu, 27 Oct 2022 01:44:08 GMT
- Title: Masked Vision-Language Transformer in Fashion
- Authors: Ge-Peng Ji, Mingcheng Zhuge, Dehong Gao, Deng-Ping Fan, Christos
Sakaridis, Luc Van Gool
- Abstract summary: Masked vision-language transformer (MVLT) for fashion-specific multi-modal representation.
MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models.
More importantly, MVLT can easily generalize to various matching and generative tasks.
- Score: 85.6143169850834
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a masked vision-language transformer (MVLT) for fashion-specific
multi-modal representation. Technically, we simply adopt the vision transformer
architecture in place of BERT in the pre-training model, making MVLT the
first end-to-end framework for the fashion domain. In addition, we design masked
image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is
an extensible and convenient architecture that admits raw multi-modal inputs
without extra pre-processing models (e.g., ResNet), implicitly modeling the
vision-language alignments. More importantly, MVLT can easily generalize to
various matching and generative tasks. Experimental results show clear
improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks
over the Fashion-Gen 2018 winner Kaleido-BERT. Code is made available at
https://github.com/GewelsJI/MVLT.
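The pre-training recipe can be pictured with a minimal PyTorch sketch: raw image patches and word-piece ids share a single Transformer encoder, with a masked-language-modeling head on the text side and a masked image reconstruction head on the vision side. All module names, sizes, and the loss wiring below are illustrative assumptions, not the released MVLT implementation.

```python
# Minimal sketch of masked vision-language pre-training in the spirit of MVLT:
# raw image patches and word-piece ids share a single Transformer encoder, with
# a masked-language-modeling (MLM) head on text tokens and a masked image
# reconstruction (MIR) head that regresses the pixels of masked patches.
import torch
import torch.nn as nn


class MaskedVisionLanguageEncoder(nn.Module):
    def __init__(self, vocab_size=30522, dim=256, depth=6, heads=8,
                 image_size=224, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size

        self.patch_embed = nn.Linear(patch_dim, dim)      # raw patches, no ResNet features
        self.token_embed = nn.Embedding(vocab_size, dim)  # word-piece ids
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))

        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        self.mlm_head = nn.Linear(dim, vocab_size)  # predict masked word pieces
        self.mir_head = nn.Linear(dim, patch_dim)   # reconstruct masked patch pixels

    def patchify(self, images):
        # (B, 3, H, W) -> (B, N, 3*P*P) flattened non-overlapping patches
        p = self.patch_size
        b, c, h, w = images.shape
        x = images.unfold(2, p, p).unfold(3, p, p)           # B, C, H/p, W/p, p, p
        return x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

    def forward(self, images, token_ids):
        patches = self.patchify(images)
        v = self.patch_embed(patches) + self.pos_embed
        t = self.token_embed(token_ids)                      # text positions omitted for brevity
        x = self.encoder(torch.cat([v, t], dim=1))           # joint vision-language encoding
        v_out, t_out = x[:, :v.size(1)], x[:, v.size(1):]
        return self.mir_head(v_out), self.mlm_head(t_out), patches


# Pre-training losses; restricting them to randomly masked positions is omitted here.
model = MaskedVisionLanguageEncoder()
images, token_ids = torch.randn(2, 3, 224, 224), torch.randint(0, 30522, (2, 32))
recon, logits, target_patches = model(images, token_ids)
mir_loss = nn.functional.mse_loss(recon, target_patches)                            # MIR
mlm_loss = nn.functional.cross_entropy(logits.flatten(0, 1), token_ids.flatten())   # MLM
loss = mir_loss + mlm_loss
```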
Related papers
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models [92.36510016591782]
We present a method that distills a pretrained Transformer architecture into alternative architectures such as state space models (SSMs).
Our method, called MOHAWK, is able to distill a Mamba-2 variant based on the Phi-1.5 architecture using only 3B tokens and a hybrid version (Hybrid Phi-Mamba) using 5B tokens.
Despite using less than 1% of the training data typically used to train models from scratch, Phi-Mamba boasts substantially stronger performance compared to all past open-source non-Transformer models.
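As a rough illustration of the distillation setting (not MOHAWK's actual staged procedure or its Mamba-2-specific alignment), a hedged sketch of matching a subquadratic student layer's hidden states to a frozen Transformer teacher layer might look as follows; the student module here is only a placeholder.

```python
# Generic layer-wise distillation sketch: align a subquadratic student mixer's
# hidden states with a frozen pretrained Transformer teacher layer on the same
# inputs. This only illustrates the high-level setting; MOHAWK's actual staged
# procedure and Mamba-2-specific alignment are not reproduced here.
import torch
import torch.nn as nn

dim, seq_len, batch = 512, 128, 4

teacher_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
teacher_layer.eval()
for p in teacher_layer.parameters():
    p.requires_grad_(False)


class StudentMixer(nn.Module):
    """Placeholder subquadratic sequence mixer (stand-in for an SSM block)."""

    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, x):
        out, _ = self.rnn(x)
        return out


student = StudentMixer(dim)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

x = torch.randn(batch, seq_len, dim)        # hidden states fed to both layers
with torch.no_grad():
    target = teacher_layer(x)               # teacher output to imitate
loss = nn.functional.mse_loss(student(x), target)
loss.backward()
opt.step()
```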
arXiv Detail & Related papers (2024-08-19T17:48:11Z)
- VisionLLaMA: A Unified LLaMA Backbone for Vision Tasks [60.22144823791902]
We unveil a LLaMA-like vision transformer in plain and pyramid forms, termed VisionLLaMA, which is tailored to vision tasks.
VisionLLaMA is a unified and generic modelling framework for solving most vision tasks.
arXiv Detail & Related papers (2024-03-01T13:30:51Z)
- Holistically Explainable Vision Transformers [136.27303006772294]
We propose B-cos transformers, which inherently provide holistic explanations for their decisions.
Specifically, we formulate each model component - such as the multi-layer perceptrons, attention layers, and the tokenisation module - to be dynamic linear.
We apply our proposed design to Vision Transformers (ViTs) and show that the resulting models, dubbed Bcos-ViTs, are highly interpretable and perform competitively with baseline ViTs.
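A hedged sketch of a single "dynamic linear" B-cos unit follows; the full Bcos-ViT (its attention and tokenisation variants) is not reproduced, and the initialization and exponent below are illustrative.

```python
# Hedged sketch of a single "dynamic linear" B-cos unit: the effective weight
# depends on the input through its cosine alignment with each (unit-norm) row,
#   out_j = |cos(x, w_j)|^(B-1) * (w_j^T x).
# The full Bcos-ViT (attention and tokenisation variants) is not reproduced.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BcosLinear(nn.Module):
    def __init__(self, in_features, out_features, b=2.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.b = b

    def forward(self, x):
        w = F.normalize(self.weight, dim=1)                    # unit-norm rows
        linear = F.linear(x, w)                                # w_j^T x
        x_norm = x.norm(dim=-1, keepdim=True).clamp_min(1e-6)
        cos = linear / x_norm                                  # cosine alignment
        return cos.abs().pow(self.b - 1) * linear              # input-dependent (dynamic) scaling


layer = BcosLinear(64, 16)
y = layer(torch.randn(8, 64))   # (8, 16); gradients flow through the alignment term
```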
arXiv Detail & Related papers (2023-01-20T16:45:34Z)
- Masked Vision-Language Transformers for Scene Text Recognition [10.057137581956363]
Scene text recognition (STR) enables computers to recognize and read the text in various real-world scenes.
Recent STR models benefit from taking linguistic information in addition to visual cues into consideration.
We propose a novel Masked Vision-Language Transformer (MVLT) to capture both the explicit and the implicit linguistic information.
arXiv Detail & Related papers (2022-11-09T10:28:23Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
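A schematic of the direct box-regression idea: visual and language tokens are fused by a Transformer together with a learnable regression token whose output is mapped to a normalized (cx, cy, w, h) box. Dimensions and the exact token layout are illustrative assumptions, not the TransVG++ implementation.

```python
# Schematic of Transformer-based grounding with direct box regression: visual
# and language tokens are fused together with a learnable regression token,
# whose output is mapped to a normalized (cx, cy, w, h) box. Sizes and the
# token layout are illustrative assumptions, not the TransVG++ implementation.
import torch
import torch.nn as nn


class GroundingBoxRegressor(nn.Module):
    def __init__(self, dim=256, depth=6, heads=8):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=depth)
        self.box_head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4))

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, dim) from a ViT; text_tokens: (B, Nt, dim) from a text encoder
        reg = self.reg_token.expand(visual_tokens.size(0), -1, -1)
        x = self.fusion(torch.cat([reg, visual_tokens, text_tokens], dim=1))
        return self.box_head(x[:, 0]).sigmoid()   # normalized (cx, cy, w, h)


model = GroundingBoxRegressor()
boxes = model(torch.randn(2, 196, 256), torch.randn(2, 20, 256))   # -> (2, 4)
```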
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in masked image modeling (MIM).
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- On Vision Features in Multimodal Machine Translation [34.41229863267296]
We develop a selective attention model to study the patch-level contribution of an image in multimodal machine translation.
Our results suggest the need to carefully examine MMT models, especially when current benchmarks are small-scale and biased.
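A hedged sketch of patch-level selective attention: text states query image patch features via cross-attention, and a learned gate decides how much of the attended visual context is injected back. The gating and fusion details below are illustrative, not the paper's exact design.

```python
# Hedged sketch of patch-level selective attention for multimodal MT: text
# states query image patch features via cross-attention, and a learned gate
# decides how much attended visual context is mixed back into the text
# representation. Gate and fusion details are illustrative, not the paper's.
import torch
import torch.nn as nn


class SelectiveVisualAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, text_states, patch_feats):
        # text_states: (B, T, dim); patch_feats: (B, N, dim) from a vision encoder
        visual_ctx, attn = self.cross_attn(text_states, patch_feats, patch_feats)
        g = torch.sigmoid(self.gate(torch.cat([text_states, visual_ctx], dim=-1)))
        return text_states + g * visual_ctx, attn   # attn exposes patch-level contribution


layer = SelectiveVisualAttention()
fused, attn = layer(torch.randn(2, 15, 512), torch.randn(2, 196, 512))
```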
arXiv Detail & Related papers (2022-03-17T08:51:09Z)
- VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [46.55920956687346]
We present a unified Vision-Language pretrained Model (VLMo) that jointly learns a dual encoder and a fusion encoder with a modular Transformer network.
Because of the modeling flexibility of MoME, pretrained VLMo can be fine-tuned as a fusion encoder for vision-language classification tasks.
We propose a stagewise pre-training strategy, which effectively leverages large-scale image-only and text-only data besides image-text pairs.
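A sketch of a Mixture-of-Modality-Experts block in this spirit: self-attention is shared across modalities while the feed-forward expert is switched per input modality. Sizes and the routing interface below are illustrative assumptions, not the VLMo implementation.

```python
# Sketch of a Mixture-of-Modality-Experts (MoME) block in this spirit:
# self-attention is shared across modalities, while the feed-forward expert is
# switched per input modality (vision, language, or fused vision-language).
# Sizes and the routing interface are illustrative assumptions.
import torch
import torch.nn as nn


def ffn(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))


class MoMEBlock(nn.Module):
    def __init__(self, dim=768, heads=12, hidden=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # modality-shared
        self.norm2 = nn.LayerNorm(dim)
        self.experts = nn.ModuleDict({"vision": ffn(dim, hidden),
                                      "language": ffn(dim, hidden),
                                      "vl": ffn(dim, hidden)})            # modality-specific FFNs

    def forward(self, x, modality):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]                    # shared self-attention
        x = x + self.experts[modality](self.norm2(x))    # pick the expert for this modality
        return x


block = MoMEBlock()
img_tokens = block(torch.randn(2, 196, 768), modality="vision")
pair_tokens = block(torch.randn(2, 230, 768), modality="vl")
```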
arXiv Detail & Related papers (2021-11-03T17:20:36Z)