ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- URL: http://arxiv.org/abs/2102.03334v1
- Date: Fri, 5 Feb 2021 18:36:11 GMT
- Title: ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision
- Authors: Wonjae Kim, Bokyung Son, Ildoo Kim
- Abstract summary: We present a minimal Vision-and-Language Transformer (ViLT) model for vision-and-language downstream tasks.
ViLT is monolithic in the sense that the processing of visual inputs is drastically simplified to the same convolution-free manner in which textual inputs are processed.
- Score: 10.584604416749965
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-and-Language Pretraining (VLP) has improved performance on various
joint vision-and-language downstream tasks. Current approaches for VLP heavily
rely on image feature extraction processes, most of which involve region
supervisions (e.g., object detection) and the convolutional architecture (e.g.,
ResNet). Although disregarded in the literature, we find this problematic in
terms of both (1) efficiency/speed, in that simply extracting input features
requires much more computation than the actual multimodal interaction steps;
and (2) expressive power, since it is upper-bounded by the expressive power of
the visual encoder and its predefined visual vocabulary. In this paper, we
present a minimal VLP model, Vision-and-Language Transformer (ViLT), monolithic
in the sense that the processing of visual inputs is drastically simplified to
the same convolution-free manner in which we process textual inputs. We show
that ViLT
is up to 60 times faster than previous VLP models, yet with competitive or
better downstream task performance.
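To make the single-stream design concrete, the sketch below shows a minimal ViLT-style model in PyTorch: image patches are flattened and linearly projected (no CNN backbone, no region detector), given positional and modality-type embeddings, concatenated with word embeddings, and passed through one shared transformer encoder. All dimensions, names, and the use of torch.nn.TransformerEncoder are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MinimalViLT(nn.Module):
    """Single-stream vision-and-language transformer sketch (illustrative only)."""
    def __init__(self, vocab_size=30522, dim=768, depth=6, heads=12,
                 patch=32, img_size=224, max_text_len=40):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        self.patch = patch
        # Convolution-free visual embedding: flatten each patch, project linearly.
        self.patch_proj = nn.Linear(3 * patch * patch, dim)
        self.word_emb = nn.Embedding(vocab_size, dim)
        # Positional and modality-type embeddings.
        self.text_pos = nn.Parameter(torch.randn(1, max_text_len, dim) * 0.02)
        self.img_pos = nn.Parameter(torch.randn(1, num_patches, dim) * 0.02)
        self.type_emb = nn.Embedding(2, dim)  # 0 = text, 1 = image
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, token_ids, image):
        B = image.size(0)
        # (B, 3, H, W) -> (B, num_patches, 3*patch*patch), no convolution involved.
        patches = image.unfold(2, self.patch, self.patch) \
                       .unfold(3, self.patch, self.patch)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, 3 * self.patch ** 2)
        img_tokens = self.patch_proj(patches) + self.img_pos
        txt_tokens = self.word_emb(token_ids) + self.text_pos[:, :token_ids.size(1)]
        txt_tokens = txt_tokens + self.type_emb(torch.zeros_like(token_ids))
        img_tokens = img_tokens + self.type_emb(
            torch.ones(B, img_tokens.size(1), dtype=torch.long))
        # One shared encoder handles the concatenated multimodal sequence.
        return self.encoder(torch.cat([txt_tokens, img_tokens], dim=1))

tokens = torch.randint(0, 30522, (2, 40))
images = torch.randn(2, 3, 224, 224)
out = MinimalViLT()(tokens, images)   # (2, 40 + 49, 768)
```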
Related papers
- Progressive Multi-modal Conditional Prompt Tuning [92.50645776024624]
Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting.
We propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT)
ProMPT uses a recurrent structure that iteratively optimizes and aligns vision-language (V-L) features using the image and the current encoding information.
arXiv Detail & Related papers (2024-04-18T02:40:31Z)
- Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection [66.72992463712299]
Vision Transformers (ViTs) have become increasingly popular in large-scale Vision and Language Pre-training models.
Previous research has demonstrated the efficacy of ViTs, but they still struggle with computational inefficiencies caused by lengthy visual sequences.
We introduce TRIPS, which reduces the visual sequence using a text-guided patch-selection layer in the visual backbone.
Our experimental results reveal that TRIPS delivers a 40% speedup, while maintaining competitive or superior performance on downstream tasks.
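A rough sketch of the text-guided patch-selection idea (not TRIPS's actual layer): score each visual token against a pooled text vector and keep only the top-scoring patches, so later layers see a shorter visual sequence. The function name, the dot-product score, and the keep ratio below are illustrative assumptions.

```python
import torch

def select_text_relevant_patches(patch_tokens, text_pooled, keep_ratio=0.6):
    """Keep the patch tokens most similar to a pooled text vector (sketch).

    patch_tokens: (B, N, D) visual tokens; text_pooled: (B, D).
    Returns a shortened sequence of shape (B, k, D) with k ~ keep_ratio * N.
    """
    scores = torch.einsum("bnd,bd->bn", patch_tokens, text_pooled)  # per-patch similarity
    k = max(1, int(round(keep_ratio * patch_tokens.size(1))))
    top = scores.topk(k, dim=1).indices                             # text-relevant patches
    idx = top.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    return patch_tokens.gather(1, idx)

patches = torch.randn(2, 196, 768)
text_vec = torch.randn(2, 768)
reduced = select_text_relevant_patches(patches, text_vec)  # (2, 118, 768)
```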
arXiv Detail & Related papers (2024-01-11T14:31:30Z)
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- CAVL: Learning Contrastive and Adaptive Representations of Vision and Language [10.57079240576682]
Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to require substantial computational resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
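The contrastive part of such approaches can be illustrated with a generic symmetric image-text contrastive loss; the sketch below is a common InfoNCE-style formulation and is not claimed to be CAVL's exact objective.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Generic image-text contrastive loss (illustrative, not CAVL's exact form).

    img_feats, txt_feats: (B, D) pooled features for matched image-text pairs.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0))        # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = symmetric_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```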
arXiv Detail & Related papers (2023-04-10T05:54:03Z)
- TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
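A minimal illustration of the "homogeneous blocks on raw inputs" idea, assuming linear patch projections for both images and audio spectrograms feeding one shared encoder; the sizes and modules are illustrative, not TVLT's configuration.

```python
import torch
import torch.nn as nn

# Illustrative only: embed raw image patches and audio-spectrogram patches with the
# same linear-projection recipe and feed both into a single shared encoder.
dim, img_patch, aud_patch = 768, 16, 16
img_proj = nn.Linear(3 * img_patch * img_patch, dim)           # RGB patches
aud_proj = nn.Linear(aud_patch * aud_patch, dim)               # spectrogram patches

img_patches = torch.randn(2, 196, 3 * img_patch * img_patch)   # e.g. a 224x224 image
aud_patches = torch.randn(2, 128, aud_patch * aud_patch)       # e.g. a mel-spectrogram
tokens = torch.cat([img_proj(img_patches), aud_proj(aud_patches)], dim=1)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True), num_layers=4)
fused = encoder(tokens)   # (2, 196 + 128, 768), no text or ASR involved
```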
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
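A loose sketch of the distillation idea: pull the trainable PLM's sentence features toward those of a frozen vision-language text encoder, so the PLM inherits the shared multimodal space. The cosine-alignment loss and the names below are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def vlkd_alignment_loss(plm_feats, vl_text_feats):
    """Illustrative distillation objective: align the trainable PLM's sentence
    features with a frozen vision-language text encoder's features."""
    student = F.normalize(plm_feats, dim=-1)
    teacher = F.normalize(vl_text_feats.detach(), dim=-1)   # teacher is frozen
    return (1.0 - (student * teacher).sum(dim=-1)).mean()   # cosine-distance loss

loss = vlkd_alignment_loss(torch.randn(4, 512, requires_grad=True),
                           torch.randn(4, 512))
```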
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because their local receptive fields are weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
- E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning [31.622393984150314]
We propose the first end-to-end vision-language pre-trained model for both V+L understanding and generation.
We build a unified Transformer framework to jointly learn visual representations and semantic alignments between image and text.
arXiv Detail & Related papers (2021-06-03T12:50:26Z)