VAuLT: Augmenting the Vision-and-Language Transformer with the
Propagation of Deep Language Representations
- URL: http://arxiv.org/abs/2208.09021v1
- Date: Thu, 18 Aug 2022 18:51:13 GMT
- Title: VAuLT: Augmenting the Vision-and-Language Transformer with the
Propagation of Deep Language Representations
- Authors: Georgios Chochlakis, Tejas Srinivasan, Jesse Thomason, Shrikanth
Narayanan (University of Southern California)
- Abstract summary: We propose the Vision-and-Augmented-Language Transformer (VAuLT).
VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language tasks.
We show that such a strategy significantly improves over ViLT on vision-and-language tasks involving richer language inputs.
- Score: 6.405005247717135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an
extension of the popular Vision-and-Language Transformer (ViLT), and improves
performance on vision-and-language tasks that involve more complex text inputs
than image captions while having minimal impact on training and inference
efficiency. Importantly, ViLT achieves efficient training and inference on
vision-and-language tasks by using a shallow image encoder. However, it is
pretrained on captioning and similar datasets, where the language input is
simple, literal, and descriptive, and therefore lacks linguistic diversity. As
a result, when working with multimedia data in the wild, such as multimodal
social media data (in our work, Twitter), the language shifts markedly away
from captioning data and the tasks become more diverse, and we indeed find
evidence that it is the language capacity of ViLT that falls short. The key
insight of VAuLT (sketched in code after the abstract) is
to propagate the output representations of a large language model like BERT to
the language input of ViLT. We show that such a strategy significantly improves
over ViLT on vision-and-language tasks involving richer language inputs and
affective constructs, such as TWITTER-2015, TWITTER-2017, MVSA-Single and
MVSA-Multiple, but lags behind on pure reasoning tasks such as the Bloomberg
Twitter Text-Image Relationship dataset. We have released the code for all our
experiments at https://github.com/gchochla/VAuLT.
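Below is a minimal sketch of the key insight described in the abstract: run the text through a deep language model and propagate its output representations to ViLT's language input. This is not the released implementation (see the repository linked above); it assumes the HuggingFace transformers checkpoints for BERT and ViLT, and handing ViLT precomputed text embeddings via an inputs_embeds argument is an assumption about that interface made for this sketch.

    # Sketch of the VAuLT idea: BERT's final hidden states replace ViLT's
    # own shallow text embeddings as the language input.
    import torch
    from PIL import Image
    from transformers import BertModel, ViltModel, ViltProcessor

    processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
    bert = BertModel.from_pretrained("bert-base-uncased")      # deep language encoder
    vilt = ViltModel.from_pretrained("dandelin/vilt-b32-mlm")  # vision-and-language encoder

    def vault_forward(text: str, image: Image.Image) -> torch.Tensor:
        inputs = processor(images=image, text=text, return_tensors="pt")

        # Deep contextual representations of the text. BERT-base and ViLT-base
        # share a 768-dimensional hidden size, so no projection is needed here.
        bert_out = bert(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
        ).last_hidden_state

        # Propagate the BERT outputs to ViLT as its language input. Passing
        # them via inputs_embeds is an assumption about the ViLT interface
        # made for this sketch, not a description of the released code.
        outputs = vilt(
            inputs_embeds=bert_out,
            attention_mask=inputs["attention_mask"],
            pixel_values=inputs["pixel_values"],
        )
        return outputs.pooler_output  # pooled multimodal representation

A task-specific classification head (e.g., for the Twitter datasets mentioned above) would sit on top of the pooled multimodal representation; the two encoders can then be trained end-to-end or the language model kept frozen, depending on the setup.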
Related papers
- Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods mainly focus on aligning vision encoders with Multimodal Large Language Models (MLLMs).
We introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level.
Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
- Stop Pre-Training: Adapt Visual-Language Models to Unseen Languages [3.3227703089509304]
We propose a simple yet efficient approach to adapt Vision-Language Pre-training to unseen languages using a multilingual pretrained language model (MPLM).
Our approach does not require image input and primarily uses machine translation, eliminating the need for target language data.
arXiv Detail & Related papers (2023-06-29T08:20:57Z)
- Augmented Transformers with Adaptive n-grams Embedding for Multilingual Scene Text Recognition [10.130342722193204]
This paper proposes an augmented transformer architecture with n-grams embedding and cross-language rectification (TANGER).
TANGER consists of a primary transformer with single patch embeddings of visual images, and a supplementary transformer with adaptive n-grams embeddings.
Cross-language rectification is achieved with a loss function that takes into account both language identification and contextual coherence scoring.
arXiv Detail & Related papers (2023-02-28T02:37:30Z)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model [110.10710554358455]
PaLI (Pathways Language and Image model) extends the scaling approach of large language models to the joint modeling of language and vision.
We create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages.
arXiv Detail & Related papers (2022-09-14T17:24:07Z)
- XDBERT: Distilling Visual Information to BERT from Cross-Modal Systems to Improve Language Understanding [73.24847320536813]
This study explores distilling visual information from pretrained multimodal transformers to pretrained language encoders.
Our framework is inspired by the success of cross-modal encoders on visual-language tasks, but alters the learning objective to cater to the language-heavy characteristics of NLU.
arXiv Detail & Related papers (2022-04-15T03:44:00Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention; a generic sketch of such an image-text contrastive loss appears after this list.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model.
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
The "vokenization" model is trained on relatively small image captioning datasets, and we then apply it to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
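As referenced in the ALBEF entry above, an image-text contrastive loss in its generic (CLIP/ALBEF-style) form is a symmetric cross-entropy over the similarity matrix of a batch of matched image and text embeddings. The sketch below shows only that generic form; it omits ALBEF's cross-modal fusion and momentum distillation, and the function name and temperature value are illustrative.

    import torch
    import torch.nn.functional as F

    def image_text_contrastive_loss(
        image_embeds: torch.Tensor,  # (batch, dim), one embedding per image
        text_embeds: torch.Tensor,   # (batch, dim), one embedding per text
        temperature: float = 0.07,   # illustrative value
    ) -> torch.Tensor:
        # Cosine similarities between every image and every text in the batch.
        image_embeds = F.normalize(image_embeds, dim=-1)
        text_embeds = F.normalize(text_embeds, dim=-1)
        logits = image_embeds @ text_embeds.t() / temperature

        # Matched pairs lie on the diagonal of the similarity matrix.
        targets = torch.arange(logits.size(0), device=logits.device)

        # Symmetric objective: image-to-text and text-to-image directions.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2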
This list is automatically generated from the titles and abstracts of the papers on this site.