Vision Language Transformers: A Survey
- URL: http://arxiv.org/abs/2307.03254v1
- Date: Thu, 6 Jul 2023 19:08:56 GMT
- Title: Vision Language Transformers: A Survey
- Authors: Clayton Fields, Casey Kennington
- Abstract summary: Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform.
Recent research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling.
Transformer models have greatly improved performance and versatility over previous vision language models.
- Score: 0.9137554315375919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision language tasks, such as answering questions about or generating
captions that describe an image, are difficult tasks for computers to perform.
A relatively recent body of research has adapted the pretrained transformer
architecture introduced in Vaswani et al. (2017) to vision language
modeling. Transformer models have greatly improved performance and versatility
over previous vision language models. They do so by pretraining models on
large, generic datasets and transferring their learning to new tasks with minor
changes in architecture and parameter values. This type of transfer learning
has become the standard modeling practice in both natural language processing
and computer vision. Vision language transformers offer the promise of
producing similar advancements in tasks which require both vision and language.
In this paper, we provide a broad synthesis of the currently available research
on vision language transformer models and offer some analysis of their
strengths, limitations and some open questions that remain.
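To make the pretrain-then-transfer pattern described in the abstract concrete, here is a minimal sketch that loads a pretrained vision-language transformer and trains only a small task head on top of its frozen image and text features. It assumes the Hugging Face `transformers` library and the public `openai/clip-vit-base-patch32` checkpoint; the classification head and label count are illustrative placeholders, not anything prescribed by the survey.

```python
# Minimal sketch of transfer learning with a pretrained vision-language
# transformer (CLIP via Hugging Face transformers). The pretrained encoders
# stay frozen; only a small, newly added task head is trained.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

backbone = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

for p in backbone.parameters():              # freeze the pretrained weights
    p.requires_grad = False

num_labels = 3                               # hypothetical downstream label set
head = nn.Linear(backbone.config.projection_dim * 2, num_labels)

def forward(images, texts):
    """Encode an image-text pair and classify it with the new head."""
    inputs = processor(text=texts, images=images,
                       return_tensors="pt", padding=True)
    img = backbone.get_image_features(pixel_values=inputs["pixel_values"])
    txt = backbone.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
    return head(torch.cat([img, txt], dim=-1))   # (batch, num_labels)

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
```

Because only the head is updated, the "minor changes in architecture and parameter values" mentioned in the abstract amount here to one new linear layer and its optimizer state.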
Related papers
- A Review of Transformer-Based Models for Computer Vision Tasks: Capturing Global Context and Spatial Relationships [0.5639904484784127]
Transformer-based models have transformed the landscape of natural language processing (NLP).
These models are renowned for their ability to capture long-range dependencies and contextual information.
We discuss potential research directions and applications of transformer-based models in computer vision.
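As a concrete reminder of the mechanism behind the "global context" claim, the sketch below (plain PyTorch, with illustrative sizes) splits an image into patch tokens and runs multi-head self-attention over them, so every patch can attend to every other patch in a single layer.

```python
# Minimal ViT-style encoder step: patch embedding + multi-head self-attention.
# Sizes are illustrative only.
import torch
import torch.nn as nn

class TinyViTBlock(nn.Module):
    def __init__(self, img_size=32, patch=8, dim=64, heads=4):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # Non-overlapping patches -> token embeddings (a strided conv does the split).
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        self.norm = nn.LayerNorm(dim)
        # Every patch token attends to every other: global context in one step.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, images):                        # images: (B, 3, H, W)
        tokens = self.to_tokens(images)               # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)    # (B, n_patches, dim)
        x = self.norm(tokens + self.pos)
        out, weights = self.attn(x, x, x)             # weights: (B, n_patches, n_patches)
        return out, weights

block = TinyViTBlock()
out, attn = block(torch.randn(2, 3, 32, 32))
print(out.shape, attn.shape)   # torch.Size([2, 16, 64]) torch.Size([2, 16, 16])
```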
arXiv Detail & Related papers (2024-08-27T16:22:18Z)
- Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperforms the prior state of the art, especially on targeted problems requiring higher-level control.
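The summary above does not spell out the training objective, so the sketch below should be read as a generic illustration of language-supervised visual representation learning rather than Voltron's actual recipe: a CLIP-style symmetric InfoNCE loss that pulls matching video-clip and caption embeddings together.

```python
# Generic CLIP-style contrastive alignment between clip/frame embeddings and
# caption embeddings (an illustration of language-supervised representation
# learning, not necessarily the objective used by Voltron).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (video, caption) pairs share an index."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0))              # diagonal pairs are positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```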
arXiv Detail & Related papers (2023-02-24T17:29:31Z)
- Is Multimodal Vision Supervision Beneficial to Language? [2.216702991322677]
Vision (image and video) pre-training is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks.
We compare the performance of language representations of stand-alone text encoders of these models to the language representations of text encoders learnt through vision supervision.
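One hedged way such a comparison could be probed, assuming the Hugging Face `transformers` library and the public BERT and CLIP checkpoints (the probe sentences are made up, and the paper's actual evaluation protocol is more extensive): embed the same sentence pair with a text-only encoder and with a vision-supervised text encoder, then compare the similarity each assigns.

```python
# Rough probe: embed the same sentences with a text-only encoder (BERT) and
# with a text encoder trained under vision supervision (CLIP's), then compare
# the cosine similarity each assigns to the pair.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel, CLIPModel, CLIPProcessor

sentences = ["a dog chasing a ball", "a puppy running after a toy"]  # made-up probe

# Text-only encoder: mean-pool BERT's last hidden states.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
enc = tok(sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state              # (2, seq, 768)
mask = enc["attention_mask"].unsqueeze(-1)
bert_emb = (hidden * mask).sum(1) / mask.sum(1)

# Vision-supervised text encoder: CLIP's projected text features.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_in = proc(text=sentences, return_tensors="pt", padding=True)
with torch.no_grad():
    clip_emb = clip.get_text_features(**clip_in)         # (2, 512)

print("BERT similarity:", F.cosine_similarity(bert_emb[0], bert_emb[1], dim=0).item())
print("CLIP similarity:", F.cosine_similarity(clip_emb[0], clip_emb[1], dim=0).item())
```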
arXiv Detail & Related papers (2023-02-10T02:22:44Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
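A minimal sketch of the general idea, in plain PyTorch with illustrative sizes rather than the paper's actual architecture: patch tokens from the visual observation and embedded instruction tokens are concatenated, encoded jointly by a transformer, and pooled into action logits.

```python
# Minimal sketch of a multimodal transformer policy: image patches and
# instruction tokens are embedded, concatenated, encoded jointly, and a
# pooled representation is mapped to action logits. Sizes and the action
# space are placeholders, not the architecture from the paper.
import torch
import torch.nn as nn

class InstructionPolicy(nn.Module):
    def __init__(self, vocab=1000, dim=128, n_actions=6):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # vision tokens
        self.embed = nn.Embedding(vocab, dim)                         # language tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(dim, n_actions)

    def forward(self, image, instruction_ids):
        vis = self.patchify(image).flatten(2).transpose(1, 2)  # (B, P, dim)
        txt = self.embed(instruction_ids)                      # (B, T, dim)
        fused = self.encoder(torch.cat([vis, txt], dim=1))     # joint attention
        return self.action_head(fused.mean(dim=1))             # (B, n_actions)

policy = InstructionPolicy()
logits = policy(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 7)))
```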
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- PaLI: A Jointly-Scaled Multilingual Language-Image Model [110.10710554358455]
PaLI (Pathways Language and Image model) extends the scaling and flexible task-interface approach of large language models to the joint modeling of language and vision.
We create a large multilingual mix of pretraining tasks, based on a new image-text training set containing 10B images and texts in over 100 languages.
arXiv Detail & Related papers (2022-09-14T17:24:07Z)
- Pre-training image-language transformers for open-vocabulary tasks [53.446599611203474]
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, as well as object-aware strategies to pre-train the model.
We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
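As a hedged illustration of what "a mixture of diverse tasks" can look like inside a training loop (the task names, weights, stand-in model, and dummy losses below are placeholders, not the paper's recipe), each step samples one objective from a weighted mixture and optimizes only that loss.

```python
# Schematic of mixed-objective pre-training: each step samples one task
# (captioning vs. image-text matching here) from a weighted mixture and
# optimizes only that objective. Everything here is a placeholder.
import random
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                          # stand-in for a VL transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
task_mixture = {"captioning": 0.6, "image_text_matching": 0.4}
loss_fns = {
    "captioning": lambda out: out.pow(2).mean(),           # dummy generative loss
    "image_text_matching": lambda out: out.abs().mean(),   # dummy matching loss
}

for step in range(3):                              # toy loop, one task per step
    task = random.choices(list(task_mixture), weights=list(task_mixture.values()))[0]
    loss = loss_fns[task](model(torch.randn(4, 16)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(step, task, round(loss.item(), 4))
```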
arXiv Detail & Related papers (2022-09-09T16:11:11Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.