Survey: Transformer based Video-Language Pre-training
- URL: http://arxiv.org/abs/2109.09920v1
- Date: Tue, 21 Sep 2021 02:36:06 GMT
- Title: Survey: Transformer based Video-Language Pre-training
- Authors: Ludan Ruan and Qin Jin
- Abstract summary: This survey aims to give a comprehensive overview of transformer-based pre-training methods for Video-Language learning.
We first briefly introduce the transformer structure as the background knowledge, including the attention mechanism, position encoding, etc.
We categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations and compare their performances.
- Score: 28.870441287367825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by the success of transformer-based pre-training methods on natural
language tasks and subsequently on computer vision tasks, researchers have begun to
apply transformers to video processing. This survey aims to give a comprehensive
overview of transformer-based pre-training methods for Video-Language learning.
We first briefly introduce the transformer structure as the background
knowledge, including the attention mechanism, position encoding, etc. We then
describe the typical paradigm of pre-training & fine-tuning on Video-Language
processing in terms of proxy tasks, downstream tasks and commonly used video
datasets. Next, we categorize transformer models into Single-Stream and
Multi-Stream structures, highlight their innovations and compare their
performances. Finally, we analyze and discuss the current challenges and
possible future research directions for Video-Language pre-training.
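For readers less familiar with the background the abstract mentions, the sketch below is a minimal illustration, not code from the survey: layer sizes, module names, and the PyTorch-based implementation are our own assumptions. It contrasts the two architecture families the survey categorizes, a Single-Stream encoder that feeds concatenated video and text tokens through one joint transformer, and a Multi-Stream encoder that encodes each modality separately and fuses them with cross-attention, and includes the standard sinusoidal position encoding.

```python
# Minimal sketch (illustrative assumptions only) of Single-Stream vs. Multi-Stream
# video-language encoders, plus sinusoidal position encoding.
import math
import torch
import torch.nn as nn


def sinusoidal_position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position encoding from 'Attention Is All You Need'."""
    position = torch.arange(seq_len).unsqueeze(1)                        # (L, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe                                                            # (L, D)


class SingleStreamEncoder(nn.Module):
    """Concatenate video and text tokens and run one joint transformer."""
    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        x = torch.cat([video_feats, text_feats], dim=1)                  # (B, Lv+Lt, D)
        x = x + sinusoidal_position_encoding(x.size(1), x.size(2)).to(x.device)
        return self.encoder(x)


class MultiStreamEncoder(nn.Module):
    """Encode each modality separately, then fuse text into video via cross-attention."""
    def __init__(self, d_model: int = 256, n_layers: int = 2, n_heads: int = 8):
        super().__init__()
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, video_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        v = self.video_encoder(video_feats + sinusoidal_position_encoding(
            video_feats.size(1), video_feats.size(2)).to(video_feats.device))
        t = self.text_encoder(text_feats + sinusoidal_position_encoding(
            text_feats.size(1), text_feats.size(2)).to(text_feats.device))
        fused, _ = self.cross_attn(query=v, key=t, value=t)              # video attends to text
        return fused


# Toy usage: 8 video-frame features and 12 text-token embeddings, both 256-d.
video = torch.randn(2, 8, 256)
text = torch.randn(2, 12, 256)
print(SingleStreamEncoder()(video, text).shape)   # torch.Size([2, 20, 256])
print(MultiStreamEncoder()(video, text).shape)    # torch.Size([2, 8, 256])
```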
Related papers
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- Vision Language Transformers: A Survey [0.9137554315375919]
Vision language tasks, such as answering questions about or generating captions that describe an image, are difficult tasks for computers to perform.
Recent research has adapted the pretrained transformer architecture introduced in Vaswani et al. (2017) to vision language modeling.
Transformer models have greatly improved performance and versatility over previous vision language models.
arXiv Detail & Related papers (2023-07-06T19:08:56Z)
- Joint Moment Retrieval and Highlight Detection Via Natural Language Queries [0.0]
We propose a new method for natural language query based joint video summarization and highlight detection.
This approach will use both visual and audio cues to match a user's natural language query to retrieve the most relevant and interesting moments from a video.
Our approach employs multiple recent techniques used in Vision Transformers (ViTs) to create a transformer-like encoder-decoder model.
arXiv Detail & Related papers (2023-05-08T18:00:33Z)
- Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions [21.72567982148215]
We show how to train transformers with a similar next-step prediction objective on offline data.
We propose a novel method for unifying language reasoning with actions in a single policy.
Specifically, we augment a transformer policy with word outputs, so it can generate textual captions interleaved with actions.
arXiv Detail & Related papers (2023-04-18T16:12:38Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Pre-training image-language transformers for open-vocabulary tasks [53.446599611203474]
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
We explore both the use of image-text captioning data in pre-training, which does not need additional supervision, and object-aware strategies to pre-train the model.
We evaluate the method on a number of text-generative vision+language tasks, such as Visual Question Answering, visual entailment and captioning, and demonstrate large gains over standard pre-training methods.
arXiv Detail & Related papers (2022-09-09T16:11:11Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language, RASP.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequence of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.