Video Understanding as Machine Translation
- URL: http://arxiv.org/abs/2006.07203v2
- Date: Thu, 17 Sep 2020 19:41:31 GMT
- Title: Video Understanding as Machine Translation
- Authors: Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani
- Abstract summary: We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).
- Score: 53.59298393079866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advent of large-scale multimodal video datasets, especially
sequences with audio or transcribed speech, there has been a growing interest
in self-supervised learning of video representations. Most prior work
formulates the objective as a contrastive metric learning problem between the
modalities. To enable effective learning, however, these strategies require a
careful selection of positive and negative samples often combined with
hand-designed curriculum policies. In this work we remove the need for negative
sampling by taking a generative modeling approach that poses the objective as a
translation problem between modalities. Such a formulation allows us to tackle
a wide variety of downstream video understanding tasks by means of a single
unified framework, without the need for large batches of negative samples
common in contrastive metric learning. We experiment with the large-scale
HowTo100M dataset for training, and report performance gains over the
state-of-the-art on several downstream tasks including video classification
(EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and
MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).
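The abstract's core distinction (contrastive metric learning needs large batches of carefully chosen negatives, while a generative translation objective needs only each clip's paired target sequence) can be sketched with two toy NumPy losses. The shapes, embeddings, and vocabulary below are hypothetical illustrations, not the paper's actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    # InfoNCE-style objective: every other pair in the batch serves as a
    # negative, so effectiveness hinges on batch size and negative selection.
    sims = video_emb @ text_emb.T / temperature   # (B, B) similarity matrix
    probs = softmax(sims, axis=1)
    return -np.mean(np.log(np.diag(probs)))       # matched pairs on the diagonal

def translation_loss(decoder_logits, target_tokens):
    # Generative objective: token-level cross-entropy against the paired
    # target modality (e.g. transcribed speech); no negatives are needed.
    probs = softmax(decoder_logits, axis=-1)      # (T, vocab_size)
    return -np.mean(np.log(probs[np.arange(len(target_tokens)), target_tokens]))

rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))        # batch of 4 toy video embeddings
text = rng.normal(size=(4, 8))         # their paired text embeddings
logits = rng.normal(size=(5, 100))     # decoder logits for a 5-token target
targets = rng.integers(0, 100, size=5) # target token ids
print(contrastive_loss(video, text), translation_loss(logits, targets))
```

Note how the contrastive loss requires the full batch-by-batch similarity matrix, whereas the translation loss depends only on each clip and its own target sequence; this is what removes the need for negative sampling in the formulation above.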
Related papers
- MAMA: Meta-optimized Angular Margin Contrastive Framework for Video-Language Representation Learning [34.259833094575285]
MAMA is a new approach to learning video-language representations by utilizing a contrastive objective with a subtractive angular margin.
MAMA improves video-language representations and achieves superior performance on commonly used video question answering and text-video retrieval datasets.
arXiv Detail & Related papers (2024-07-04T09:52:17Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning [35.404100473539195]
Text-video retrieval aims to rank relevant text/video higher than irrelevant ones.
Recent contrastive learning methods have shown promising results for text-video retrieval.
This paper improves contrastive learning using two novel techniques.
arXiv Detail & Related papers (2023-09-20T06:08:11Z)
- MAViC: Multimodal Active Learning for Video Captioning [8.454261564411436]
In this paper, we introduce MAViC to address the challenges of active learning approaches for video captioning.
Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function.
arXiv Detail & Related papers (2022-12-11T18:51:57Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set new state-of-the-art on three public text-video retrieval benchmarks of YouCook2, MSR-VTT and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Self-supervised pre-training and contrastive representation learning for multiple-choice video QA [39.78914328623504]
Video Question Answering (Video QA) requires fine-grained understanding of both video and language modalities to answer the given questions.
We propose novel training schemes for multiple-choice video question answering, with a self-supervised pre-training stage followed by supervised contrastive learning as an auxiliary objective in the main stage.
We evaluate our proposed model on highly competitive benchmark datasets related to multiple-choice video QA: TVQA, TVQA+, and DramaQA.
arXiv Detail & Related papers (2020-09-17T03:37:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.