VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation
- URL: http://arxiv.org/abs/2106.04632v1
- Date: Tue, 8 Jun 2021 18:34:21 GMT
- Title: VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation
- Authors: Linjie Li, Jie Lei, Zhe Gan, Licheng Yu, Yen-Chun Chen, Rohit Pillai,
Yu Cheng, Luowei Zhou, Xin Eric Wang, William Yang Wang, Tamara Lee Berg,
Mohit Bansal, Jingjing Liu, Lijuan Wang, Zicheng Liu
- Abstract summary: The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
- Score: 124.02278735049235
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Most existing video-and-language (VidL) research focuses on a single dataset,
or multiple datasets of a single task. In reality, a truly useful VidL system
is expected to be easily generalizable to diverse tasks, domains, and datasets.
To facilitate the evaluation of such systems, we introduce the Video-And-Language
Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets
over 3 popular tasks: (i) text-to-video retrieval; (ii) video question
answering; and (iii) video captioning. The VALUE benchmark aims to cover a broad
range of video genres, video lengths, data volumes, and task difficulty levels.
Rather than focusing on single-channel videos with visual information only,
VALUE promotes models that leverage information from both video frames and
their associated subtitles, as well as models that share knowledge across
multiple tasks. We evaluate various baseline methods with and without
large-scale VidL pre-training, and systematically investigate the impact of
video input channels, fusion methods, and different video representations. We
also study the transferability between tasks, and conduct multi-task learning
under different settings. The significant gap between our best model and human
performance calls for future study of advanced VidL models. VALUE is available
at https://value-leaderboard.github.io/.
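As a rough illustration of the multi-task setup described above, the sketch below shows how an evaluation harness might aggregate task-appropriate metrics (Recall@K for text-to-video retrieval, exact-match accuracy for multiple-choice QA) across VALUE-style tasks. This is a minimal sketch, not the official VALUE evaluation code; the task names and data structures are illustrative assumptions, and captioning metrics (CIDEr, etc.) are omitted.

```python
# Hypothetical multi-task scoring sketch for a VALUE-style benchmark.
# Task names ("tvr", "tvqa") and the dict layout are illustrative assumptions.
import numpy as np

def recall_at_k(sim_matrix: np.ndarray, k: int = 1) -> float:
    """Text-to-video retrieval: fraction of queries whose ground-truth video
    (assumed to sit on the diagonal) ranks within the top-k by similarity."""
    ranks = (-sim_matrix).argsort(axis=1)          # descending similarity per query
    hits = [i in ranks[i, :k] for i in range(sim_matrix.shape[0])]
    return float(np.mean(hits))

def qa_accuracy(pred_answers, gold_answers) -> float:
    """Multiple-choice video QA: exact-match accuracy."""
    correct = sum(p == g for p, g in zip(pred_answers, gold_answers))
    return correct / max(len(gold_answers), 1)

def evaluate_all(task_outputs: dict) -> dict:
    """Aggregate per-task scores; captioning metrics would normally come
    from an external package and are omitted in this sketch."""
    scores = {}
    for task, out in task_outputs.items():
        if out["type"] == "retrieval":
            scores[task] = recall_at_k(out["sim_matrix"], k=1)
        elif out["type"] == "qa":
            scores[task] = qa_accuracy(out["preds"], out["golds"])
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = {
        "tvr": {"type": "retrieval", "sim_matrix": rng.normal(size=(5, 5))},
        "tvqa": {"type": "qa", "preds": [0, 1, 2], "golds": [0, 1, 1]},
    }
    print(evaluate_all(demo))
```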
Related papers
- DLM-VMTL: A Double Layer Mapper for heterogeneous data video Multi-task prompt learning [2.4121373594852846] (arXiv 2024-08-29)
Multi-task learning lets a visual task acquire rich, shareable knowledge from other tasks through joint training.
A heterogeneous-data video multi-task prompt learning (VMTL) method is proposed to address this problem.
A Double-Layer Mapper (DLM) is proposed to extract the shareable knowledge into visual prompts and align it with the representation of the primary task.
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273] (arXiv 2024-07-18)
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD) and a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models.
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055] (arXiv 2024-06-13)
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116] (arXiv 2023-12-04)
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
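As a hedged sketch of the data-level idea described for VaQuitA above, the snippet below ranks candidate frames by CLIP image-text similarity to a query and keeps the top-k, instead of sampling frames uniformly. It is not the authors' implementation; the CLIP checkpoint and the top-k value are illustrative choices.

```python
# Illustrative CLIP-score-guided frame selection (not VaQuitA's actual code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def select_frames(frames, query, k=4,
                  model_name="openai/clip-vit-base-patch32"):
    """Score each candidate frame against the text query with CLIP and
    keep the k highest-scoring frames, preserving temporal order."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[query], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)  # (num_frames,)
    top = torch.topk(scores, k=min(k, len(frames))).indices.tolist()
    return [frames[i] for i in sorted(top)]

if __name__ == "__main__":
    # Dummy images stand in for decoded video frames.
    dummy = [Image.new("RGB", (224, 224), color=(i * 30, 0, 0)) for i in range(8)]
    kept = select_frames(dummy, "a person cooking in a kitchen", k=4)
    print(f"kept {len(kept)} of {len(dummy)} frames")
```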
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265] (arXiv 2023-07-13)
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455] (arXiv 2023-05-22)
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
- Video Understanding as Machine Translation [53.59298393079866] (arXiv 2020-06-12)
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367] (arXiv 2020-03-25)
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
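To make the Violin task format concrete, here is a minimal, hypothetical sketch of an entailment head that fuses pooled video, subtitle, and hypothesis features into a single entailed-vs-contradicted logit. The feature dimensions and simple concatenation fusion are assumptions for illustration, not the paper's model.

```python
# Hypothetical entailment head for a video-and-language inference setup.
import torch
import torch.nn as nn

class EntailmentHead(nn.Module):
    def __init__(self, video_dim=512, text_dim=512, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(video_dim + 2 * text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "hypothesis is entailed"
        )

    def forward(self, video_feat, subtitle_feat, hypothesis_feat):
        # Premise = video + subtitles; hypothesis = natural-language statement.
        fused = torch.cat([video_feat, subtitle_feat, hypothesis_feat], dim=-1)
        return self.classifier(fused).squeeze(-1)

if __name__ == "__main__":
    head = EntailmentHead()
    logit = head(torch.randn(2, 512), torch.randn(2, 512), torch.randn(2, 512))
    print(torch.sigmoid(logit))  # probability each hypothesis is entailed
```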
This list is automatically generated from the titles and abstracts of the papers on this site.