Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework
- URL: http://arxiv.org/abs/2104.04182v1
- Date: Fri, 9 Apr 2021 04:00:10 GMT
- Title: Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework
- Authors: Santiago Castro, Ruoyao Wang, Pingxuan Huang, Ian Stewart, Nan Liu, Jonathan Stroud, Rada Mihalcea
- Abstract summary: We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model exhibit a large gap relative to human performance.
- Score: 19.031957183047048
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Work to date on language-informed video understanding has primarily addressed
two tasks: (1) video question answering using multiple-choice questions, where
models perform relatively well because they exploit the fact that candidate
answers are readily available; and (2) video captioning, which relies on an
open-ended evaluation framework that is often inaccurate because system answers
may be perceived as incorrect if they differ in form from the ground truth. In
this paper, we propose fill-in-the-blanks as a video understanding evaluation
framework that addresses these previous evaluation drawbacks, and more closely
reflects real-life settings where no multiple choices are given. The task tests
a system's understanding of a video by requiring the model to predict a masked
noun phrase in the caption of the video, given the video and the surrounding
text. We introduce a novel dataset consisting of 28,000 videos and
fill-in-the-blank tests. We show that both a multimodal model and a strong
language model have a large gap with human performance, thus suggesting that
the task is more challenging than current video understanding benchmarks.
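As a rough sketch, the evaluation protocol described above can be approximated as normalized exact-match scoring over masked noun phrases. The blank marker, the normalization step, and the use of multiple acceptable answers below are illustrative assumptions, not the paper's exact setup:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace so surface-form
    differences (e.g. 'The Dog ' vs 'the dog') do not count as errors."""
    return re.sub(r"\s+", " ", text.strip().lower())

def fill_in_blank_accuracy(examples):
    """Score a list of (masked_caption, prediction, answers) triples.

    Each caption contains a single blank; the model must supply the
    masked noun phrase, and a prediction counts as correct if it matches
    any acceptable ground-truth answer after normalization."""
    correct = 0
    for masked_caption, prediction, answers in examples:
        assert "_____" in masked_caption  # exactly one blank per caption
        if normalize(prediction) in {normalize(a) for a in answers}:
            correct += 1
    return correct / len(examples)

# Hypothetical examples; real instances come from the 28,000-video dataset.
examples = [
    ("A person walks _____ down the street.", "a dog", ["a dog", "the dog"]),
    ("Someone pours _____ into a glass.", "water", ["juice"]),
]
print(fill_in_blank_accuracy(examples))  # → 0.5
```

Unlike open-ended captioning metrics, this scoring only compares the short masked span, which limits (though does not eliminate) penalties for answers that differ in form from the ground truth.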
Related papers
- TVBench: Redesigning Video-Language Evaluation [48.71203934876828]
We show that the currently most used video-language benchmarks can be solved without requiring much temporal reasoning.
We propose TVBench, a novel open-source video multiple-choice question-answering benchmark.
arXiv Detail & Related papers (2024-10-10T09:28:36Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z)
- Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive performance on video action recognition (AR), video retrieval (RT), and video multiple-choice (MC) tasks.
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
arXiv Detail & Related papers (2023-10-07T20:57:54Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Video as Conditional Graph Hierarchy for Multi-Granular Question Answering [80.94367625007352]
We argue that while video is presented in frame sequence, the visual elements are not sequential but rather hierarchical in semantic space.
We propose to model video as a conditional graph hierarchy which weaves together visual facts of different granularity in a level-wise manner.
arXiv Detail & Related papers (2021-12-12T10:35:19Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information onward.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.