What is More Likely to Happen Next? Video-and-Language Future Event Prediction
- URL: http://arxiv.org/abs/2010.07999v1
- Date: Thu, 15 Oct 2020 19:56:47 GMT
- Title: What is More Likely to Happen Next? Video-and-Language Future Event Prediction
- Authors: Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
- Abstract summary: Given a video with aligned dialogue, people can often infer what is more likely to happen next.
In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions.
We collect a new dataset, named Video-and-Language Event Prediction, with 28,726 future event prediction examples.
- Score: 111.93601253692165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. In order to promote the collection of non-trivial challenging examples, we employ an adversarial human-and-model-in-the-loop data collection procedure. We also present a strong baseline incorporating information from video, dialogue, and commonsense knowledge. Experiments show that each type of information is useful for this challenging task, and that compared to the high human performance on VLEP, our model provides a good starting point but leaves large room for future work. Our dataset and code are available at: https://github.com/jayleicn/VideoLanguageFuturePred
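To make the task setup concrete, below is a minimal, hypothetical sketch (in PyTorch) of a late-fusion scorer for this kind of two-candidate next-event prediction: pooled video, dialogue, and commonsense feature vectors are projected into a shared space, and each candidate future event is scored against that fused context. The feature dimensions, module names, and the simple concat-plus-MLP fusion are illustrative assumptions, not the paper's actual baseline architecture.

```python
# Hypothetical sketch of a VLEP-style task setup: given pooled video,
# dialogue, and commonsense feature vectors for a clip, score two candidate
# future events and pick the more likely one. All dimensions, names, and the
# late-fusion design are illustrative assumptions, not the authors' baseline.
import torch
import torch.nn as nn


class NextEventScorer(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden_dim=512):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.dialogue_proj = nn.Linear(text_dim, hidden_dim)
        self.commonsense_proj = nn.Linear(text_dim, hidden_dim)
        self.event_proj = nn.Linear(text_dim, hidden_dim)
        # Score a (context, candidate event) pair.
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim * 4, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, video_feat, dialogue_feat, commonsense_feat, event_feats):
        # event_feats: (batch, 2, text_dim) -- the two candidate future events.
        ctx = torch.cat(
            [
                self.video_proj(video_feat),
                self.dialogue_proj(dialogue_feat),
                self.commonsense_proj(commonsense_feat),
            ],
            dim=-1,
        )                                                    # (batch, hidden * 3)
        ctx = ctx.unsqueeze(1).expand(-1, 2, -1)             # (batch, 2, hidden * 3)
        ev = self.event_proj(event_feats)                    # (batch, 2, hidden)
        scores = self.scorer(torch.cat([ctx, ev], dim=-1))   # (batch, 2, 1)
        return scores.squeeze(-1)                            # (batch, 2)


if __name__ == "__main__":
    model = NextEventScorer()
    batch = 4
    scores = model(
        torch.randn(batch, 2048),    # pooled video features (e.g. from a CNN)
        torch.randn(batch, 768),     # pooled dialogue/subtitle embedding
        torch.randn(batch, 768),     # pooled commonsense-knowledge embedding
        torch.randn(batch, 2, 768),  # embeddings of the two candidate events
    )
    prediction = scores.argmax(dim=-1)  # index of the more likely next event
    loss = nn.CrossEntropyLoss()(scores, torch.randint(0, 2, (batch,)))
    print(prediction.shape, loss.item())
```

Training with a standard cross-entropy loss over the two candidate scores, as in the usage stub above, mirrors the multiple-choice formulation described in the abstract.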
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
- Human-Object Interaction Prediction in Videos through Gaze Following [9.61701724661823]
We design a framework to detect current HOIs and anticipate future HOIs in videos.
We propose to leverage human information since people often fixate on an object before interacting with it.
Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life.
arXiv Detail & Related papers (2023-06-06T11:36:14Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
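For contrast with the two-candidate formulation sketched earlier, here is an equally minimal, hypothetical sketch of a Violin-style inference setup: the premise (video plus aligned subtitles) and a single natural language hypothesis are fused and classified as entailed or contradicted. Again, the dimensions and the concat-plus-MLP head are illustrative assumptions rather than the paper's model.

```python
# Hypothetical sketch of a Violin-style video-and-language inference setup:
# fuse premise features (video + subtitles) with a hypothesis embedding and
# predict a binary entailment label. Dimensions and the fusion head are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class VideoTextEntailment(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + 2 * text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # logit for "entailed"
        )

    def forward(self, video_feat, subtitle_feat, hypothesis_feat):
        x = torch.cat([video_feat, subtitle_feat, hypothesis_feat], dim=-1)
        return self.fuse(x).squeeze(-1)  # (batch,) entailment logits


if __name__ == "__main__":
    model = VideoTextEntailment()
    logits = model(torch.randn(8, 2048), torch.randn(8, 768), torch.randn(8, 768))
    labels = torch.randint(0, 2, (8,)).float()  # 1 = entailed, 0 = contradicted
    loss = nn.BCEWithLogitsLoss()(logits, labels)
    print(loss.item())
```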