What is More Likely to Happen Next? Video-and-Language Future Event Prediction
- URL: http://arxiv.org/abs/2010.07999v1
- Date: Thu, 15 Oct 2020 19:56:47 GMT
- Title: What is More Likely to Happen Next? Video-and-Language Future Event Prediction
- Authors: Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
- Abstract summary: Given a video with aligned dialogue, people can often infer what is more likely to happen next.
In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions.
We collect a new dataset, named Video-and-Language Event Prediction, with 28,726 future event prediction examples.
- Score: 111.93601253692165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. In order to promote the collection of non-trivial challenging examples, we employ an adversarial human-and-model-in-the-loop data collection procedure. We also present a strong baseline incorporating information from video, dialogue, and commonsense knowledge. Experiments show that each type of information is useful for this challenging task, and that compared to the high human performance on VLEP, our model provides a good starting point but leaves large room for future work. Our dataset and code are available at: https://github.com/jayleicn/VideoLanguageFuturePred
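To make the task setup concrete, below is a minimal, hypothetical sketch (in PyTorch) of a late-fusion scorer for this kind of two-candidate next-event prediction: pooled video, dialogue, and commonsense feature vectors are projected into a shared space, and each candidate future event is scored against that fused context. The feature dimensions, module names, and the simple concat-plus-MLP fusion are illustrative assumptions, not the paper's actual baseline architecture.

```python
# Hypothetical sketch of a VLEP-style task setup: given pooled video,
# dialogue, and commonsense feature vectors for a clip, score two candidate
# future events and pick the more likely one. All dimensions, names, and the
# late-fusion design are illustrative assumptions, not the authors' baseline.
import torch
import torch.nn as nn


class NextEventScorer(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden_dim=512):
        super().__init__()
        # Project each modality into a shared hidden space.
        self.video_proj = nn.Linear(video_dim, hidden_dim)
        self.dialogue_proj = nn.Linear(text_dim, hidden_dim)
        self.commonsense_proj = nn.Linear(text_dim, hidden_dim)
        self.event_proj = nn.Linear(text_dim, hidden_dim)
        # Score a (context, candidate event) pair.
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim * 4, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, video_feat, dialogue_feat, commonsense_feat, event_feats):
        # event_feats: (batch, 2, text_dim) -- the two candidate future events.
        ctx = torch.cat(
            [
                self.video_proj(video_feat),
                self.dialogue_proj(dialogue_feat),
                self.commonsense_proj(commonsense_feat),
            ],
            dim=-1,
        )                                                    # (batch, hidden * 3)
        ctx = ctx.unsqueeze(1).expand(-1, 2, -1)             # (batch, 2, hidden * 3)
        ev = self.event_proj(event_feats)                    # (batch, 2, hidden)
        scores = self.scorer(torch.cat([ctx, ev], dim=-1))   # (batch, 2, 1)
        return scores.squeeze(-1)                            # (batch, 2)


if __name__ == "__main__":
    model = NextEventScorer()
    batch = 4
    scores = model(
        torch.randn(batch, 2048),    # pooled video features (e.g. from a CNN)
        torch.randn(batch, 768),     # pooled dialogue/subtitle embedding
        torch.randn(batch, 768),     # pooled commonsense-knowledge embedding
        torch.randn(batch, 2, 768),  # embeddings of the two candidate events
    )
    prediction = scores.argmax(dim=-1)  # index of the more likely next event
    loss = nn.CrossEntropyLoss()(scores, torch.randint(0, 2, (batch,)))
    print(prediction.shape, loss.item())
```

Training with a standard cross-entropy loss over the two candidate scores, as in the usage stub above, mirrors the multiple-choice formulation described in the abstract.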
Related papers
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
- Human-Object Interaction Prediction in Videos through Gaze Following [9.61701724661823]
We design a framework to detect current HOIs and anticipate future HOIs in videos.
We propose to leverage human information since people often fixate on an object before interacting with it.
Our model is trained and validated on the VidHOI dataset, which contains videos capturing daily life.
arXiv Detail & Related papers (2023-06-06T11:36:14Z)
- How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios [73.24092762346095]
We introduce two large-scale datasets with over 60,000 videos annotated for emotional response and subjective wellbeing.
The Video Cognitive Empathy dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states.
The Video to Valence dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing.
arXiv Detail & Related papers (2022-10-18T17:58:25Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with subtitles aligned as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
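For contrast with the two-candidate formulation sketched earlier, here is an equally minimal, hypothetical sketch of a Violin-style inference setup: the premise (video plus aligned subtitles) and a single natural language hypothesis are fused and classified as entailed or contradicted. Again, the dimensions and the concat-plus-MLP head are illustrative assumptions rather than the paper's model.

```python
# Hypothetical sketch of a Violin-style video-and-language inference setup:
# fuse premise features (video + subtitles) with a hypothesis embedding and
# predict a binary entailment label. Dimensions and the fusion head are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class VideoTextEntailment(nn.Module):
    def __init__(self, video_dim=2048, text_dim=768, hidden_dim=512):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + 2 * text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # logit for "entailed"
        )

    def forward(self, video_feat, subtitle_feat, hypothesis_feat):
        x = torch.cat([video_feat, subtitle_feat, hypothesis_feat], dim=-1)
        return self.fuse(x).squeeze(-1)  # (batch,) entailment logits


if __name__ == "__main__":
    model = VideoTextEntailment()
    logits = model(torch.randn(8, 2048), torch.randn(8, 768), torch.randn(8, 768))
    labels = torch.randint(0, 2, (8,)).float()  # 1 = entailed, 0 = contradicted
    loss = nn.BCEWithLogitsLoss()(logits, labels)
    print(loss.item())
```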