Visual Semantic Role Labeling for Video Understanding
- URL: http://arxiv.org/abs/2104.00990v1
- Date: Fri, 2 Apr 2021 11:23:22 GMT
- Title: Visual Semantic Role Labeling for Video Understanding
- Authors: Arka Sadhu, Tanmay Gupta, Mark Yatskar, Ram Nevatia, Aniruddha
Kembhavi
- Abstract summary: We propose a new framework for understanding and representing related salient events in a video using visual semantic role labeling.
We represent videos as a set of related events, wherein each event consists of a verb and multiple entities that fulfill various roles relevant to that event.
We introduce the VidSitu benchmark, a large-scale video understanding data source with $29K$ $10$-second movie clips richly annotated with a verb and semantic-roles every $2$ seconds.
- Score: 46.02181466801726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new framework for understanding and representing related salient
events in a video using visual semantic role labeling. We represent videos as a
set of related events, wherein each event consists of a verb and multiple
entities that fulfill various roles relevant to that event. To study the
challenging task of semantic role labeling in videos or VidSRL, we introduce
the VidSitu benchmark, a large-scale video understanding data source with 29K
10-second movie clips richly annotated with a verb and semantic roles every
2 seconds. Entities are co-referenced across events within a movie clip and
events are connected to each other via event-event relations. Clips in VidSitu
are drawn from a large collection of movies (~3K) and have been chosen
to be both complex (~4.2 unique verbs within a video) as well as diverse
(~200 verbs have more than 100 annotations each). We provide a
comprehensive analysis of the dataset in comparison to other publicly available
video understanding benchmarks, present several illustrative baselines, and
evaluate a range of standard video recognition models. Our code and dataset are
available at vidsitu.org.
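To make the representation concrete, below is a minimal Python sketch of how a VidSitu-style clip annotation could be modeled: each 2-second event carries a verb and a mapping from semantic roles to entities, entities are co-referenced by shared IDs across events, and events are linked by event-event relations. The class names, role labels, and relation strings are illustrative assumptions, not the dataset's released schema (see vidsitu.org for the actual format).

```python
# Illustrative sketch of a VidSitu-style clip annotation (hypothetical schema,
# not the official one published at vidsitu.org).
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Entity:
    entity_id: str      # shared across events -> co-reference within a clip
    description: str    # free-text description, e.g. "woman in a red coat"


@dataclass
class Event:
    start_sec: int      # events are annotated every 2 seconds
    end_sec: int
    verb: str           # e.g. "throw", "chase"
    roles: Dict[str, str]  # semantic role -> entity_id, e.g. {"Arg0": "E1"}


@dataclass
class ClipAnnotation:
    clip_id: str
    entities: Dict[str, Entity] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)
    # (event_index_a, relation, event_index_b); relation strings are illustrative
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


# A 10-second clip yields five 2-second events; the same entity_id appearing in
# several events encodes co-reference across those events.
clip = ClipAnnotation(
    clip_id="example_clip",
    entities={"E1": Entity("E1", "man in a hat"), "E2": Entity("E2", "a ball")},
    events=[
        Event(0, 2, "throw", {"Arg0": "E1", "Arg1": "E2"}),
        Event(2, 4, "run", {"Arg0": "E1"}),
    ],
    relations=[(1, "is a reaction to", 0)],
)
```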
Related papers
- Learning Video Context as Interleaved Multimodal Sequences [40.15446453928028]
MovieSeq is a multimodal language model developed to address the wide range of challenges in understanding video contexts.
Our core idea is to represent videos as interleaved multimodal sequences, either by linking to external knowledge databases or by using offline models.
To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets.
arXiv Detail & Related papers (2024-07-31T17:23:57Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' ability to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks, moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- Connecting Vision and Language with Video Localized Narratives [54.094554472715245]
We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language.
In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment.
Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects.
arXiv Detail & Related papers (2023-02-22T09:04:00Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future work on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the clip (a minimal data-structure sketch follows this list).
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
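For the VIOLIN entry above, here is a minimal Python sketch of what one video-hypothesis pair and an evaluation hook might look like. The field and function names are hypothetical illustrations of the described task format, not the dataset's released schema.

```python
# Hypothetical shape of one Violin example: the premise is a video clip plus its
# aligned subtitles, the hypothesis is a natural-language statement, and the
# label records whether the hypothesis is entailed or contradicted by the clip.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ViolinExample:
    clip_id: str
    subtitles: List[Tuple[float, float, str]]  # (start_sec, end_sec, subtitle text)
    hypothesis: str
    entailed: bool                             # True = entailed, False = contradicted


def accuracy(predict: Callable[[ViolinExample], bool],
             examples: List[ViolinExample]) -> float:
    """Binary entailment accuracy for any predict(example) -> bool callable."""
    correct = sum(predict(ex) == ex.entailed for ex in examples)
    return correct / max(len(examples), 1)
```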