Grounded Video Situation Recognition
- URL: http://arxiv.org/abs/2210.10828v1
- Date: Wed, 19 Oct 2022 18:38:10 GMT
- Title: Grounded Video Situation Recognition
- Authors: Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi
- Abstract summary: We present a novel three-stage Transformer model, VideoWhisperer, that is empowered to make joint predictions.
Our model operates on a group of events (clips) simultaneously and predicts verbs, verb-role pairs, their nouns, and their grounding on-the-fly.
- Score: 37.279915290069326
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense video understanding requires answering several questions such as who is
doing what to whom, with what, how, why, and where. Recently, Video Situation
Recognition (VidSitu) has been framed as a task for structured prediction of
multiple events, their relationships, and the actions and various verb-role pairs
attached to descriptive entities. This task not only poses several challenges in
identifying, disambiguating, and co-referencing entities across multiple verb-role
pairs, but also raises challenges for evaluation. In this work, we propose the
addition of spatio-temporal grounding as an essential component of the
structured prediction task in a weakly supervised setting, and present a novel
three-stage Transformer model, VideoWhisperer, that is empowered to make joint
predictions. In stage one, we learn contextualised embeddings for video
features in parallel with key objects that appear in the video clips to enable
fine-grained spatio-temporal reasoning. The second stage sees verb-role queries
attend and pool information from object embeddings, localising answers to
questions posed about the action. The final stage generates these answers as
captions to describe each verb-role pair present in the video. Our model
operates on a group of events (clips) simultaneously and predicts verbs,
verb-role pairs, their nouns, and their grounding on-the-fly. When evaluated on
a grounding-augmented version of the VidSitu dataset, we observe a large
improvement in entity captioning accuracy, as well as the ability to localize
verb-roles without grounding annotations at training time.
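As a concrete illustration of the pipeline described above, the following is a minimal sketch of how such a three-stage model could be wired together with standard PyTorch modules. This is not the authors' implementation: the module choices, dimensions, the way verb-role queries are supplied, and the single-linear "caption head" (standing in for an autoregressive caption decoder) are all simplifying assumptions, and the grounding would be read off the role-to-object cross-attention weights rather than supervised directly.

```python
# Minimal sketch (not the authors' code) of the three-stage design described in the
# abstract. All module names, dimensions, and the toy caption head are illustrative.
import torch
import torch.nn as nn


class VideoWhispererSketch(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=2, n_verbs=1000, vocab_size=5000):
        super().__init__()
        # Stage 1: contextualise clip-level video features jointly with key-object features.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.verb_head = nn.Linear(d_model, n_verbs)

        # Stage 2: verb-role queries attend to and pool from the contextualised object tokens.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.role_decoder = nn.TransformerDecoder(dec_layer, n_layers)

        # Stage 3: produce a noun description per verb-role pair; a single linear layer
        # stands in for the paper's caption generator.
        self.caption_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, object_feats, role_queries):
        # video_feats:  (B, T, d)  clip features for a group of events
        # object_feats: (B, O, d)  detected key-object features across the clips
        # role_queries: (B, R, d)  one query per verb-role pair
        tokens = torch.cat([video_feats, object_feats], dim=1)
        ctx = self.context_encoder(tokens)                      # Stage 1
        verb_logits = self.verb_head(ctx[:, : video_feats.size(1)].mean(dim=1))

        obj_ctx = ctx[:, video_feats.size(1):]                  # contextualised object tokens
        role_emb = self.role_decoder(role_queries, obj_ctx)     # Stage 2: attend and pool

        caption_logits = self.caption_head(role_emb)            # Stage 3 (greatly simplified)
        return verb_logits, caption_logits


if __name__ == "__main__":
    B, T, O, R, d = 2, 5, 10, 6, 512
    model = VideoWhispererSketch(d_model=d)
    verbs, captions = model(torch.randn(B, T, d), torch.randn(B, O, d), torch.randn(B, R, d))
    print(verbs.shape, captions.shape)  # (2, 1000) and (2, 6, 5000)
```

In practice the role queries would be derived from the predicted verbs and their role types rather than passed in as free tensors, but the sketch shows the flow of information across the three stages: joint video-object contextualisation, role-query attention and pooling, and per-role caption generation.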
Related papers
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z) - SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z) - Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z) - Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset 'ApartmenTour' that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.