Video Event Extraction via Tracking Visual States of Arguments
- URL: http://arxiv.org/abs/2211.01781v2
- Date: Sat, 5 Nov 2022 15:27:43 GMT
- Title: Video Event Extraction via Tracking Visual States of Arguments
- Authors: Guang Yang, Manling Li, Jiajie Zhang, Xudong Lin, Shih-Fu Chang, Heng Ji
- Abstract summary: We propose a novel framework to detect video events by tracking the changes in the visual states of all involved arguments.
In order to capture the visual state changes of arguments, we decompose them into changes in pixels within objects, displacements of objects, and interactions among multiple arguments.
- Score: 72.54932474653444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video event extraction aims to detect salient events from a video and
identify the arguments for each event as well as their semantic roles. Existing
methods focus on capturing the overall visual scene of each frame, ignoring
fine-grained argument-level information. Inspired by the definition of events
as changes of states, we propose a novel framework to detect video events by
tracking the changes in the visual states of all involved arguments, which are
expected to provide the most informative evidence for the extraction of video
events. In order to capture the visual state changes of arguments, we decompose
them into changes in pixels within objects, displacements of objects, and
interactions among multiple arguments. We further propose Object State
Embedding, Object Motion-aware Embedding and Argument Interaction Embedding to
encode and track these changes respectively. Experiments on various video event
extraction tasks demonstrate significant improvements compared to
state-of-the-art models. In particular, on verb classification, we achieve
3.49% absolute gains (19.53% relative gains) in F1@5 on Video Situation
Recognition.
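Concretely, the decomposition can be pictured as three per-argument signals that are encoded separately and then fused. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: region features at consecutive frames stand in for within-object pixel changes, bounding-box deltas for object displacement, and self-attention over co-occurring regions for argument interactions. All module names, dimensions, and inputs are assumptions.

```python
# Illustrative sketch only (assumed shapes and layers, not the paper's code):
# fuse three views of each argument's state change into one embedding.
import torch
import torch.nn as nn

class ArgumentStateEncoder(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        # Object State Embedding: appearance (pixel-level) change between frames t and t+1
        self.state_proj = nn.Linear(2 * feat_dim, feat_dim)
        # Object Motion-aware Embedding: bounding-box displacement (dx, dy, dw, dh)
        self.motion_proj = nn.Linear(4, feat_dim)
        # Argument Interaction Embedding: attention over co-occurring argument regions
        self.interaction = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)

    def forward(self, feats_t, feats_t1, box_deltas):
        # feats_t, feats_t1: (batch, num_args, feat_dim) region features at frames t, t+1
        # box_deltas:        (batch, num_args, 4) box displacements between the two frames
        state = self.state_proj(torch.cat([feats_t, feats_t1], dim=-1))
        motion = self.motion_proj(box_deltas)
        inter, _ = self.interaction(feats_t1, feats_t1, feats_t1)
        return self.fuse(torch.cat([state, motion, inter], dim=-1))

# Toy usage: 2 clips, 5 argument regions, 256-d features
enc = ArgumentStateEncoder()
arg_repr = enc(torch.randn(2, 5, 256), torch.randn(2, 5, 256), torch.randn(2, 5, 4))
print(arg_repr.shape)  # torch.Size([2, 5, 256]); would feed an event/verb classifier
```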
Related papers
- EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior video event awareness.
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of both detailed event content and complex event temporal structure.
arXiv Detail & Related papers (2024-07-10T09:09:58Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
- Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition [54.938128496934695]
We propose a new framework that reasons over object-behaviours extracted from raw video clips to recognize the clip's corresponding adverb-types.
Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips.
We release two new datasets of object-behaviour-facts extracted from raw video clips.
arXiv Detail & Related papers (2023-07-09T09:04:26Z)
- Video Segmentation Learning Using Cascade Residual Convolutional Neural Network [0.0]
We propose a novel deep learning video segmentation approach that incorporates residual information into the foreground detection learning process.
Experiments conducted on the Change Detection 2014 dataset and on the private PetrobrasROUTES dataset from Petrobras support the effectiveness of the proposed approach.
arXiv Detail & Related papers (2022-12-20T16:56:54Z)
- VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning [19.73126931526359]
Video paragraph captioning aims to generate a multi-sentence description of an untrimmed video with several temporal event locations in coherent storytelling.
We first propose a visual-linguistic (VL) feature, in which the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements.
We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video.
arXiv Detail & Related papers (2022-11-28T07:39:20Z)
- Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions, each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
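As a rough illustration of the space-time object graph idea in the entry above (a sketch under assumed inputs, not the paper's implementation), the snippet below connects detected objects that co-occur in a frame with "space" edges and links the same track across consecutive frames with "time" edges.

```python
# Hypothetical sketch: build a spatio-temporal graph over tracked object detections.
# `tracks` maps track_id -> list of (frame_idx, feature) pairs; the layout is assumed.
from collections import defaultdict

def build_st_graph(tracks):
    nodes, space_edges, time_edges = [], [], []
    by_frame = defaultdict(list)
    for tid, dets in tracks.items():
        dets = sorted(dets, key=lambda d: d[0])
        for i, (frame, _) in enumerate(dets):
            node = (tid, frame)
            nodes.append(node)
            by_frame[frame].append(node)
            if i > 0 and dets[i - 1][0] == frame - 1:       # temporal link along the track
                time_edges.append(((tid, frame - 1), node))
    for frame, objs in by_frame.items():                    # spatial links within a frame
        space_edges += [(a, b) for i, a in enumerate(objs) for b in objs[i + 1:]]
    return nodes, space_edges, time_edges

# Toy usage: two tracked objects over three frames
nodes, space_e, time_e = build_st_graph({0: [(0, None), (1, None), (2, None)],
                                         1: [(1, None), (2, None)]})
```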
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.