Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic
Role Labeling
- URL: http://arxiv.org/abs/2308.05081v2
- Date: Sat, 12 Aug 2023 06:02:02 GMT
- Title: Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic
Role Labeling
- Authors: Yu Zhao, Hao Fei, Yixin Cao, Bobo Li, Meishan Zhang, Jianguo Wei, Min
Zhang, Tat-Seng Chua
- Abstract summary: Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they are subject to two key drawbacks.
- Score: 96.64607294592062
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Video Semantic Role Labeling (VidSRL) aims to detect the salient
events from given videos, by recognizing the predicate-argument event
structures and the interrelationships between events. While recent endeavors
have put forth methods for VidSRL, they are mostly subject to two key
drawbacks: the lack of fine-grained spatial scene perception and insufficient
modeling of video temporality. To this end, this work explores a novel
holistic spatio-temporal scene graph (namely HostSG) representation, based on
existing dynamic scene graph structures, which models both the fine-grained
spatial semantics and the temporal dynamics of videos for VidSRL.
Built upon HostSG, we present a niche-targeting VidSRL framework. A
scene-event mapping mechanism is first designed to bridge the gap between the
underlying scene structure and the high-level event semantic structure,
resulting in an overall hierarchical scene-event (termed ICE) graph structure.
We further perform iterative structure refinement to optimize the ICE graph,
such that the overall structural representation best coincides with the
demands of the end task. Finally, the three subtask predictions of VidSRL are
jointly decoded, where
the end-to-end paradigm effectively avoids error propagation. On the benchmark
dataset, our framework significantly outperforms the current best-performing
model. Further analyses are presented for a better understanding of the
advances of our method.
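To make the hierarchical scene-event structure concrete, the following is a
minimal Python sketch of how a two-layer graph in the spirit of HostSG might
be represented; all class and field names are illustrative assumptions, not
the authors' implementation.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a two-layer scene-event graph in the spirit of
# HostSG; class names and fields are illustrative, not the paper's code.

@dataclass
class SceneNode:
    """An object region in one frame (fine-grained spatial layer)."""
    node_id: int
    frame_idx: int
    label: str                       # e.g. "person", "ball"
    bbox: tuple[int, int, int, int]  # (x1, y1, x2, y2)

@dataclass
class EventNode:
    """A salient event: a predicate plus semantic-role arguments."""
    event_id: int
    predicate: str                               # e.g. "throw"
    roles: dict[str, list[int]] = field(default_factory=dict)

@dataclass
class SceneEventGraph:
    scene_nodes: list[SceneNode] = field(default_factory=list)
    event_nodes: list[EventNode] = field(default_factory=list)
    spatial_edges: set[tuple[int, int]] = field(default_factory=set)   # within a frame
    temporal_edges: set[tuple[int, int]] = field(default_factory=set)  # across frames
    mapping_edges: set[tuple[int, int]] = field(default_factory=set)   # event -> scene node

    def map_scene_to_event(self, event: EventNode, node: SceneNode, role: str) -> None:
        """Scene-event mapping: ground an event argument in a scene node,
        bridging the scene layer and the event layer."""
        event.roles.setdefault(role, []).append(node.node_id)
        self.mapping_edges.add((event.event_id, node.node_id))
```

On such a structure, iterative refinement would then amount to pruning or
reweighting the three edge sets until the graph best serves the VidSRL
subtasks.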
Related papers
- Rethinking Video Segmentation with Masked Video Consistency: Did the Model Learn as Intended? [22.191260650245443]
Video segmentation aims at partitioning video sequences into meaningful segments based on objects or regions of interest within frames.
Current video segmentation models are often derived from image segmentation techniques, which struggle to cope with small-scale or class-imbalanced video datasets.
We propose a training strategy, Masked Video Consistency, which enhances
spatial and temporal feature aggregation.
arXiv Detail & Related papers (2024-08-20T08:08:32Z)
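The single-sentence summary leaves the mechanics implicit; below is a hedged
PyTorch sketch of one plausible masked-consistency objective, in which
predictions on patch-masked frames are pushed to agree with predictions on the
intact clip. The masking scheme and loss are assumptions, not the paper's
exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_consistency_loss(model, frames, mask_ratio=0.5, patch=16):
    """Plausible masked-consistency objective (an assumption, not the
    paper's exact loss): segmentation predictions on patch-masked frames
    should agree with predictions on the unmasked frames.

    frames: (B, T, C, H, W) video clip; model returns per-frame logits.
    """
    with torch.no_grad():
        target = model(frames)          # (B, T, K, H, W) class logits

    B, T, C, H, W = frames.shape
    # Drop random patches: build a low-res keep/drop grid, then upsample.
    grid = torch.rand(B * T, 1, H // patch, W // patch, device=frames.device)
    keep = (grid > mask_ratio).float()
    keep = F.interpolate(keep, scale_factor=patch, mode="nearest")
    masked = frames * keep.view(B, T, 1, H, W)

    pred = model(masked)
    # KL between masked-view and intact-view class distributions.
    return F.kl_div(pred.log_softmax(dim=2), target.softmax(dim=2),
                    reduction="batchmean")
```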
- Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong-performing video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on
the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average
Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
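As a rough illustration of the two-stage design described above (per-frame
matching, then local refinement), here is a minimal PyTorch sketch; the real
TAPIR uses learned networks for both stages, so everything below is a
simplification, not the paper's model.

```python
import torch

def track_point(feats, query_feat, radius=4, iters=4):
    """Two-stage point tracking sketch: matching then refinement.

    feats:      (T, C, H, W) per-frame feature maps.
    query_feat: (C,) feature of the queried point.
    """
    T, C, H, W = feats.shape

    # Stage 1: matching -- independently pick the best-correlated location
    # in every frame as an initial candidate trajectory.
    corr = torch.einsum("c,tchw->thw", query_feat, feats)
    flat = corr.flatten(1).argmax(dim=1)                   # (T,)
    traj = torch.stack([flat % W, flat // W], 1).float()   # (T, 2) as (x, y)

    # Stage 2: refinement -- nudge each position toward the correlation
    # peak inside a local window around the current estimate.
    for _ in range(iters):
        for t in range(T):
            x, y = traj[t].round().long().tolist()
            x0, x1 = max(x - radius, 0), min(x + radius + 1, W)
            y0, y1 = max(y - radius, 0), min(y + radius + 1, H)
            local = corr[t, y0:y1, x0:x1]
            w = local.flatten().softmax(0).view_as(local)  # soft-argmax weights
            ys, xs = torch.meshgrid(torch.arange(y0, y1), torch.arange(x0, x1),
                                    indexing="ij")
            traj[t, 0] = (w * xs.float()).sum()
            traj[t, 1] = (w * ys.float()).sum()
    return traj  # (T, 2) pixel coordinates of the tracked point per frame
```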
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding [35.73830796500975]
We present an end-to-end one-stage framework, termed Spatio-Temporal
Consistency-Aware Transformer (STCAT).
To generate its grounding template under sufficient video-textual perception,
an encoder-decoder architecture is proposed for effective global context
modeling.
Our method outperforms previous state-of-the-art methods by clear margins on
two challenging video benchmarks.
arXiv Detail & Related papers (2022-09-27T11:13:04Z)
- Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation
(VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the
context modeling for relation prediction from the complicated low-level entity
tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)
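To illustrate the decoupling at a high level, here is a schematic Python
sketch of a detect-to-track pipeline; `detect`, `iou`, and
`relation_classifier` are hypothetical placeholders, and the real method's
context modeling is far richer than this greedy baseline.

```python
# Schematic detect-to-track pipeline for video scene graph generation:
# entities are detected and linked first, and relation prediction then
# operates on tracklet pairs, decoupled from low-level tracking.

def build_scene_graph(frames, detect, relation_classifier, iou, iou_thresh=0.5):
    # 1) Per-frame detection.
    detections = [detect(f) for f in frames]   # each: list of (bbox, label, feat)

    # 2) Greedy tracking: link each detection to the best-overlapping
    #    tracklet from earlier frames, or start a new tracklet.
    tracklets = [[d] for d in detections[0]] if detections else []
    for dets in detections[1:]:
        for d in dets:
            best = max(tracklets, key=lambda t: iou(t[-1][0], d[0]), default=None)
            if best is not None and iou(best[-1][0], d[0]) >= iou_thresh:
                best.append(d)
            else:
                tracklets.append([d])

    # 3) Relation prediction over tracklet pairs -- the context-modeling
    #    step, now independent of the tracking details above.
    graph = []
    for i, subj in enumerate(tracklets):
        for j, obj in enumerate(tracklets):
            if i != j:
                rel = relation_classifier(subj, obj)  # e.g. "holds", "behind"
                if rel is not None:
                    graph.append((i, rel, j))
    return tracklets, graph
```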
- Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust
representation by modeling cross-scale spatial-temporal correlation.
The proposed CTL framework utilizes a CNN backbone and a key-point estimator
to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by
considering both global contextual information and the physical connections of
the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
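As a toy illustration of mixing physical skeleton edges with similarity-driven
context edges, consider the sketch below; the skeleton table and the top-k
similarity rule are assumptions for illustration, not CTL's actual topology
learning.

```python
import torch
import torch.nn.functional as F

# Example physical connections between body key points (illustrative).
SKELETON = [(0, 1), (1, 2), (2, 3), (1, 4), (4, 5),
            (1, 6), (6, 7), (7, 8), (6, 9), (9, 10)]

def build_graph(keypoint_feats, topk=3):
    """keypoint_feats: (N, C) features, one per estimated key point (N >= 11
    for the example skeleton). Returns an adjacency matrix mixing physical
    and contextual edges."""
    N, _ = keypoint_feats.shape
    adj = torch.zeros(N, N)

    # Physical edges from the skeleton prior.
    for i, j in SKELETON:
        adj[i, j] = adj[j, i] = 1.0

    # Context edges: connect each key point to its most similar peers,
    # injecting global contextual information beyond the skeleton.
    sim = F.normalize(keypoint_feats, dim=1)
    sim = sim @ sim.t()
    sim.fill_diagonal_(-1)
    nbrs = sim.topk(topk, dim=1).indices
    for i in range(N):
        for j in nbrs[i]:
            adj[i, j] = adj[j, i] = 1.0
    return adj

# Coarser scales (body parts, whole body) could be built by pooling
# key-point features group-wise and repeating the same construction.
```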
- Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for
the automatic recognition of human actions in video.
We propose a complete end-to-end supervised learning approach that can better
learn relationships between actions over time.
We evaluate our system on challenging public benchmark datasets, including the
MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)
- Fine-Grained Instance-Level Sketch-Based Video Retrieval [159.12935292432743]
We propose a novel cross-modal retrieval problem of fine-grained
instance-level sketch-based video retrieval (FG-SBVR).
Compared with sketch-based still-image retrieval and coarse-grained
category-level video retrieval, this task is more challenging because both
visual appearance and motion must be matched simultaneously at a fine-grained
level.
We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
arXiv Detail & Related papers (2020-02-21T18:28:35Z)