Unified Graph Structured Models for Video Understanding
- URL: http://arxiv.org/abs/2103.15662v1
- Date: Mon, 29 Mar 2021 14:37:35 GMT
- Title: Unified Graph Structured Models for Video Understanding
- Authors: Anurag Arnab, Chen Sun, Cordelia Schmid
- Abstract summary: We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method more effectively models relationships between relevant entities in the scene.
- Score: 93.72081456202672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate video understanding involves reasoning about the relationships
between actors, objects and their environment, often over long temporal
intervals. In this paper, we propose a message passing graph neural network
that explicitly models these spatio-temporal relations and can use explicit
representations of objects, when supervision is available, and implicit
representations otherwise. Our formulation generalises previous structured
models for video understanding, and allows us to study how different design
choices in graph structure and representation affect the model's performance.
We demonstrate our method on two different tasks requiring relational reasoning
in videos -- spatio-temporal action detection on AVA and UCF101-24, and video
scene graph classification on the recent Action Genome dataset -- and achieve
state-of-the-art results on all three datasets. Furthermore, we show
quantitatively and qualitatively how our method is able to more effectively
model relationships between relevant entities in the scene.
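To make the formulation concrete, below is a minimal sketch of one message passing round over such a spatio-temporal graph. It assumes PyTorch; the layer name, feature dimensions, and fully connected edge structure are illustrative placeholders, not the authors' released implementation.

```python
# Hedged sketch of one message passing round over actor/object nodes.
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.message_fn = nn.Linear(2 * dim, dim)  # message from a (sender, receiver) pair
        self.update_fn = nn.GRUCell(dim, dim)      # node update from aggregated messages

    def forward(self, nodes, edges):
        # nodes: (N, dim) features for actor/object nodes across frames
        # edges: (E, 2) long tensor of (sender, receiver) node indices
        senders, receivers = edges[:, 0], edges[:, 1]
        pairs = torch.cat([nodes[senders], nodes[receivers]], dim=-1)
        messages = torch.relu(self.message_fn(pairs))              # (E, dim)
        agg = torch.zeros_like(nodes).index_add_(0, receivers, messages)
        return self.update_fn(agg, nodes)                          # updated node states

# Toy usage: 4 nodes (e.g. 2 actors, 2 objects) with all pairwise edges.
layer = MessagePassingLayer(64)
nodes = torch.randn(4, 64)
edges = torch.tensor([(i, j) for i in range(4) for j in range(4) if i != j])
nodes = layer(nodes, edges)
```

Stacking such layers lets information propagate over longer paths in the graph, which is how long-range spatio-temporal dependencies would be captured.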
Related papers
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z)
- Towards Scene Graph Anticipation [10.678727237318503]
We introduce the task of Scene Graph Anticipation (SGA).
We adapt state-of-the-art scene graph generation methods as baselines to anticipate future pair-wise relationships between objects.
In SceneSayer, we leverage object-centric representations of relationships to reason about the observed video frames and model the evolution of relationships between objects.
arXiv Detail & Related papers (2024-03-07T21:08:51Z)
- Spatio-Temporal Relation Learning for Video Anomaly Detection [35.59510027883497]
Anomaly identification is highly dependent on the relationship between the object and the scene.
In this paper, we propose a Spatial-Temporal Relation Learning framework to tackle the video anomaly detection task.
Experiments are conducted on three public datasets, and the superior performance over the state-of-the-art methods demonstrates the effectiveness of our method.
arXiv Detail & Related papers (2022-09-27T02:19:31Z)
- Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos [0.40778318140713216]
This study introduces a graph-structured approach named Semantic2Graph to model long-term dependencies in videos.
We have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions.
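As a rough illustration of the signed-edge idea, here is a hedged Python sketch: positive edges pull features of frames within the same action together, while negative edges push features across action boundaries apart. The edge lists, weights, and additive aggregation rule are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative frame graph with weighted positive and negative semantic edges.
import torch

num_frames, dim = 6, 32
feats = torch.randn(num_frames, dim)   # per-frame multi-modal features

# (src, dst, weight): positive edges link frames of the same action,
# negative edges link frames across an assumed action boundary.
pos_edges = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.7)]
neg_edges = [(2, 3, 0.6), (4, 5, 0.5)]

def aggregate(feats, pos_edges, neg_edges):
    out = feats.clone()
    for s, d, w in pos_edges:   # pull same-action frames together
        out[d] += w * feats[s]
    for s, d, w in neg_edges:   # push cross-boundary frames apart
        out[d] -= w * feats[s]
    return out

frame_repr = aggregate(feats, pos_edges, neg_edges)
```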
arXiv Detail & Related papers (2022-09-13T00:01:23Z)
- Temporal Relevance Analysis for Video Action Models [70.39411261685963]
We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models.
We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected.
arXiv Detail & Related papers (2022-04-25T19:06:48Z)
- Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z)
- TCL: Transformer-based Dynamic Graph Modelling via Contrastive Learning [87.38675639186405]
We propose a novel graph neural network approach, called TCL, which deals with the dynamically-evolving graph in a continuous-time fashion.
To the best of our knowledge, this is the first attempt to apply contrastive learning to representation learning on dynamic graphs.
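A minimal sketch of the contrastive ingredient, assuming an InfoNCE-style objective in PyTorch: embeddings of the two endpoints of an observed interaction act as positives, with the other pairs in the batch as negatives. The shapes and temperature value are illustrative, not TCL's exact loss.

```python
# InfoNCE-style contrastive loss over interacting node-pair embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(src_emb, dst_emb, temperature=0.1):
    # src_emb, dst_emb: (B, dim) embeddings of B interacting node pairs
    src = F.normalize(src_emb, dim=-1)
    dst = F.normalize(dst_emb, dim=-1)
    logits = src @ dst.t() / temperature   # (B, B) similarity matrix
    labels = torch.arange(src.size(0))     # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
```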
arXiv Detail & Related papers (2021-05-17T15:33:25Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z) - Object Relational Graph with Teacher-Recommended Learning for Video
Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
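A hedged sketch of what an ORG-style encoder could look like: detected object features are enriched through a learned pairwise relation matrix in an attention-like update. The class name, dimensions, and scaled dot-product form are illustrative assumptions, not the paper's released code.

```python
# Illustrative object relational graph encoder: relation-enriched features.
import torch
import torch.nn as nn

class ORGEncoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, objs):
        # objs: (num_objects, dim) detected object features for a clip
        rel = self.query(objs) @ self.key(objs).t()            # pairwise relation scores
        rel = torch.softmax(rel / objs.size(-1) ** 0.5, dim=-1)
        return objs + rel @ self.value(objs)                   # relation-enriched features

enc = ORGEncoder(64)
enriched = enc(torch.randn(5, 64))
```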
arXiv Detail & Related papers (2020-02-26T15:34:52Z)