HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group Activity
Scene Graph Generation in Videos
- URL: http://arxiv.org/abs/2312.07740v1
- Date: Tue, 28 Nov 2023 16:04:54 GMT
- Title: HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group Activity
Scene Graph Generation in Videos
- Authors: Naga VS Raviteja Chappa, Pha Nguyen, Thi Hoang Ngan Le and Khoa Luu
- Abstract summary: Group Activity Scene Graph (GASG) generation is a challenging task in computer vision.
We introduce a GASG dataset extending the JRDB dataset with nuanced annotations involving Appearance, Interaction, Position, Relationship, and Situation attributes.
We also introduce an innovative approach, the Hierarchical Attention-Flow (HAtt-Flow) mechanism, rooted in flow network theory, to enhance GASG performance.
- Score: 8.10024991952397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group Activity Scene Graph (GASG) generation is a challenging task in
computer vision, aiming to anticipate and describe relationships between
subjects and objects in video sequences. Traditional Video Scene Graph
Generation (VidSGG) methods focus on retrospective analysis, limiting their
predictive capabilities. To enrich scene-understanding capabilities, we
introduce a GASG dataset extending the JRDB dataset with nuanced annotations
involving \textit{Appearance, Interaction, Position, Relationship, and
Situation} attributes. This work also introduces an innovative approach,
\textbf{H}ierarchical \textbf{Att}ention-\textbf{Flow} (HAtt-Flow) Mechanism,
rooted in flow network theory to enhance GASG performance. Flow-Attention
incorporates flow conservation principles, fostering competition for sources
and allocation for sinks, effectively preventing the generation of trivial
attention. Our proposed approach offers a unique perspective on attention
mechanisms, where conventional "values" and "keys" are transformed into sources
and sinks, respectively, creating a novel framework for attention-based models.
Through extensive experiments, we demonstrate the effectiveness of our
HAtt-Flow model and the superiority of our proposed Flow-Attention mechanism.
This work represents a significant advancement in predictive video scene
understanding, providing valuable insights and techniques for applications that
require real-time relationship prediction in video data.
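
As a rough illustration of the flow-network framing, below is a minimal single-head PyTorch sketch of a Flowformer-style Flow-Attention step (Wu et al., 2022), which matches the abstract's description of flow conservation, competition among sources, and allocation to sinks. The function name and shapes are illustrative; the actual HAtt-Flow layer, with its hierarchical and multi-head design, may differ.

```python
import torch

def flow_attention(q, k, v, eps=1e-6):
    """Single-head sketch. q: (L, d); k, v: (S, d) -> (L, d)."""
    phi_q, phi_k = torch.sigmoid(q), torch.sigmoid(k)       # non-negative flow capacities
    # Incoming flow of each sink (query) and outgoing flow of each source (key).
    incoming = phi_q @ phi_k.sum(dim=0) + eps               # (L,)
    outgoing = phi_k @ phi_q.sum(dim=0) + eps               # (S,)
    # Flow conservation: re-measure each side after normalizing the other.
    conserved_in = phi_q @ (phi_k / outgoing[:, None]).sum(dim=0)   # (L,)
    conserved_out = phi_k @ (phi_q / incoming[:, None]).sum(dim=0)  # (S,)
    # Competition: sources fight for attention via softmax over outgoing flow.
    competed_v = v * torch.softmax(conserved_out, dim=0)[:, None]   # (S, d)
    # Kernelized (linear) aggregation, then allocation gating on the sinks.
    out = (phi_q / incoming[:, None]) @ (phi_k.T @ competed_v)      # (L, d)
    return torch.sigmoid(conserved_in)[:, None] * out
```

For example, `flow_attention(torch.randn(5, 16), torch.randn(7, 16), torch.randn(7, 16))` returns a `(5, 16)` tensor; because the aggregation is kernelized, no L-by-S attention matrix is formed and the cost stays linear in sequence length.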
Related papers
- Revealing Decurve Flows for Generalized Graph Propagation [108.80758541147418]
This study addresses the limitations of the traditional analysis of message-passing, central to graph learning, by defining generalized propagation with directed and weighted graphs.
We include a preliminary exploration of learned propagation patterns in datasets, a first in the field.
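
For context, a single propagation step on a directed, weighted graph can be written as below; this is a generic sketch (names and the dense-adjacency assumption are illustrative), whereas the paper analyzes a broader family of generalized propagation operators.

```python
import torch

def propagate(adj, x):
    """One message-passing step on a directed, weighted graph.

    adj: (N, N) dense weights with adj[i, j] = weight of edge i -> j
         (deliberately not symmetrized, unlike the undirected case).
    x:   (N, F) node features.
    """
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-12)  # out-degree normalizer
    return (adj / deg) @ x                               # weighted neighbor average
```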
arXiv Detail & Related papers (2024-02-13T14:13:17Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
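
One ingredient of such a selection mechanism can be sketched as a temporal-consistency score over a mask track; this is a hypothetical simplification (the paper's exemplar selection also leverages appearance cues):

```python
import torch

def consistency_score(masks):
    """Score a per-frame mask track by temporal IoU consistency.

    masks: (T, H, W) boolean masks for one object proposal across T frames.
    Returns the mean IoU between consecutive frames; high scores suggest a
    temporally stable (exemplar-worthy) flow-predicted mask.
    """
    a, b = masks[:-1].bool(), masks[1:].bool()
    inter = (a & b).flatten(1).sum(dim=1).float()
    union = (a | b).flatten(1).sum(dim=1).float().clamp(min=1)
    return (inter / union).mean()
```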
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they suffer from two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z)
- Self-Supervised Relation Alignment for Scene Graph Generation [44.3983804479146]
We introduce a self-supervised relational alignment regularization to improve scene graph generation performance.
The proposed alignment is general and can be combined with any existing scene graph generation framework.
We illustrate the effectiveness of this self-supervised relational alignment in conjunction with two scene graph generation architectures.
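
A generic form of such a regularizer, shown purely as an assumption-laden sketch (the paper's exact alignment objective differs), pulls the relation predictions of two branches toward each other:

```python
import torch
import torch.nn.functional as F

def alignment_loss(pred_logits, aux_logits, tau=1.0):
    """Align relation predictions from two branches of the same model.

    pred_logits, aux_logits: (R, C) logits over C predicate classes for R
    subject-object pairs; the auxiliary branch (e.g., an augmented view) is
    aligned with the main branch via a symmetric KL term.
    """
    p = F.log_softmax(pred_logits / tau, dim=-1)
    q = F.log_softmax(aux_logits / tau, dim=-1)
    return 0.5 * (F.kl_div(q, p, log_target=True, reduction="batchmean")
                  + F.kl_div(p, q, log_target=True, reduction="batchmean"))
```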
arXiv Detail & Related papers (2023-02-02T20:34:13Z)
- Attention in Attention: Modeling Context Correlation for Efficient Video Classification [47.938500236792244]
This paper proposes an efficient attention-in-attention (AIA) method for focus-wise feature refinement.
We instantiate video feature contexts as dynamics aggregated along a specific axis with global average and max pooling operations.
All the computational operations in attention units act on the pooled dimension, which results in a very small increase in computational cost.
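
The core trick, attending on a pooled descriptor rather than the full feature map, can be sketched as follows (a simplified single-unit stand-in; AIA nests two attention units and uses richer pooled contexts than plain averaging):

```python
import torch
import torch.nn as nn

class PooledAxisAttention(nn.Module):
    """Channel attention computed on a pooled axis of video features.

    All attention computation happens on the (C,)-sized pooled descriptor,
    so the overhead is tiny relative to the (C, T, H, W) feature map.
    """
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4))            # global average pool -> (B, C)
        gate = self.mlp(ctx)[:, :, None, None, None]
        return x * gate                        # reweight channels
```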
arXiv Detail & Related papers (2022-04-20T08:37:52Z)
- Deepened Graph Auto-Encoders Help Stabilize and Enhance Link Prediction [11.927046591097623]
Link prediction is a relatively under-studied graph learning task, with current state-of-the-art models based on one- or two-layers of shallow graph auto-encoder (GAE) architectures.
In this paper, we focus on addressing a limitation of current methods for link prediction, which can only use shallow GAEs and variational GAEs.
Our proposed methods innovatively incorporate standard auto-encoders (AEs) into the architectures of GAEs, where the standard AEs learn essential, low-dimensional representations by seamlessly integrating the adjacency information and node features.
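
For reference, a minimal two-layer GAE link predictor looks like the sketch below; the paper's contribution is to deepen this stack safely by weaving standard AEs into it, which the sketch omits:

```python
import torch
import torch.nn as nn

class GAE(nn.Module):
    """Minimal graph auto-encoder for link prediction.

    encode: two propagation layers (a_hat @ x @ W); decode: inner product.
    """
    def __init__(self, in_dim, hid_dim, z_dim):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim)
        self.w2 = nn.Linear(hid_dim, z_dim)

    def encode(self, a_hat, x):                # a_hat: normalized adjacency (N, N)
        h = torch.relu(a_hat @ self.w1(x))
        return a_hat @ self.w2(h)              # z: (N, z_dim)

    def decode(self, z):
        return torch.sigmoid(z @ z.T)          # edge probabilities (N, N)
```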
arXiv Detail & Related papers (2021-03-21T14:43:10Z)
- Variational Structured Attention Networks for Deep Visual Representation Learning [49.80498066480928]
We propose a unified deep framework to jointly learn both spatial attention maps and channel attention in a principled manner.
Specifically, we integrate the estimation and the interaction of the attentions within a probabilistic representation learning framework.
We implement the inference rules within the neural network, thus allowing for end-to-end learning of the probabilistic and the CNN front-end parameters.
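
A deterministic caricature of joint spatial and channel attention is sketched below; the paper instead treats both attentions as latent variables and infers them jointly within a probabilistic framework, which this toy module does not capture:

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    """Jointly applies a spatial attention map and channel attention."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)
        self.channel = nn.Linear(channels, channels)

    def forward(self, x):                          # x: (B, C, H, W)
        s = torch.sigmoid(self.spatial(x))         # (B, 1, H, W) spatial map
        c = torch.sigmoid(self.channel(x.mean(dim=(2, 3))))  # (B, C) channel gate
        return x * s * c[:, :, None, None]
```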
arXiv Detail & Related papers (2021-03-05T07:37:24Z)
- Action Localization through Continual Predictive Learning [14.582013761620738]
We present a new approach based on continual learning that uses feature-level predictions for self-supervision.
We use a stack of LSTMs coupled with a CNN encoder, along with novel attention mechanisms, to model the events in the video and use this model to predict high-level features for future frames.
This self-supervised framework is less complicated than other approaches but is very effective in learning robust visual representations for both labeling and localization.
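
A stripped-down version of this predictive setup (attention mechanisms omitted; names and dimensions are illustrative) might look like:

```python
import torch
import torch.nn as nn

class PredictiveModel(nn.Module):
    """Self-supervised future-feature prediction.

    A CNN encoder summarizes each frame; a stacked LSTM predicts the next
    frame's features. Prediction error can serve as a localization signal.
    """
    def __init__(self, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, feat_dim, num_layers=2, batch_first=True)

    def forward(self, frames):                    # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        pred, _ = self.lstm(feats[:, :-1])        # predict features of frames 1..T-1
        return nn.functional.mse_loss(pred, feats[:, 1:])  # self-supervised loss
```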
arXiv Detail & Related papers (2020-03-26T23:32:43Z)
- Graph Representation Learning via Graphical Mutual Information Maximization [86.32278001019854]
We propose a novel concept, Graphical Mutual Information (GMI), to measure the correlation between input graphs and high-level hidden representations.
We develop an unsupervised learning model trained by maximizing GMI between the input and output of a graph neural encoder.
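
In spirit, GMI-style training maximizes a discriminator-based mutual-information lower bound between inputs and embeddings, as in the hedged sketch below (GMI itself decomposes the bound over node features and edges, which is omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MIEstimator(nn.Module):
    """Jensen-Shannon-style lower bound on MI between inputs and embeddings.

    A learned bilinear discriminator scores (input, embedding) pairs: matched
    rows are positives, shuffled embeddings give negatives. Maximizing the
    returned estimate trains the encoder that produced z.
    """
    def __init__(self, in_dim, z_dim):
        super().__init__()
        self.bilinear = nn.Bilinear(in_dim, z_dim, 1)

    def forward(self, x, z):                               # x: (N, F), z: (N, D)
        pos = self.bilinear(x, z)                          # matched pairs
        neg = self.bilinear(x, z[torch.randperm(len(z))])  # shuffled pairs
        return (-F.softplus(-pos)).mean() - F.softplus(neg).mean()
```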
arXiv Detail & Related papers (2020-02-04T08:33:49Z)