Spot What Matters: Learning Context Using Graph Convolutional Networks
for Weakly-Supervised Action Detection
- URL: http://arxiv.org/abs/2107.13648v1
- Date: Wed, 28 Jul 2021 21:37:18 GMT
- Title: Spot What Matters: Learning Context Using Graph Convolutional Networks
for Weakly-Supervised Action Detection
- Authors: Michail Tsiaousis, Gertjan Burghouts, Fieke Hillerström and Peter
van der Putten
- Abstract summary: We introduce an architecture based on self-attention and Graph Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The dominant paradigm in spatiotemporal action detection is to classify
actions using spatiotemporal features learned by 2D or 3D Convolutional
Networks. We argue that several actions are characterized by their context,
such as relevant objects and actors present in the video. To this end, we
introduce an architecture based on self-attention and Graph Convolutional
Networks in order to model contextual cues, such as actor-actor and
actor-object interactions, to improve human action detection in video. We are
interested in achieving this in a weakly-supervised setting, i.e. using as few
annotations as possible in terms of action bounding boxes. Our model aids
explainability by visualizing the learned context as an attention map, even for
actions and objects unseen during training. We evaluate how well our model
highlights the relevant context by introducing a quantitative metric based on
recall of objects retrieved by attention maps. Our model relies on a 3D
convolutional RGB stream, and does not require expensive optical flow
computation. We evaluate our models on the DALY dataset, which consists of
human-object interaction actions. Experimental results show that our
contextualized approach outperforms a baseline action detection approach by
more than 2 points in Video-mAP. Code is available at
https://github.com/micts/acgcn
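To make the context-modeling recipe concrete, below is a minimal PyTorch sketch, not the authors' released code: the module and variable names (ActorContextGCN, actor, context, d_model) are illustrative assumptions. Self-attention between an actor feature and surrounding context features yields a soft adjacency over context nodes, and a graph-convolution-style aggregation folds the attended context back into the actor representation; the attention weights are what can be visualized as a context map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorContextGCN(nn.Module):
    """Toy actor-context graph layer: attention weights define a soft
    adjacency between one actor node and N context nodes; a graph
    convolution then aggregates the attended context into the actor."""
    def __init__(self, d_model=256):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)   # actor -> query
        self.k = nn.Linear(d_model, d_model)   # context -> keys
        self.v = nn.Linear(d_model, d_model)   # context -> values (GCN transform)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, actor, context):
        # actor: (B, d); context: (B, N, d), e.g. regions of a 3D CNN feature map
        q = self.q(actor).unsqueeze(1)                      # (B, 1, d)
        k = self.k(context)                                 # (B, N, d)
        attn = torch.softmax(
            (q * k).sum(-1) / k.shape[-1] ** 0.5, dim=-1)   # (B, N) soft adjacency
        agg = torch.bmm(attn.unsqueeze(1), self.v(context)).squeeze(1)  # (B, d)
        fused = self.out(torch.cat([actor, agg], dim=-1))   # update actor node
        return F.relu(fused), attn                          # attn doubles as the map

# Example: a batch of 2 actors, each with 10 candidate context regions
layer = ActorContextGCN(d_model=256)
fused, attn_map = layer(torch.randn(2, 256), torch.randn(2, 10, 256))
print(fused.shape, attn_map.shape)  # torch.Size([2, 256]) torch.Size([2, 10])
```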
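The object-recall metric can likewise be illustrated with a hedged sketch: count the annotated object boxes that an attention map "retrieves". The retrieval rule used here (mean attention inside a box exceeding a multiple of the map's global mean) and the names attention_object_recall and thresh are assumptions for illustration, not necessarily the paper's exact criterion.

```python
import numpy as np

def attention_object_recall(attn_map, object_boxes, thresh=0.5):
    """Fraction of ground-truth object boxes retrieved by an attention map.

    attn_map: (H, W) array in [0, 1]; object_boxes: list of (x1, y1, x2, y2)
    in pixel coordinates. A box counts as retrieved if the mean attention
    inside it exceeds `thresh` times the map's global mean (an illustrative
    rule; the paper's exact criterion may differ).
    """
    if not object_boxes:
        return 1.0
    baseline = attn_map.mean()
    hits = 0
    for x1, y1, x2, y2 in object_boxes:
        inside = attn_map[y1:y2, x1:x2]
        if inside.size and inside.mean() > thresh * baseline:
            hits += 1
    return hits / len(object_boxes)

# Example: attention concentrated on one of two annotated objects
attn = np.zeros((8, 8)); attn[2:4, 2:4] = 1.0
print(attention_object_recall(attn, [(2, 2, 4, 4), (6, 6, 8, 8)]))  # 0.5
```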
Related papers
- Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network [2.223052975765005]
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction.
The system represents the 2D or 3D spatial relations of humans and objects, obtained from detection results in video data, as a graph.
We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
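As a rough illustration of turning per-frame detections into such a graph, the sketch below connects detected humans and objects with edge weights that decay with the distance between box centres. This is a generic construction in the spirit of PGCN, not its exact formulation; detection_graph and sigma are invented names.

```python
import numpy as np

def detection_graph(boxes, sigma=50.0):
    """Build a weighted graph over detected humans/objects in one frame.

    boxes: (N, 4) array of (x1, y1, x2, y2). Edge weights decay with the
    distance between box centres, so nearby human-object pairs are
    strongly connected. Illustrative only, not PGCN's formulation.
    """
    centres = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    diff = centres[:, None, :] - centres[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return np.exp(-(dist ** 2) / (2 * sigma ** 2))  # (N, N) adjacency

boxes = np.array([[10, 10, 50, 120],    # person
                  [60, 80, 90, 110]])   # object
print(detection_graph(boxes).round(2))  # [[1.  0.56] [0.56 1. ]]
```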
arXiv Detail & Related papers (2024-10-10T13:39:17Z)
- Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [59.87033229815062]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered.
Previous research employed interactive perception for manipulating articulated objects, but such open-loop approaches often overlook the interaction dynamics.
We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z)
- ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [4.736059095502584]
This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition.
We introduce a novel cross-architecture approach in which 3D Convolutional Neural Networks (3D CNNs) and video transformers (ViT) are utilised to capture different aspects of action representations.
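A hedged sketch of the general cross-architecture pseudo-labelling recipe: each model's confident predictions on unlabelled clips supervise the other architecture. The function name and confidence threshold are invented, and ActNetFormer's actual objective also includes contrastive terms; this shows only the pseudo-labelling step.

```python
import torch
import torch.nn.functional as F

def cross_arch_pseudo_labels(logits_cnn, logits_vit, conf=0.9):
    """Toy cross-architecture pseudo-labelling step on unlabelled clips.

    Each model's confident class predictions become targets for the
    other model; low-confidence clips are masked out of the loss.
    """
    conf_cnn, pl_cnn = logits_cnn.softmax(-1).max(-1)  # CNN's pseudo-labels
    conf_vit, pl_vit = logits_vit.softmax(-1).max(-1)  # ViT's pseudo-labels
    # ViT learns from confident CNN labels, and vice versa
    loss_vit = (F.cross_entropy(logits_vit, pl_cnn, reduction="none")
                * (conf_cnn > conf)).mean()
    loss_cnn = (F.cross_entropy(logits_cnn, pl_vit, reduction="none")
                * (conf_vit > conf)).mean()
    return loss_cnn, loss_vit

logits_cnn = torch.randn(4, 10)  # 4 unlabelled clips, 10 action classes
logits_vit = torch.randn(4, 10)
print(cross_arch_pseudo_labels(logits_cnn, logits_vit))
```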
arXiv Detail & Related papers (2024-04-09T12:09:56Z)
- A Hierarchical Graph-based Approach for Recognition and Description Generation of Bimanual Actions in Videos [3.7486111821201287]
This study describes a novel method integrating graph-based modeling with layered hierarchical attention mechanisms.
The complexity of our approach is empirically tested using several 2D and 3D datasets.
arXiv Detail & Related papers (2023-10-01T13:45:48Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Efficient Spatialtemporal Context Modeling for Action Recognition [42.30158166919919]
We propose a recurrent 3D criss-cross attention (RCCA-3D) module to model dense long-range contextual information in video for action recognition.
We model the relationship between points in the same line along the horizontal, vertical, and depth directions at each time step, which forms a 3D criss-cross structure.
Compared with the non-local method, the proposed RCCA-3D module reduces the number of parameters and FLOPs by 25% and 11%, respectively, for video context modeling.
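To illustrate the criss-cross structure, the sketch below gathers, for one spatiotemporal position, the features lying on its horizontal, vertical, and temporal (depth) lines; restricting attention to this set rather than all T*H*W positions is what keeps the cost below full non-local attention. This is an illustrative reading of the idea, not the RCCA-3D implementation, and criss_cross_neighbors is an invented name.

```python
import torch

def criss_cross_neighbors(feat, t, y, x):
    """Collect the 3D criss-cross context of one position: all features
    sharing its horizontal line, vertical line, or temporal (depth) line.

    feat: (C, T, H, W). Returns (T + H + W - 2, C): the three lines,
    with the centre position counted once.
    """
    depth = feat[:, :, y, x].t()        # (T, C) along time
    vert  = feat[:, t, :, x].t()        # (H, C) along the vertical line
    horiz = feat[:, t, y, :].t()        # (W, C) along the horizontal line
    # drop the centre from two of the lines so it is counted once
    vert  = torch.cat([vert[:y], vert[y + 1:]])
    horiz = torch.cat([horiz[:x], horiz[x + 1:]])
    return torch.cat([depth, vert, horiz], dim=0)

feat = torch.randn(64, 8, 14, 14)       # (channels, time, height, width)
ctx = criss_cross_neighbors(feat, t=3, y=5, x=7)
print(ctx.shape)  # torch.Size([34, 64]) == (8 + 13 + 13, 64)
```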
arXiv Detail & Related papers (2021-03-20T14:48:12Z)
- Where2Act: From Pixels to Actions for Articulated 3D Objects [54.19638599501286]
We extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts.
We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation.
Our learned models even transfer to real-world data.
arXiv Detail & Related papers (2021-01-07T18:56:38Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)