A Hierarchical Graph-based Approach for Recognition and Description
Generation of Bimanual Actions in Videos
- URL: http://arxiv.org/abs/2310.00670v1
- Date: Sun, 1 Oct 2023 13:45:48 GMT
- Title: A Hierarchical Graph-based Approach for Recognition and Description
Generation of Bimanual Actions in Videos
- Authors: Fatemeh Ziaeetabar, Reza Safabakhsh, Saeedeh Momtazi, Minija
Tamosiunaite, Florentin Wörgötter
- Abstract summary: This study describes a novel method integrating graph-based modeling with layered hierarchical attention mechanisms.
The performance of our approach is empirically tested using several 2D and 3D datasets.
- Score: 3.7486111821201287
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nuanced understanding and the generation of detailed descriptive content for
(bimanual) manipulation actions in videos are important for disciplines such as
robotics, human-computer interaction, and video content analysis. This study
describes a novel method that integrates graph-based modeling with layered
hierarchical attention mechanisms, resulting in higher precision and better
comprehensiveness of video descriptions. To achieve this, we first encode the
spatio-temporal interdependencies between objects and actions with scene
graphs and, in a second step, combine this with a novel 3-level architecture
that creates a hierarchical attention mechanism using Graph Attention Networks
(GATs). The 3-level GAT architecture captures local as well as global
contextual elements. In this way, several descriptions of different semantic
complexity can be generated in parallel for the same video clip, enhancing the
discriminative accuracy of action recognition and action description. The
performance of our approach is empirically tested on several 2D and 3D
datasets. Compared to the state of the art, our method consistently achieves
better accuracy, precision, and contextual relevance for both action
recognition and description generation. In a large set of ablation experiments
we also assess the role of the different components of our model. With our
multi-level approach, the system produces descriptions at different semantic
depths, much like the varying levels of detail observed in descriptions made by
different people. Furthermore, the deeper insight into bimanual hand-object
interactions achieved by our model may advance the field of robotics, enabling
the emulation of intricate human actions with greater precision.
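
To make the architecture concrete, below is a minimal sketch of a 3-level graph-attention hierarchy over a per-frame scene graph, written with PyTorch Geometric. The layer sizes, node semantics (hands/objects), and the mean-pooling readout are illustrative assumptions, not the authors' implementation:

    # Hedged sketch of a 3-level graph-attention hierarchy over a video scene
    # graph, loosely following the paper's local -> contextual -> global idea.
    # All dimensions and node semantics are illustrative assumptions.
    import torch
    import torch.nn as nn
    from torch_geometric.nn import GATConv, global_mean_pool

    class ThreeLevelGAT(nn.Module):
        def __init__(self, in_dim=64, hid_dim=64, heads=4):
            super().__init__()
            # Level 1: attention over a hand/object and its direct neighbours.
            self.local_gat = GATConv(in_dim, hid_dim, heads=heads, concat=True)
            # Level 2: attention over the wider within-frame context.
            self.context_gat = GATConv(hid_dim * heads, hid_dim, heads=heads, concat=True)
            # Level 3: attention aggregating toward a clip-global representation.
            self.global_gat = GATConv(hid_dim * heads, hid_dim, heads=1, concat=False)

        def forward(self, x, edge_index, batch):
            h1 = torch.relu(self.local_gat(x, edge_index))
            h2 = torch.relu(self.context_gat(h1, edge_index))
            h3 = self.global_gat(h2, edge_index)
            # One pooled embedding per level, at increasing semantic granularity.
            return (global_mean_pool(h1, batch),
                    global_mean_pool(h2, batch),
                    global_mean_pool(h3, batch))

    # Toy scene graph for one frame: nodes 0-1 are hands, 2-3 are objects;
    # edges encode hypothetical spatial relations such as "touching".
    x = torch.randn(4, 64)                                   # node features
    edge_index = torch.tensor([[0, 2, 1, 3], [2, 0, 3, 1]])  # bidirectional pairs
    batch = torch.zeros(4, dtype=torch.long)                 # a single graph
    local_emb, ctx_emb, global_emb = ThreeLevelGAT()(x, edge_index, batch)

Each level yields an embedding of a different semantic granularity; in the setting described by the abstract, each could feed its own description head so that captions of increasing abstraction are generated in parallel.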
Related papers
- Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network [2.223052975765005]
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction.
The system represents the 2D or 3D spatial relations between humans and objects, derived from detection results in video data, as a graph.
We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
arXiv Detail & Related papers (2024-10-10T13:39:17Z) - ActNetFormer: Transformer-ResNet Hybrid Method for Semi-Supervised Action Recognition in Videos [4.736059095502584]
This work proposes a novel approach using Cross-Architecture Pseudo-Labeling with contrastive learning for semi-supervised action recognition.
We introduce a novel cross-architecture approach where 3D Convolutional Neural Networks (3D CNNs) and video transformers (VIT) are utilised to capture different aspects of action representations.
arXiv Detail & Related papers (2024-04-09T12:09:56Z) - Sim2Real Object-Centric Keypoint Detection and Description [40.58367357980036]
Keypoint detection and description play a central role in computer vision.
We propose an object-centric formulation, which additionally requires identifying which object each interest point belongs to.
We develop a sim2real contrastive learning mechanism that can generalize the model trained in simulation to real-world applications.
arXiv Detail & Related papers (2022-02-01T15:00:20Z) - Representing Videos as Discriminative Sub-graphs for Action Recognition [165.54738402505194]
We introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos.
We present the MUlti-scale Sub-graph LEarning (MUSLE) framework, which builds space-time graphs and clusters them into compact sub-graphs on each scale.
arXiv Detail & Related papers (2022-01-11T16:15:25Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better discriminate between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Spot What Matters: Learning Context Using Graph Convolutional Networks
for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Graph Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online multi-modal graph network approach (MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z) - Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
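
As a rough illustration of the object-aware distillation idea in the last entry, here is a minimal sketch of feature-level distillation over a variable number of detected objects. The teacher/student split and the MSE objective are assumptions for illustration, not the paper's exact mechanism:

    # Minimal sketch: distill per-object student features toward a teacher,
    # handling a variable number of objects per clip. Illustrative only.
    import torch
    import torch.nn.functional as F

    def object_distill_loss(student_feats, teacher_feats):
        """student_feats, teacher_feats: lists of [num_objects_i, dim] tensors,
        one per clip; num_objects_i varies from clip to clip."""
        losses = [F.mse_loss(s, t.detach())
                  for s, t in zip(student_feats, teacher_feats)]
        # Averaging per clip keeps clips with many objects from dominating.
        return torch.stack(losses).mean()

    # Example: two clips with 3 and 5 detected objects respectively.
    student = [torch.randn(3, 128, requires_grad=True),
               torch.randn(5, 128, requires_grad=True)]
    teacher = [torch.randn(3, 128), torch.randn(5, 128)]
    loss = object_distill_loss(student, teacher)
    loss.backward()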