Home Action Genome: Cooperative Compositional Action Understanding
- URL: http://arxiv.org/abs/2105.05226v1
- Date: Tue, 11 May 2021 17:42:47 GMT
- Title: Home Action Genome: Cooperative Compositional Action Understanding
- Authors: Nishant Rai, Haofeng Chen, Jingwei Ji, Rishi Desai, Kazuki Kozuka,
Shun Ishizaka, Ehsan Adeli, Juan Carlos Niebles
- Abstract summary: Existing research on action recognition treats activities as monolithic events occurring in videos.
Cooperative Compositional Action Understanding (CCAU) is a cooperative learning framework for hierarchical action recognition.
We demonstrate the utility of co-learning compositions in few-shot action recognition by achieving 28.6% mAP with just a single sample.
- Score: 33.69990813932372
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing research on action recognition treats activities as monolithic
events occurring in videos. Recently, the benefits of formulating actions as a
combination of atomic-actions have shown promise in improving action
understanding with the emergence of datasets containing such annotations,
allowing us to learn representations capturing this information. However, there
remains a lack of studies that extend action composition and leverage multiple
viewpoints and multiple modalities of data for representation learning. To
promote research in this direction, we introduce Home Action Genome (HOMAGE): a
multi-view action dataset with multiple modalities and view-points supplemented
with hierarchical activity and atomic action labels together with dense scene
composition labels. Leveraging rich multi-modal and multi-view settings, we
propose Cooperative Compositional Action Understanding (CCAU), a cooperative
learning framework for hierarchical action recognition that is aware of
compositional action elements. CCAU shows consistent performance improvements
across all modalities. Furthermore, we demonstrate the utility of co-learning
compositions in few-shot action recognition by achieving 28.6% mAP with just a
single sample.
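  To make the co-learning idea above concrete, the following is a minimal PyTorch sketch of hierarchical co-training: a shared clip encoder feeds two heads, one predicting the high-level activity label and one predicting the atomic-action labels, and the two losses are summed. This is only an illustration under assumed class counts, feature dimensions, and loss weighting; it is not the authors' CCAU implementation, which additionally exploits multiple views and modalities cooperatively.
```python
# Minimal sketch (not the authors' CCAU code): co-training a video-level
# activity head and an atomic-action head on a shared clip encoder, i.e. the
# basic idea behind hierarchical/compositional co-learning. Module names,
# feature dimensions, class counts, and the loss weight are illustrative
# assumptions, not values taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalActionModel(nn.Module):
    def __init__(self, in_dim=2048, feat_dim=512, num_activities=10, num_atomic=50):
        super().__init__()
        # Stand-in for a real video backbone (e.g. a 3D CNN or video
        # transformer per modality/view); here a single linear projection.
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        self.activity_head = nn.Linear(feat_dim, num_activities)  # one activity per video
        self.atomic_head = nn.Linear(feat_dim, num_atomic)        # multi-label atomic actions

    def forward(self, clip_features):
        z = self.encoder(clip_features)
        return self.activity_head(z), self.atomic_head(z)


def co_learning_loss(act_logits, atom_logits, act_target, atom_target, alpha=0.5):
    """Combine a single-label activity loss with a multi-label atomic-action loss."""
    act_loss = F.cross_entropy(act_logits, act_target)
    atom_loss = F.binary_cross_entropy_with_logits(atom_logits, atom_target)
    return act_loss + alpha * atom_loss


# Toy usage with random features standing in for encoded video clips.
model = HierarchicalActionModel()
feats = torch.randn(4, 2048)
act_logits, atom_logits = model(feats)
loss = co_learning_loss(
    act_logits, atom_logits,
    act_target=torch.randint(0, 10, (4,)),
    atom_target=torch.randint(0, 2, (4, 50)).float(),
)
loss.backward()
```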
Related papers
- Compositional Learning in Transformer-Based Human-Object Interaction Detection [6.630793383852106]
Long-tailed distribution of labeled instances is a primary challenge in HOI detection.
Inspired by the nature of HOI triplets, some existing approaches adopt the idea of compositional learning.
We propose a transformer-based framework for compositional HOI learning.
arXiv Detail & Related papers (2023-08-11T06:41:20Z)
- Language-free Compositional Action Generation via Decoupling Refinement [67.50452446686725]
We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
arXiv Detail & Related papers (2023-07-07T12:00:38Z)
- COMPOSER: Compositional Learning of Group Activity in Videos [33.526331969279106]
Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip.
We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale.
COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.
arXiv Detail & Related papers (2021-12-11T01:25:46Z)
- Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn the action segments using only the high-level activity labels as input.
We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Interactive Fusion of Multi-level Features for Compositional Activity Recognition [100.75045558068874]
We present a novel framework that accomplishes compositional activity recognition by interactively fusing multi-level features.
We implement the framework in three steps, namely, positional-to-appearance feature extraction, semantic feature interaction, and semantic-to-positional prediction.
We evaluate our approach on two action recognition datasets, Something-Something and Charades.
arXiv Detail & Related papers (2020-12-10T14:17:18Z)
- SAFCAR: Structured Attention Fusion for Compositional Action Recognition [47.43959215267547]
We develop and test a novel Structured Attention Fusion (SAF) self-attention mechanism to combine information from object detections.
We show that our approach recognizes novel verb-noun compositions more effectively than current state-of-the-art systems.
We validate our approach on the challenging Something-Else tasks from the Something-Something-V2 dataset.
arXiv Detail & Related papers (2020-12-03T17:45:01Z)
- Pose And Joint-Aware Action Recognition [87.4780883700755]
We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder.
Our joint selector module re-weights the joint information to select the most discriminative joints for the task.
We show large improvements over current state-of-the-art joint-based approaches on the JHMDB, HMDB, Charades, and AVA action recognition datasets.
arXiv Detail & Related papers (2020-10-16T04:43:34Z)
- Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects attention differences among the multiple views and adaptively integrates frame-level information so that the views benefit from each other.
Experiments on four action datasets illustrate that the proposed CAM achieves better results for each view and also boosts overall multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.