Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
- URL: http://arxiv.org/abs/2204.02148v2
- Date: Wed, 6 Apr 2022 12:34:28 GMT
- Title: Dual-AI: Dual-path Actor Interaction Learning for Group Activity Recognition
- Authors: Mingfei Han, David Junhao Zhang, Yali Wang, Rui Yan, Lina Yao, Xiaojun Chang, Yu Qiao
- Abstract summary: We propose a Dual-path Actor Interaction (Dual-AI) framework, which flexibly arranges spatial and temporal transformers.
We also introduce a novel Multi-scale Actor Contrastive Loss (MAC-Loss) between two interactive paths of Dual-AI.
Our Dual-AI can boost group activity recognition by fusing distinct discriminative features of different actors.
- Score: 103.62363658053557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning the spatial-temporal relations among multiple actors is crucial for group activity recognition. Different group activities often exhibit diversified interactions between actors in the video, so it is difficult to model complex group activities from a single view of spatial-temporal actor evolution. To tackle this problem, we propose a distinct Dual-path Actor Interaction (Dual-AI) framework, which flexibly arranges spatial and temporal transformers in two complementary orders, enhancing actor relations by integrating the merits of different spatio-temporal paths. Moreover, we introduce a novel Multi-scale Actor Contrastive Loss (MAC-Loss) between the two interactive paths of Dual-AI. Via self-supervised actor consistency at both the frame and video levels, MAC-Loss can effectively distinguish individual actor representations and reduce action confusion among different actors. Consequently, our Dual-AI can boost group activity recognition by fusing such discriminative features of different actors. To evaluate the proposed approach, we conduct extensive experiments on widely used benchmarks, including the Volleyball, Collective Activity, and NBA datasets. The proposed Dual-AI achieves state-of-the-art performance on all of these datasets. Notably, Dual-AI trained with 50% of the training data outperforms a number of recent approaches trained with 100% of the data, confirming the generalization power of Dual-AI for group activity recognition even under challenging scenarios of limited supervision.
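To make the dual-path arrangement concrete, the sketch below shows the two complementary orders over actor features of shape [batch, time, actors, dim]: one path applies spatial attention (over the actors within a frame) followed by temporal attention (over each actor's track), while the other applies the same blocks in the reverse order, and the outputs are fused. This is a minimal sketch, not the authors' implementation: the names (DualPath, spatial_attn, temporal_attn, frame_level_consistency), the dimensions, the sum-fusion, and the InfoNCE-style term standing in for the frame-level part of MAC-Loss are all illustrative assumptions.

```python
# Minimal sketch of the dual-path spatial/temporal arrangement (illustrative,
# not the authors' code). Actor features x have shape [B, T, N, D]: B clips,
# T frames, N actors per frame, D channels.
import torch
import torch.nn as nn
import torch.nn.functional as F


def spatial_attn(layer, x):
    # Attend over the N actors within each frame: fold (B, T) into the batch.
    B, T, N, D = x.shape
    return layer(x.reshape(B * T, N, D)).reshape(B, T, N, D)


def temporal_attn(layer, x):
    # Attend over the T frames of each actor track: fold (B, N) into the batch.
    B, T, N, D = x.shape
    x = x.transpose(1, 2).reshape(B * N, T, D)
    return layer(x).reshape(B, N, T, D).transpose(1, 2)


class DualPath(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        mk = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.s1, self.t1 = mk(), mk()  # spatial -> temporal path
        self.t2, self.s2 = mk(), mk()  # temporal -> spatial path

    def forward(self, x):  # x: [B, T, N, D]
        st = temporal_attn(self.t1, spatial_attn(self.s1, x))
        ts = spatial_attn(self.s2, temporal_attn(self.t2, x))
        return st, ts


def frame_level_consistency(st, ts, tau=0.1):
    # Hypothetical stand-in for the frame-level term of MAC-Loss: the same
    # actor in the same frame should match across the two paths (positive),
    # with the other actors in that frame as negatives (InfoNCE-style).
    B, T, N, D = st.shape
    a = F.normalize(st.reshape(B * T, N, D), dim=-1)
    b = F.normalize(ts.reshape(B * T, N, D), dim=-1)
    logits = torch.bmm(a, b.transpose(1, 2)) / tau      # [B*T, N, N]
    target = torch.arange(N, device=st.device).expand(B * T, N)
    return F.cross_entropy(logits.reshape(-1, N), target.reshape(-1))


# Usage: fuse the two paths for group-level recognition, add the auxiliary loss.
x = torch.randn(2, 8, 12, 256)            # 2 clips, 8 frames, 12 actors
st, ts = DualPath()(x)
fused = (st + ts).mean(dim=(1, 2))        # [B, D] group representation
loss_aux = frame_level_consistency(st, ts)
```

The design point mirrored here is that both paths consume the same actor features, so a cross-path consistency loss has well-defined positives (the same actor in the same frame) for free.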
Related papers
- Flow-Assisted Motion Learning Network for Weakly-Supervised Group Activity Recognition [21.482797499764093]
Weakly-Supervised Group Activity Recognition (WSGAR) aims to understand an activity performed together by a group of individuals using only video-level labels, without actor-level labels.
We propose the Flow-Assisted Motion Learning Network (Flaming-Net) for WSGAR, which includes a motion-aware encoder to extract actor features.
We demonstrate that Flaming-Net achieves new state-of-the-art WSGAR results on two benchmarks, including a 2.8%p higher MPCA score on the NBA dataset.
arXiv Detail & Related papers (2024-05-28T09:53:47Z) - The impact of Compositionality in Zero-shot Multi-label action recognition for Object-based tasks [4.971065912401385]
We propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition.
Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification.
We validate our method on the Charades dataset, which includes a majority of object-based actions.
arXiv Detail & Related papers (2024-05-14T15:28:48Z) - Spatio-Temporal Proximity-Aware Dual-Path Model for Panoramic Activity Recognition [19.813895376349613]
Panoramic Activity Recognition (PAR) seeks to identify human activities across different scales.
We propose the Social Proximity-aware Dual-Path Network (SPDP-Net), based on two key design principles.
SPDP-Net achieves new state-of-the-art performance with a 46.5% overall F1 score on the JRDB-PAR dataset.
arXiv Detail & Related papers (2024-03-21T03:56:24Z) - I$^2$MD: 3D Action Representation Learning with Inter- and Intra-modal Mutual Distillation [147.2183428328396]
We introduce a general Inter- and Intra-modal Mutual Distillation (I$^2$MD) framework.
In I$^2$MD, we first re-formulate the cross-modal interaction as a Cross-modal Mutual Distillation (CMD) process.
To alleviate the interference of similar samples and exploit their underlying contexts, we further design the Intra-modal Mutual Distillation (IMD) strategy.
arXiv Detail & Related papers (2023-10-24T07:22:17Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Self-Regulated Learning for Egocentric Video Activity Anticipation [147.9783215348252]
Self-Regulated Learning (SRL) aims to regulate the intermediate representation consecutively, producing a representation that emphasizes the novel information in the frame at the current time stamp.
SRL sharply outperforms the existing state-of-the-art in most cases on two egocentric video datasets and two third-person video datasets.
arXiv Detail & Related papers (2021-11-23T03:29:18Z) - Identity-aware Graph Memory Network for Action Detection [37.65846189707054]
We explicitly highlight the identity information of the actors in terms of both long-term and short-term context through a graph memory network.
Specifically, we propose the hierarchical graph neural network (IGNN) to comprehensively conduct long-term relation modeling.
We develop a dual attention module (DAM) to generate identity-aware constraints that reduce interference from actors of different identities.
arXiv Detail & Related papers (2021-08-26T02:34:55Z) - Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z) - Actor-Transformers for Group Activity Recognition [43.60866347282833]
This paper strives to recognize individual actions and group activities from videos.
We propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition.
arXiv Detail & Related papers (2020-03-28T07:21:58Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)