EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition
- URL: http://arxiv.org/abs/2408.05421v1
- Date: Sat, 10 Aug 2024 03:15:24 GMT
- Title: EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition
- Authors: Ahmed Abdelkawy, Asem Ali, Aly Farag,
- Abstract summary: We present an efficient pose-driven attention-guided multimodal action recognition (EPAM-Net) for action recognition in videos.
Specifically, we adapted X3D networks for both pose streams and network-temporal features from RGB videos and their skeleton sequences.
Our model provides a 6.2-9.9-x reduction in FLOPs (floating-point operation, in number of multiply-adds) and a 9--9.6x reduction in the number of network parameters.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multimodal-based human action recognition approaches are either computationally expensive, which limits their applicability in real-time scenarios, or fail to exploit the spatio-temporal information of multiple data modalities. In this work, we present an efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we adapted X3D networks for both RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. The skeleton features are then utilized to help the visual network stream focus on key frames and their salient spatial regions using a spatio-temporal attention block. Finally, the scores of the two streams of the proposed network are fused for final classification. The experimental results show that our method achieves competitive performance on the NTU RGB-D 60 and NTU RGB-D 120 benchmark datasets. Moreover, our model provides a 6.2--9.9x reduction in FLOPs (floating-point operations, in number of multiply-adds) and a 9--9.6x reduction in the number of network parameters. The code will be available at https://github.com/ahmed-nady/Multimodal-Action-Recognition.
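The sketch below illustrates the two-stream idea described in the abstract in PyTorch: an RGB stream and a pose stream, a pose-driven spatio-temporal attention block that re-weights the RGB features, and late score fusion. The tiny stand-in backbones, the attention formulation, and the heatmap-volume pose input are illustrative assumptions rather than the paper's exact implementation (the paper adapts X3D networks; see the linked repository for the actual code).

```python
# Minimal sketch of the two-stream architecture with pose-driven attention
# and late score fusion. Class names and backbones are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyBackbone3D(nn.Module):
    """Stand-in for an X3D-style spatio-temporal feature extractor."""
    def __init__(self, in_channels: int, feat_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                       # x: (B, C, T, H, W)
        return self.features(x)                 # (B, F, T, H, W)


class PoseDrivenAttention(nn.Module):
    """Spatio-temporal attention over RGB features, driven by pose features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.to_attn = nn.Conv3d(feat_dim, 1, kernel_size=1)

    def forward(self, rgb_feat, pose_feat):
        # Resize pose features to the RGB feature grid, form a (B, 1, T, H, W)
        # attention map, and re-weight the RGB stream so it focuses on
        # key frames and salient spatial regions (residual attention).
        pose_feat = F.interpolate(pose_feat, size=rgb_feat.shape[2:],
                                  mode="trilinear", align_corners=False)
        attn = torch.sigmoid(self.to_attn(pose_feat))
        return rgb_feat * attn + rgb_feat


class TwoStreamSketch(nn.Module):
    def __init__(self, num_classes: int = 60, feat_dim: int = 64):
        super().__init__()
        self.rgb_stream = TinyBackbone3D(in_channels=3, feat_dim=feat_dim)
        self.pose_stream = TinyBackbone3D(in_channels=3, feat_dim=feat_dim)
        self.attn = PoseDrivenAttention(feat_dim)
        self.rgb_head = nn.Linear(feat_dim, num_classes)
        self.pose_head = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb_clip, pose_clip):
        rgb_feat = self.rgb_stream(rgb_clip)
        pose_feat = self.pose_stream(pose_clip)
        rgb_feat = self.attn(rgb_feat, pose_feat)
        # Global average pooling, per-stream classifiers, then late fusion
        # of the two score vectors for the final prediction.
        rgb_scores = self.rgb_head(rgb_feat.mean(dim=(2, 3, 4)))
        pose_scores = self.pose_head(pose_feat.mean(dim=(2, 3, 4)))
        return rgb_scores + pose_scores


if __name__ == "__main__":
    model = TwoStreamSketch(num_classes=60)
    rgb = torch.randn(2, 3, 8, 56, 56)    # (batch, channels, frames, H, W)
    pose = torch.randn(2, 3, 8, 56, 56)   # skeletons assumed rendered as heatmap volumes
    print(model(rgb, pose).shape)         # torch.Size([2, 60])
```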
Related papers
- AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition [44.10959567844497]
This paper explores the unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm.
AdaFocusV3 can be effectively trained by approximating the non-differentiable cropping operation with the computation of deep features.
arXiv Detail & Related papers (2022-09-27T15:30:52Z) - A Multi-Stage Duplex Fusion ConvNet for Aerial Scene Classification [4.061135251278187]
We develop a ConvNet named multi-stage duplex fusion network (MSDF-Net).
MSDF-Net consists of multi-stage structures with DFblock.
Experiments are conducted on three widely-used aerial scene classification benchmarks.
arXiv Detail & Related papers (2022-03-29T09:27:53Z) - Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z) - Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z) - Time and Frequency Network for Human Action Detection in Videos [6.78349879472022]
We propose an end-to-end network that considers the time and frequency features simultaneously, named TFNet.
To obtain the action patterns, these two features are deeply fused under the attention mechanism.
arXiv Detail & Related papers (2021-03-08T11:42:05Z) - Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform state-of-the-art methods for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z) - Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z) - Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
arXiv Detail & Related papers (2020-04-07T17:17:23Z)