EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition
- URL: http://arxiv.org/abs/2408.05421v1
- Date: Sat, 10 Aug 2024 03:15:24 GMT
- Title: EPAM-Net: An Efficient Pose-driven Attention-guided Multimodal Network for Video Action Recognition
- Authors: Ahmed Abdelkawy, Asem Ali, Aly Farag,
- Abstract summary: We present an efficient pose-driven attention-guided multimodal action recognition (EPAM-Net) for action recognition in videos.
Specifically, we adapted X3D networks for both pose streams and network-temporal features from RGB videos and their skeleton sequences.
Our model provides a 6.2-9.9-x reduction in FLOPs (floating-point operation, in number of multiply-adds) and a 9--9.6x reduction in the number of network parameters.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing multimodal-based human action recognition approaches are either computationally expensive, which limits their applicability in real-time scenarios, or fail to exploit the spatio-temporal information of multiple data modalities. In this work, we present an efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we adapted X3D networks for both RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. The skeleton features are then utilized to help the visual network stream focus on key frames and their salient spatial regions using a spatio-temporal attention block. Finally, the scores of the two streams of the proposed network are fused for final classification. The experimental results show that our method achieves competitive performance on the NTU RGB-D 60 and NTU RGB-D 120 benchmark datasets. Moreover, our model provides a 6.2--9.9x reduction in FLOPs (floating-point operations, in number of multiply-adds) and a 9--9.6x reduction in the number of network parameters. The code will be available at https://github.com/ahmed-nady/Multimodal-Action-Recognition.
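The sketch below illustrates the two-stream idea described in the abstract in PyTorch: an RGB stream and a pose stream, a pose-driven spatio-temporal attention block that re-weights the RGB features, and late score fusion. The tiny stand-in backbones, the attention formulation, and the heatmap-volume pose input are illustrative assumptions rather than the paper's exact implementation (the paper adapts X3D networks; see the linked repository for the actual code).

```python
# Minimal sketch of the two-stream architecture with pose-driven attention
# and late score fusion. Class names and backbones are hypothetical stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyBackbone3D(nn.Module):
    """Stand-in for an X3D-style spatio-temporal feature extractor."""
    def __init__(self, in_channels: int, feat_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.BatchNorm3d(feat_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):                       # x: (B, C, T, H, W)
        return self.features(x)                 # (B, F, T, H, W)


class PoseDrivenAttention(nn.Module):
    """Spatio-temporal attention over RGB features, driven by pose features."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.to_attn = nn.Conv3d(feat_dim, 1, kernel_size=1)

    def forward(self, rgb_feat, pose_feat):
        # Resize pose features to the RGB feature grid, form a (B, 1, T, H, W)
        # attention map, and re-weight the RGB stream so it focuses on
        # key frames and salient spatial regions (residual attention).
        pose_feat = F.interpolate(pose_feat, size=rgb_feat.shape[2:],
                                  mode="trilinear", align_corners=False)
        attn = torch.sigmoid(self.to_attn(pose_feat))
        return rgb_feat * attn + rgb_feat


class TwoStreamSketch(nn.Module):
    def __init__(self, num_classes: int = 60, feat_dim: int = 64):
        super().__init__()
        self.rgb_stream = TinyBackbone3D(in_channels=3, feat_dim=feat_dim)
        self.pose_stream = TinyBackbone3D(in_channels=3, feat_dim=feat_dim)
        self.attn = PoseDrivenAttention(feat_dim)
        self.rgb_head = nn.Linear(feat_dim, num_classes)
        self.pose_head = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb_clip, pose_clip):
        rgb_feat = self.rgb_stream(rgb_clip)
        pose_feat = self.pose_stream(pose_clip)
        rgb_feat = self.attn(rgb_feat, pose_feat)
        # Global average pooling, per-stream classifiers, then late fusion
        # of the two score vectors for the final prediction.
        rgb_scores = self.rgb_head(rgb_feat.mean(dim=(2, 3, 4)))
        pose_scores = self.pose_head(pose_feat.mean(dim=(2, 3, 4)))
        return rgb_scores + pose_scores


if __name__ == "__main__":
    model = TwoStreamSketch(num_classes=60)
    rgb = torch.randn(2, 3, 8, 56, 56)    # (batch, channels, frames, H, W)
    pose = torch.randn(2, 3, 8, 56, 56)   # skeletons assumed rendered as heatmap volumes
    print(model(rgb, pose).shape)         # torch.Size([2, 60])
```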
Related papers
- AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition [44.10959567844497]
This paper explores the unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm.
AdaFocusV3 can be effectively trained by approximating the non-differentiable cropping operation with the computation of deep features.
arXiv Detail & Related papers (2022-09-27T15:30:52Z) - A Multi-Stage Duplex Fusion ConvNet for Aerial Scene Classification [4.061135251278187]
We develop a ConvNet named multi-stage duplex fusion network (MSDF-Net).
MSDF-Net consists of multi-stage structures with DFblock.
Experiments are conducted on three widely-used aerial scene classification benchmarks.
arXiv Detail & Related papers (2022-03-29T09:27:53Z) - Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z) - Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z) - Time and Frequency Network for Human Action Detection in Videos [6.78349879472022]
We propose an end-to-end network that considers the time and frequency features simultaneously, named TFNet.
To obtain the action patterns, these two features are deeply fused under the attention mechanism.
arXiv Detail & Related papers (2021-03-08T11:42:05Z) - Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform state-of-the-art methods for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z) - Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z) - Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z) - Temporal Pyramid Network for Action Recognition [129.12076009042622]
We propose a generic Temporal Pyramid Network (TPN) at the feature-level, which can be flexibly integrated into 2D or 3D backbone networks.
TPN shows consistent improvements over other challenging baselines on several action recognition datasets.
arXiv Detail & Related papers (2020-04-07T17:17:23Z)