Heatmap Pooling Network for Action Recognition from RGB Videos
- URL: http://arxiv.org/abs/2512.03837v1
- Date: Wed, 03 Dec 2025 14:36:59 GMT
- Title: Heatmap Pooling Network for Action Recognition from RGB Videos
- Authors: Mengyuan Liu, Jinfu Liu, Yongkang Jiang, Bin He,
- Abstract summary: Existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs.<n>We propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos.
- Score: 35.521306424947724
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human action recognition (HAR) in videos has garnered widespread attention due to the rich information in RGB videos. Nevertheless, existing methods for extracting deep features from RGB videos face challenges such as information redundancy, susceptibility to noise and high storage costs. To address these issues and fully harness the useful information in videos, we propose a novel heatmap pooling network (HP-Net) for action recognition from videos, which extracts information-rich, robust and concise pooled features of the human body in videos through a feedback pooling module. The extracted pooled features demonstrate obvious performance advantages over the previously obtained pose data and heatmap features from videos. In addition, we design a spatial-motion co-learning module and a text refinement modulation module to integrate the extracted pooled features with other multimodal data, enabling more robust action recognition. Extensive experiments on several benchmarks namely NTU RGB+D 60, NTU RGB+D 120, Toyota-Smarthome and UAV-Human consistently verify the effectiveness of our HP-Net, which outperforms the existing human action recognition methods. Our code is publicly available at: https://github.com/liujf69/HPNet-Action.
Related papers
- Human Activity Recognition using RGB-Event based Sensors: A Multi-modal Heat Conduction Model and A Benchmark Dataset [65.76480665062363]
Human Activity Recognition primarily relied on traditional RGB cameras to achieve high-performance activity recognition.<n>Challenges in real-world scenarios, such as insufficient lighting and rapid movements, inevitably degrade the performance of RGB cameras.<n>In this work, we rethink human activity recognition by combining the RGB and event cameras.
arXiv Detail & Related papers (2025-04-08T09:14:24Z) - SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval [82.51117533271517]
Previous works typically only encode RGB videos to obtain high-level semantic features.
Existing RGB-based sign retrieval works suffer from the huge memory cost of dense visual data embedding in end-to-end training.
We propose a novel sign language representation framework called Semantically Enhanced Dual-Stream.
arXiv Detail & Related papers (2024-07-23T11:31:11Z) - ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video SODOD (DSOD-100) dataset, which contains 100 videos within a total of 9,362 frames.
All the frames in each video are manually annotated to a high-quality saliency annotation.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net) for RGB-D salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - Adversarial Memory Networks for Action Prediction [95.09968654228372]
Action prediction aims to infer the forthcoming human action with partially-observed videos.
We propose adversarial memory networks (AMemNet) to generate the "full video" feature conditioning on a partial video query.
arXiv Detail & Related papers (2021-12-18T08:16:21Z) - Specificity-preserving RGB-D Saliency Detection [103.3722116992476]
We propose a specificity-preserving network (SP-Net) for RGB-D saliency detection.
Two modality-specific networks and a shared learning network are adopted to generate individual and shared saliency maps.
Experiments on six benchmark datasets demonstrate that our SP-Net outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2021-08-18T14:14:22Z) - Infrared and 3D skeleton feature fusion for RGB-D action recognition [0.30458514384586394]
We propose a modular network combining skeleton and infrared data.
A 2D convolutional network (CNN) is used as a pose module to extract features from skeleton data.
A 3D CNN is used as an infrared module to extract visual cues from videos.
arXiv Detail & Related papers (2020-02-28T17:42:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.