Related papers: Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition

Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition

URL: http://arxiv.org/abs/2102.05348v1
Date: Wed, 10 Feb 2021 09:36:00 GMT
Title: Regional Attention with Architecture-Rebuilt 3D Network for RGB-D Gesture Recognition
Authors: Benjia Zhou, Yunan Li and Jun Wan
Abstract summary: We propose a regional attention with architecture-rebuilt 3D network (RAAR3DNet) for gesture recognition. We replace the fixed Inception modules with the automatically rebuilt structure through the network via Neural Architecture Search (NAS) It enables the network to capture different levels of feature representations at different layers more adaptively.
Score: 7.475025465262353
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human gesture recognition has drawn much attention in the area of computer vision. However, the performance of gesture recognition is always influenced by some gesture-irrelevant factors like the background and the clothes of performers. Therefore, focusing on the regions of hand/arm is important to the gesture recognition. Meanwhile, a more adaptive architecture-searched network structure can also perform better than the block-fixed ones like Resnet since it increases the diversity of features in different stages of the network better. In this paper, we propose a regional attention with architecture-rebuilt 3D network (RAAR3DNet) for gesture recognition. We replace the fixed Inception modules with the automatically rebuilt structure through the network via Neural Architecture Search (NAS), owing to the different shape and representation ability of features in the early, middle, and late stage of the network. It enables the network to capture different levels of feature representations at different layers more adaptively. Meanwhile, we also design a stackable regional attention module called dynamic-static Attention (DSA), which derives a Gaussian guidance heatmap and dynamic motion map to highlight the hand/arm regions and the motion information in the spatial and temporal domains, respectively. Extensive experiments on two recent large-scale RGB-D gesture datasets validate the effectiveness of the proposed method and show it outperforms state-of-the-art methods. The codes of our method are available at: https://github.com/zhoubenjia/RAAR3DNet.

Related papers

WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism [61.79272554643873]
We propose a gesture recognition network that integrates a multi-semantic attention mechanism with a self-attention-based channel mechanism.<n>The results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%.
arXiv Detail & Related papers (2025-12-04T07:09:13Z)
GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet)<n>It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation.<n>It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames. Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences. It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping. Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z)
Spatio-Temporal Representation Factorization for Video-based Person Re-Identification [55.01276167336187]
We propose Spatio-Temporal Representation Factorization module (STRF) for re-ID. STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID. We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
arXiv Detail & Related papers (2021-07-25T19:29:37Z)
Joint Learning of Neural Transfer and Architecture Adaptation for Image Recognition [77.95361323613147]
Current state-of-the-art visual recognition systems rely on pretraining a neural network on a large-scale dataset and finetuning the network weights on a smaller dataset. In this work, we prove that dynamically adapting network architectures tailored for each domain task along with weight finetuning benefits in both efficiency and effectiveness. Our method can be easily generalized to an unsupervised paradigm by replacing supernet training with self-supervised learning in the source domain tasks and performing linear evaluation in the downstream tasks.
arXiv Detail & Related papers (2021-03-31T08:15:17Z)
Neural-Pull: Learning Signed Distance Functions from Point Clouds by Learning to Pull Space onto Surfaces [68.12457459590921]
Reconstructing continuous surfaces from 3D point clouds is a fundamental operation in 3D geometry processing. We introduce textitNeural-Pull, a new approach that is simple and leads to high quality SDFs.
arXiv Detail & Related papers (2020-11-26T23:18:10Z)
Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition. The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and optimized backbones for multi-modal-rate branches and lateral connections. The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
Directional Temporal Modeling for Action Recognition [24.805397801876687]
We introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features. Our CIDC network can be attached to any activity recognition backbone network.
arXiv Detail & Related papers (2020-07-21T18:49:57Z)
Short-Term Temporal Convolutional Networks for Dynamic Hand Gesture Recognition [23.054444026402738]
We present a multimodal gesture recognition method based on 3D densely convolutional networks (3D-DenseNets) and improved temporal convolutional networks (TCNs) In spatial analysis, we adopt 3D-DenseNets to learn short-term-temporal features effectively. In temporal analysis, we use TCNs to extract temporal features and employ improved Squeeze-and-Excitation Networks (SENets) to strengthen the representational power of temporal features from each TCNs' layers.
arXiv Detail & Related papers (2019-12-31T23:30:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.