Regional Attention with Architecture-Rebuilt 3D Network for RGB-D
Gesture Recognition
- URL: http://arxiv.org/abs/2102.05348v1
- Date: Wed, 10 Feb 2021 09:36:00 GMT
- Title: Regional Attention with Architecture-Rebuilt 3D Network for RGB-D
Gesture Recognition
- Authors: Benjia Zhou, Yunan Li and Jun Wan
- Abstract summary: We propose a regional attention with architecture-rebuilt 3D network (RAAR3DNet) for gesture recognition.
We replace the fixed Inception modules with the automatically rebuilt structure through the network via Neural Architecture Search (NAS).
It enables the network to capture different levels of feature representations at different layers more adaptively.
- Score: 7.475025465262353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human gesture recognition has drawn much attention in the area of computer
vision. However, the performance of gesture recognition is always influenced by
some gesture-irrelevant factors like the background and the clothes of
performers. Therefore, focusing on the hand/arm regions is important for
gesture recognition. Meanwhile, a more adaptive architecture-searched network
structure can also perform better than fixed-block ones like ResNet, since
it better increases the diversity of features in different stages of the
network. In this paper, we propose a regional attention with
architecture-rebuilt 3D network (RAAR3DNet) for gesture recognition. We replace
the fixed Inception modules with the automatically rebuilt structure through
the network via Neural Architecture Search (NAS), owing to the different shape
and representation ability of features in the early, middle, and late stage of
the network. It enables the network to capture different levels of feature
representations at different layers more adaptively. Meanwhile, we also design
a stackable regional attention module called dynamic-static Attention (DSA),
which derives a Gaussian guidance heatmap and dynamic motion map to highlight
the hand/arm regions and the motion information in the spatial and temporal
domains, respectively. Extensive experiments on two recent large-scale RGB-D
gesture datasets validate the effectiveness of the proposed method and show it
outperforms state-of-the-art methods. The code for our method is available at:
https://github.com/zhoubenjia/RAAR3DNet.
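The abstract describes DSA as combining a Gaussian guidance heatmap (spatial) with a dynamic motion map (temporal). The following is a minimal NumPy sketch of that general idea, not the authors' implementation: it assumes hand/arm centers are already available (for example from a keypoint detector), substitutes plain frame differencing for the motion map, and uses hypothetical function names (`gaussian_heatmap`, `motion_map`, `apply_attention`) that do not come from the paper or its repository.

```python
import numpy as np


def gaussian_heatmap(height, width, centers, sigma=8.0):
    """Static spatial guidance: pixel-wise max over 2D Gaussians.

    `centers` is a list of (row, col) positions, assumed to be hand/arm
    locations supplied by some external detector (an assumption here).
    """
    rows = np.arange(height, dtype=np.float32)[:, None]
    cols = np.arange(width, dtype=np.float32)[None, :]
    heat = np.zeros((height, width), dtype=np.float32)
    for r, c in centers:
        dist_sq = (rows - r) ** 2 + (cols - c) ** 2
        heat = np.maximum(heat, np.exp(-dist_sq / (2.0 * sigma ** 2)))
    return heat


def motion_map(clip):
    """Dynamic temporal guidance from absolute frame differences.

    `clip` has shape (T, H, W). The result has the same shape, with each
    frame scaled to [0, 1]; the first difference is repeated so that the
    output keeps T frames.
    """
    diff = np.abs(np.diff(clip.astype(np.float32), axis=0))  # (T-1, H, W)
    diff = np.concatenate([diff[:1], diff], axis=0)          # pad back to T
    peak = diff.max(axis=(1, 2), keepdims=True)
    return diff / np.maximum(peak, 1e-6)                     # static frames stay 0


def apply_attention(features, heat, motion):
    """Residual reweighting: keep the original features and boost
    locations highlighted by either map (an illustrative choice)."""
    return features * (1.0 + heat[None] + motion)


# Toy usage: a 3-frame clip in which one bright pixel moves.
clip = np.zeros((3, 16, 16), dtype=np.float32)
clip[1, 4, 4] = 1.0
clip[2, 5, 5] = 1.0
heat = gaussian_heatmap(16, 16, centers=[(4, 4)], sigma=2.0)
attended = apply_attention(clip, heat, motion_map(clip))
```

The residual form `1 + heat + motion` is chosen so that regions outside the highlighted areas are left unchanged rather than suppressed to zero; the paper's module is learned and stackable, and may combine the two maps differently.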
Related papers
- WiFi-based Cross-Domain Gesture Recognition Using Attention Mechanism [61.79272554643873]
We propose a gesture recognition network that integrates a multi-semantic attention mechanism with a self-attention-based channel mechanism. The results show that it not only maintains high in-domain accuracy of 99.72%, but also achieves high performance in cross-domain recognition of 97.61%.
arXiv Detail & Related papers (2025-12-04T07:09:13Z) - GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet). It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z) - ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic
Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z) - SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and
Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality within 2D-3D networks-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z) - Spatio-Temporal Representation Factorization for Video-based Person
Re-Identification [55.01276167336187]
We propose Spatio-Temporal Representation Factorization module (STRF) for re-ID.
STRF is a flexible new computational unit that can be used in conjunction with most existing 3D convolutional neural network architectures for re-ID.
We empirically show that STRF improves performance of various existing baseline architectures while demonstrating new state-of-the-art results.
arXiv Detail & Related papers (2021-07-25T19:29:37Z) - Joint Learning of Neural Transfer and Architecture Adaptation for Image
Recognition [77.95361323613147]
Current state-of-the-art visual recognition systems rely on pretraining a neural network on a large-scale dataset and finetuning the network weights on a smaller dataset.
In this work, we prove that dynamically adapting network architectures tailored for each domain task along with weight finetuning benefits in both efficiency and effectiveness.
Our method can be easily generalized to an unsupervised paradigm by replacing supernet training with self-supervised learning in the source domain tasks and performing linear evaluation in the downstream tasks.
arXiv Detail & Related papers (2021-03-31T08:15:17Z) - Neural-Pull: Learning Signed Distance Functions from Point Clouds by
Learning to Pull Space onto Surfaces [68.12457459590921]
Reconstructing continuous surfaces from 3D point clouds is a fundamental operation in 3D geometry processing.
We introduce Neural-Pull, a new approach that is simple and leads to high quality SDFs.
arXiv Detail & Related papers (2020-11-26T23:18:10Z) - Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for
Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
arXiv Detail & Related papers (2020-08-21T10:45:09Z) - Directional Temporal Modeling for Action Recognition [24.805397801876687]
We introduce a channel independent directional convolution (CIDC) operation, which learns to model the temporal evolution among local features.
Our CIDC network can be attached to any activity recognition backbone network.
arXiv Detail & Related papers (2020-07-21T18:49:57Z) - Short-Term Temporal Convolutional Networks for Dynamic Hand Gesture
Recognition [23.054444026402738]
We present a multimodal gesture recognition method based on 3D densely convolutional networks (3D-DenseNets) and improved temporal convolutional networks (TCNs).
In spatial analysis, we adopt 3D-DenseNets to learn short-term-temporal features effectively.
In temporal analysis, we use TCNs to extract temporal features and employ improved Squeeze-and-Excitation Networks (SENets) to strengthen the representational power of temporal features from each TCN layer.
arXiv Detail & Related papers (2019-12-31T23:30:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.