Weakly-Supervised Action Localization and Action Recognition using
Global-Local Attention of 3D CNN
- URL: http://arxiv.org/abs/2012.09542v1
- Date: Thu, 17 Dec 2020 12:29:16 GMT
- Title: Weakly-Supervised Action Localization and Action Recognition using
Global-Local Attention of 3D CNN
- Authors: Novanto Yudistira, Muthu Subash Kavitha, Takio Kurita
- Abstract summary: A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information from 3D data such as video sequences.
We propose two approaches to improve the visual explanations and classification in 3D CNNs.
- Score: 4.924442315857227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A 3D Convolutional Neural Network (3D CNN) captures spatial and
temporal information from 3D data such as video sequences. However, due to the
convolution and pooling mechanisms, information loss is unavoidable. To improve
the visual explanations and classification in 3D CNNs, we propose two
approaches: i) aggregate layer-wise global-to-local (global-local) discrete
gradients using a trained 3DResNext network, and ii) implement an attention
gating network to improve the accuracy of action recognition. The proposed
approach demonstrates the usefulness of every layer, termed global-local
attention, in 3D CNNs via visual attribution, weakly-supervised action
localization, and action recognition. First, the 3DResNext is trained and
applied for action classification using backpropagation with respect to the
maximum predicted class. The gradients and activations of every layer are then
up-sampled, and aggregation is used to produce more nuanced attention that
highlights the most critical regions of the input videos for the predicted
class. Contour thresholding of the final attention map yields the final
localization. We evaluate spatial and temporal action localization in trimmed
videos using fine-grained visual explanations via 3DCam. Experimental results
show that the proposed approach produces informative visual explanations and
discriminative attention. Furthermore, action recognition via attention gating
on each layer produces better classification results than the baseline model.
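The layer-wise aggregation described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the function names, the nearest-neighbour upsampling, and the mean aggregation rule are all assumptions made for the sketch.

```python
import numpy as np

def layer_attention(activations, gradients):
    """Grad-CAM-style map for one 3D-CNN layer (assumed formulation):
    channel weights = global-average-pooled gradients,
    attention = ReLU(weighted sum of activation channels)."""
    # activations, gradients: (C, T, H, W)
    weights = gradients.mean(axis=(1, 2, 3))          # (C,)
    cam = np.tensordot(weights, activations, axes=1)  # (T, H, W)
    return np.maximum(cam, 0.0)

def upsample_nearest(cam, target_shape):
    """Nearest-neighbour upsampling of a (T, H, W) map.
    Assumes target dims are integer multiples of the map dims."""
    for axis, (t, s) in enumerate(zip(target_shape, cam.shape)):
        cam = np.repeat(cam, t // s, axis=axis)
    return cam

def global_local_attention(per_layer, input_shape):
    """Aggregate layer-wise maps into one global-local attention map,
    here by simple averaging, then normalize to [0, 1]."""
    maps = [upsample_nearest(layer_attention(a, g), input_shape)
            for a, g in per_layer]
    agg = np.mean(maps, axis=0)
    return agg / (agg.max() + 1e-8)
```

The final localization step would then threshold this map and extract contours, e.g. with a fixed cutoff and a contour-finding routine.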
Related papers
- Generalized Label-Efficient 3D Scene Parsing via Hierarchical Feature Aligned Pre-Training and Region-Aware Fine-tuning [55.517000360348725]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
Experiments with both indoor and outdoor scenes demonstrated the effectiveness of our approach in both data-efficient learning and open-world few-shot learning.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- ALSTER: A Local Spatio-Temporal Expert for Online 3D Semantic Reconstruction [62.599588577671796]
We propose an online 3D semantic segmentation method that incrementally reconstructs a 3D semantic map from a stream of RGB-D frames.
Unlike offline methods, ours is directly applicable to scenarios with real-time constraints, such as robotics or mixed reality.
arXiv Detail & Related papers (2023-11-29T20:30:18Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Anchor-Based Spatial-Temporal Attention Convolutional Networks for Dynamic 3D Point Cloud Sequences [20.697745449159097]
Anchor-based Spatial-Temporal Attention Convolution operation (ASTAConv) is proposed in this paper to process dynamic 3D point cloud sequences.
The proposed convolution operation builds a regular receptive field around each point by setting several virtual anchors around each point.
The proposed method makes better use of the structured information within the local region and learns spatial-temporal embedding features from dynamic 3D point cloud sequences.
arXiv Detail & Related papers (2020-12-20T07:35:37Z)
- Self-supervised Video Representation Learning by Uncovering Spatio-temporal Statistics [74.6968179473212]
This paper proposes a novel pretext task to address the self-supervised learning problem.
We compute a series of spatio-temporal statistical summaries, such as the spatial location and dominant direction of the largest motion.
A neural network is built and trained to yield the statistical summaries given the video frames as inputs.
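As a rough illustration of one such pretext target, the spatial location of the largest motion, the following sketch partitions the frames into a grid and picks the block with the most motion. This is not the paper's exact definition; the grid partition and the mean-absolute-frame-difference motion measure are assumptions made for the example.

```python
import numpy as np

def largest_motion_location(frames, grid=3):
    """Toy pretext target: divide the frame area into a grid x grid
    partition, measure motion per block as summed absolute differences
    between consecutive frames, and return the (row, col) of the block
    with the largest accumulated motion."""
    # frames: (T, H, W) grayscale video clip
    diff = np.abs(np.diff(frames.astype(float), axis=0)).sum(axis=0)  # (H, W)
    h, w = diff.shape
    bh, bw = h // grid, w // grid
    # Sum motion energy inside each block of the grid
    energy = diff[:bh * grid, :bw * grid].reshape(grid, bh, grid, bw).sum(axis=(1, 3))
    r, c = np.unravel_index(energy.argmax(), energy.shape)
    return (int(r), int(c))
```

The pretext network would then be trained to regress such summaries from the raw frames, so that solving the task forces it to attend to motion.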
arXiv Detail & Related papers (2020-08-31T08:31:56Z)
- DH3D: Deep Hierarchical 3D Descriptors for Robust Large-Scale 6DoF Relocalization [56.15308829924527]
We propose a Siamese network that jointly learns 3D local feature detection and description directly from raw 3D points.
For detecting 3D keypoints we predict the discriminativeness of the local descriptors in an unsupervised manner.
Experiments on various benchmarks demonstrate that our method achieves competitive results for both global point cloud retrieval and local point cloud registration.
arXiv Detail & Related papers (2020-07-17T20:21:22Z)
- D3Feat: Joint Learning of Dense Detection and Description of 3D Local Features [51.04841465193678]
We leverage a 3D fully convolutional network for 3D point clouds.
We propose a novel and practical learning mechanism that densely predicts both a detection score and a description feature for each 3D point.
Our method achieves state-of-the-art results in both indoor and outdoor scenarios.
arXiv Detail & Related papers (2020-03-06T12:51:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.