HAA4D: Few-Shot Human Atomic Action Recognition via 3D Spatio-Temporal
Skeletal Alignment
- URL: http://arxiv.org/abs/2202.07308v1
- Date: Tue, 15 Feb 2022 10:55:21 GMT
- Title: HAA4D: Few-Shot Human Atomic Action Recognition via 3D Spatio-Temporal
Skeletal Alignment
- Authors: Mu-Ruei Tseng, Abhishek Gupta, Chi-Keung Tang, Yu-Wing Tai
- Abstract summary: This paper proposes HAA4D, a new 4D dataset consisting of more than 3,300 videos in 300 human atomic action classes.
The choice of atomic actions makes annotation even easier, because each video clip lasts for only a few seconds.
All training and testing 3D skeletons in HAA4D are globally aligned to the same global space using a deep alignment model.
- Score: 62.77491613638775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human actions involve complex pose variations and their 2D projections can be
highly ambiguous. Thus 3D spatio-temporal or 4D (i.e., 3D+T) human skeletons,
which are photometric and viewpoint invariant, are an excellent alternative to
2D+T skeletons/pixels to improve action recognition accuracy. This paper
proposes a new 4D dataset, HAA4D, which consists of more than 3,300 RGB videos
in 300 human atomic action classes. HAA4D is clean, diverse, and class-balanced,
and each class is viewpoint-balanced through the use of 4D skeletons; as few as
one 4D skeleton per class is sufficient for training a deep recognition model.
Further, the choice of atomic actions makes annotation even easier,
because each video clip lasts for only a few seconds. All training and testing
3D skeletons in HAA4D are globally aligned to the same global space using a
deep alignment model, so that each skeleton faces the negative z-direction.
Such alignment makes matching skeletons more stable by reducing intraclass
variation, so fewer training samples per class are needed for action
recognition. Given the high diversity and skeletal alignment in HAA4D, we
construct the first baseline few-shot 4D human atomic action recognition
network without bells and whistles, which achieves comparable or higher
performance than relevant state-of-the-art techniques that rely on
embedding-space encoding without explicit skeletal alignment, using the same
small number of training samples from unseen classes.
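For intuition about the alignment convention described in the abstract, the
following sketch rotates a single 3D skeleton about the vertical axis so that
its estimated facing direction points toward the negative z-axis. This is a
minimal geometric stand-in written with NumPy, not the paper's deep alignment
model; the joint indices and the torso-normal heuristic for the facing
direction are assumptions made only for this example.

```python
import numpy as np

# Hypothetical joint indices for this example; HAA4D's actual skeleton
# topology is defined by the dataset, not reproduced here.
HIP_L, HIP_R, SHOULDER_L, SHOULDER_R = 11, 12, 5, 6

def align_skeleton_to_negative_z(joints):
    """Centre a (J, 3) skeleton at the mid-hip and rotate it about the
    y-axis so its estimated facing direction points toward (0, 0, -1)."""
    pelvis = 0.5 * (joints[HIP_L] + joints[HIP_R])
    centred = joints - pelvis

    # Facing direction ~ normal of the torso plane spanned by the hip axis
    # and the spine (mid-shoulder relative to mid-hip).
    hip_axis = centred[HIP_R] - centred[HIP_L]
    spine = 0.5 * (centred[SHOULDER_L] + centred[SHOULDER_R])
    facing = np.cross(hip_axis, spine)
    facing[1] = 0.0                          # restrict to a rotation about y
    facing = facing / (np.linalg.norm(facing) + 1e-8)

    target = np.array([0.0, 0.0, -1.0])
    cos_t = np.dot(facing, target)           # cosine of the rotation angle
    sin_t = np.cross(facing, target)[1]      # signed sine about the y-axis
    rot_y = np.array([[cos_t, 0.0, sin_t],
                      [0.0,   1.0, 0.0],
                      [-sin_t, 0.0, cos_t]])
    return centred @ rot_y.T

def align_sequence(seq):
    """Apply the per-frame alignment to a (T, J, 3) skeleton sequence."""
    return np.stack([align_skeleton_to_negative_z(frame) for frame in seq])
```

In HAA4D itself the transformation is predicted by a learned alignment network
rather than a hand-crafted heuristic, but the target convention is the same:
centred skeletons that all face the negative z-direction, which reduces
intraclass variation before skeletons from different videos are matched.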
Related papers
- Segment Any 4D Gaussians [69.53172192552508]
We propose Segment Any 4D Gaussians (SA4D) to segment anything in the 4D digital world based on 4D Gaussians.
SA4D achieves precise, high-quality segmentation within seconds in 4D Gaussians and shows the ability to remove, recolor, compose, and render high-quality anything masks.
arXiv Detail & Related papers (2024-07-05T13:44:15Z)
- NSM4D: Neural Scene Model Based Online 4D Point Cloud Sequence Understanding [20.79861588128133]
We introduce a generic online 4D perception paradigm called NSM4D.
NSM4D serves as a plug-and-play strategy that can be adapted to existing 4D backbones.
We demonstrate significant improvements on various online perception benchmarks in indoor and outdoor settings.
arXiv Detail & Related papers (2023-10-12T13:42:49Z)
- Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition [108.07591240357306]
We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model.
We find that the crux is the less effective training of the "joint hard samples", which have high-confidence predictions on different wrong labels.
Our proposed invariant training strategy, called InvJoint, not only emphasizes training on the hard samples but also seeks invariance between the conflicting 2D and 3D ambiguous predictions.
arXiv Detail & Related papers (2023-08-18T17:43:12Z)
- Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery from Sparse Image Ensemble [72.3681707384754]
Hi-LASSIE performs 3D articulated reconstruction from only 20-30 online images in the wild without any user-defined shape or skeleton templates.
First, instead of relying on a manually annotated 3D skeleton, we automatically estimate a class-specific skeleton from the selected reference image.
Second, we improve the shape reconstructions with novel instance-specific optimization strategies that allow reconstructions to faithfully fit each instance.
arXiv Detail & Related papers (2022-12-21T14:31:33Z)
- Optimising 2D Pose Representation: Improve Accuracy, Stability and Generalisability Within Unsupervised 2D-3D Human Pose Estimation [7.294965109944706]
Our results show that the optimal representation of a 2D pose is two independent segments, the torso and legs, with no shared features between the lifting networks.
arXiv Detail & Related papers (2022-09-01T17:32:52Z)
- Synthetic Training for Monocular Human Mesh Recovery [100.38109761268639]
This paper aims to estimate 3D mesh of multiple body parts with large-scale differences from a single RGB image.
The main challenge is lacking training data that have complete 3D annotations of all body parts in 2D images.
We propose a depth-to-scale (D2S) projection to incorporate the depth difference into the projection function to derive per-joint scale variants.
arXiv Detail & Related papers (2020-10-27T03:31:35Z)
- PoseNet3D: Learning Temporally Consistent 3D Human Pose via Knowledge Distillation [6.023152721616894]
PoseNet3D takes 2D joints as input and outputs 3D skeletons and SMPL body model parameters.
We first train a teacher network that outputs 3D skeletons, using only 2D poses for training. The teacher network distills its knowledge to a student network that predicts 3D pose in SMPL representation.
Results on Human3.6M dataset for 3D human pose estimation demonstrate that our approach reduces the 3D joint prediction error by 18% compared to previous unsupervised methods.
arXiv Detail & Related papers (2020-03-07T00:10:59Z)
- V4D: 4D Convolutional Neural Networks for Video-level Representation Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representation with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
arXiv Detail & Related papers (2020-02-18T09:27:41Z)