Skeleton-DML: Deep Metric Learning for Skeleton-Based One-Shot Action
Recognition
- URL: http://arxiv.org/abs/2012.13823v2
- Date: Mon, 8 Mar 2021 14:33:17 GMT
- Title: Skeleton-DML: Deep Metric Learning for Skeleton-Based One-Shot Action
Recognition
- Authors: Raphael Memmesheimer, Simon Häring, Nick Theisen, Dietrich Paulus
- Abstract summary: One-shot action recognition allows the recognition of human-performed actions with only a single training example.
This can positively influence human-robot interaction by enabling the robot to react to previously unseen behaviour.
We propose a novel image-based skeleton representation that performs well in a metric learning setting.
- Score: 0.5161531917413706
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: One-shot action recognition allows the recognition of human-performed
actions with only a single training example. This can positively influence
human-robot interaction by enabling the robot to react to previously unseen
behaviour. We formulate one-shot action recognition as a deep metric learning
problem and propose a novel image-based skeleton representation that performs
well in a metric learning setting. To this end, we train a model that projects
the image representations into an embedding space in which similar actions have
a low Euclidean distance and dissimilar actions have a higher distance. One-shot
action recognition then becomes a nearest-neighbor search in a set of activity
reference samples. We evaluate the performance of our proposed representation
against a variety of other skeleton-based image representations. In addition,
we present an ablation study that shows the influence of different embedding
vector sizes, losses and augmentations. Our approach lifts the state of the art
by 3.3% for the one-shot action recognition protocol on the NTU RGB+D 120
dataset under a comparable training setup; with additional augmentation, our
result improves by over 7.7%.
Related papers
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
- On the Benefits of 3D Pose and Tracking for Human Action Recognition [77.07134833715273]
We show the benefits of using tracking and 3D poses for action recognition.
We propose a Lagrangian Action Recognition model by fusing 3D pose and contextualized appearance over tracklets.
Our method achieves state-of-the-art performance on the AVA v2.2 dataset.
arXiv Detail & Related papers (2023-04-03T17:59:49Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- AROS: Affordance Recognition with One-Shot Human Stances [0.0]
We present AROS, a one-shot learning approach that uses an explicit representation of interactions between human poses and 3D scenes.
Given a 3D mesh of a previously unseen scene, we can predict affordance locations that support the interactions and generate corresponding articulated 3D human bodies around them.
Results show that our one-shot approach outperforms data-intensive baselines by up to 80%.
arXiv Detail & Related papers (2022-10-21T04:29:21Z)
- Hierarchical Compositional Representations for Few-shot Action Recognition [51.288829293306335]
We propose a novel hierarchical compositional representations (HCR) learning approach for few-shot action recognition.
We divide a complicated action into several sub-actions by carefully designed hierarchical clustering.
We also adopt the Earth Mover's Distance from the optimal transportation problem to measure the similarity between video samples in terms of sub-action representations.
arXiv Detail & Related papers (2022-08-19T16:16:59Z)
- A Training Method For VideoPose3D With Ideology of Action Recognition [0.9949781365631559]
This work presents a faster and more flexible training method for VideoPose3D based on action recognition.
It can handle both action-oriented and common pose-estimation problems.
arXiv Detail & Related papers (2022-06-13T19:25:27Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance compared with state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We address the few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z)
- Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization [33.36330493757669]
We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses.
The method trains a network using cross-view mutual information (CV-MIM) which maximizes mutual information of the same pose performed from different viewpoints.
CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting.
arXiv Detail & Related papers (2020-12-02T18:55:35Z)
- View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose [36.384824115033304]
We propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses.
Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views.
arXiv Detail & Related papers (2020-10-23T17:58:35Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)