ProFormer: Learning Data-efficient Representations of Body Movement with
Prototype-based Feature Augmentation and Visual Transformers
- URL: http://arxiv.org/abs/2202.11423v1
- Date: Wed, 23 Feb 2022 11:11:54 GMT
- Title: ProFormer: Learning Data-efficient Representations of Body Movement with
Prototype-based Feature Augmentation and Visual Transformers
- Authors: Kunyu Peng, Alina Roitberg, Kailun Yang, Jiaming Zhang, Rainer
Stiefelhagen
- Abstract summary: Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
- Score: 31.908276711898548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically understanding human behaviour allows household robots to
identify the most critical needs and plan how to assist the human according to
the current situation. However, the majority of such methods are developed
under the assumption that a large number of labelled training examples is
available for all concepts-of-interest. Robots, on the other hand, operate in
constantly changing unstructured environments, and need to adapt to novel
action categories from very few samples. Methods for data-efficient recognition
from body poses increasingly leverage skeleton sequences structured as
image-like arrays, which are then used as input to convolutional neural
networks. We
look at this paradigm from the perspective of transformer networks, for the
first time exploring visual transformers as data-efficient encoders of skeleton
movement. In our pipeline, body pose sequences cast as image-like
representations are converted into patch embeddings and then passed to a visual
transformer backbone optimized with deep metric learning. Inspired by recent
success of feature enhancement methods in semi-supervised learning, we further
introduce ProFormer -- an improved training strategy which uses soft-attention
applied on iteratively estimated action category prototypes used to augment the
embeddings and compute an auxiliary consistency loss. Extensive experiments
consistently demonstrate the effectiveness of our approach for one-shot
recognition from body poses, achieving state-of-the-art results on multiple
datasets and surpassing the best published approach on the challenging NTU-120
one-shot benchmark by 1.84%. Our code will be made publicly available at
https://github.com/KPeng9510/ProFormer.
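As a reading aid, the following is a minimal PyTorch-style sketch of the pipeline described in the abstract: a skeleton sequence is cast as an image-like array, split into patch embeddings, encoded by a transformer backbone, and the resulting embedding is mixed with a soft-attention weighting over class prototypes to form an auxiliary consistency loss. All module names, tensor shapes, and hyperparameters below are illustrative assumptions and do not reflect the authors' released implementation.

# Minimal sketch of a ProFormer-style pipeline (assumed shapes and names).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonViT(nn.Module):
    """Casts a skeleton sequence (T frames x J joints x 3 coords) as an
    image-like array, splits it into patch embeddings, and encodes it with
    a small transformer backbone."""
    def __init__(self, frames=64, joints=25, dim=256, patch=8, depth=4, heads=4):
        super().__init__()
        # Treat (x, y, z) as 3 "channels" of a frames-by-joints pseudo-image.
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (frames // patch) * (joints // patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, poses):                            # poses: (B, T, J, 3)
        x = poses.permute(0, 3, 1, 2)                    # (B, 3, T, J)
        x = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        x = self.encoder(x + self.pos)
        return F.normalize(x.mean(dim=1), dim=-1)        # sequence embedding

def prototype_augment(embeddings, prototypes, tau=0.1):
    """Soft attention over per-class prototypes: each embedding is mixed with
    a prototype-weighted summary, which can feed a consistency loss."""
    attn = torch.softmax(embeddings @ prototypes.T / tau, dim=-1)   # (B, C)
    augmented = F.normalize(embeddings + attn @ prototypes, dim=-1)
    return augmented, attn

# Usage: a metric-learning loss would act on the plain embeddings, while an
# auxiliary consistency term ties them to their prototype-augmented versions.
model = SkeletonViT()
poses = torch.randn(8, 64, 25, 3)                        # dummy mini-batch
prototypes = F.normalize(torch.randn(20, 256), dim=-1)   # re-estimated during training
z = model(poses)
z_aug, _ = prototype_augment(z, prototypes)
consistency_loss = 1 - F.cosine_similarity(z, z_aug).mean()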
Related papers
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI yields improvements of 10% to 64% over the previous state-of-the-art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery [78.43828998065071]
Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks.
Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amounts of unlabelled data.
In this paper, we revisit transformer pre-training and leverage multi-scale information that is effectively utilized across multiple modalities.
arXiv Detail & Related papers (2024-03-08T16:18:04Z)
- Data Augmentation and Transfer Learning Approaches Applied to Facial Expressions Recognition [0.3481985817302898]
We propose a novel data augmentation technique that improves performance on the recognition task.
We build GAN models from scratch that are able to generate new synthetic images for each emotion type.
On the augmented datasets we fine-tune pretrained convolutional neural networks with different architectures.
arXiv Detail & Related papers (2024-02-15T14:46:03Z)
- MENTOR: Human Perception-Guided Pretraining for Increased Generalization [5.596752018167751]
We introduce MENTOR (huMan pErceptioN-guided preTraining fOr increased geneRalization)
We train an autoencoder to learn human saliency maps given an input image, without class labels.
We remove the decoder part, add a classification layer on top of the encoder, and fine-tune this new model conventionally.
arXiv Detail & Related papers (2023-10-30T13:50:44Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training [0.304585143845864]
In this paper, we consider the recently proposed Bottleneck Transformers, which combine CNN and multi-head self attention (MHSA) layers effectively.
We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
arXiv Detail & Related papers (2022-04-21T15:45:05Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters and maintaining high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery.
It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm.
It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
- Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations across modalities for gesture recognition.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
- Self-Supervised Human Activity Recognition by Augmenting Generative Adversarial Networks [0.0]
This article proposes a novel approach for augmenting generative adversarial network (GAN) with a self-supervised task.
In the proposed method, input video frames are randomly transformed by different spatial transformations.
The discriminator is encouraged to predict the applied transformation through an auxiliary loss.
arXiv Detail & Related papers (2020-08-26T18:28:17Z)