MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition
- URL: http://arxiv.org/abs/2010.05599v2
- Date: Wed, 14 Oct 2020 07:07:05 GMT
- Title: MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition
- Authors: Lilang Lin, Sijie Song, Wenhan Yang and Jiaying Liu
- Abstract summary: We integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects.
Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we address self-supervised representation learning from human skeletons for action recognition. Previous methods, which usually learn feature representations from a single reconstruction task, may suffer from overfitting, and the resulting features do not generalize well to action recognition. Instead, we propose to integrate multiple tasks to learn more
general representations in a self-supervised manner. To realize this goal, we
integrate motion prediction, jigsaw puzzle recognition, and contrastive
learning to learn skeleton features from different aspects. Motion prediction models skeleton dynamics by forecasting future frames, while solving jigsaw puzzles captures temporal patterns that are critical for action recognition. We further regularize the feature space through contrastive learning. In addition, we explore different training strategies to utilize the knowledge from the self-supervised tasks for action recognition. We
evaluate our multi-task self-supervised learning approach with action
classifiers trained under different configurations, including unsupervised,
semi-supervised and fully-supervised settings. Our experiments on the NW-UCLA,
NTU RGB+D, and PKUMMD datasets show remarkable performance for action
recognition, demonstrating the superiority of our method in learning more
discriminative and general features. Our project website is available at
https://langlandslin.github.io/projects/MSL/.
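The abstract combines three pretext tasks on top of a shared skeleton encoder: motion prediction, jigsaw puzzle recognition, and contrastive learning. Below is a minimal PyTorch-style sketch of how such a multi-task objective could be wired together; the GRU encoder, head sizes, 75-dimensional joint input (25 joints x 3 coordinates, as in NTU RGB+D), 0.07 temperature, and equal loss weighting are assumptions of this illustration, not the authors' released implementation.

```python
# Illustrative sketch only: module choices and sizes are assumptions,
# not the MS^2L paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSSL(nn.Module):
    def __init__(self, in_dim=75, hid_dim=256, num_perms=24):
        super().__init__()
        # Shared sequence encoder over skeleton frames (a GRU stand-in).
        self.encoder = nn.GRU(in_dim, hid_dim, batch_first=True)
        # Motion-prediction decoder regresses future frames from the context.
        self.decoder = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.pred_head = nn.Linear(hid_dim, in_dim)
        # Jigsaw head classifies which temporal permutation was applied.
        self.jigsaw_head = nn.Linear(hid_dim, num_perms)
        # Projection head for the contrastive task.
        self.proj = nn.Linear(hid_dim, 128)

    def encode(self, x):                      # x: (B, T, in_dim)
        _, h = self.encoder(x)
        return h[-1]                          # (B, hid_dim) sequence embedding

    def forward(self, past, future, shuffled, perm_labels, view2):
        z = self.encode(past)
        # 1) Motion prediction: teacher-forced regression of future frames.
        dec_in = torch.cat([past[:, -1:], future[:, :-1]], dim=1)
        out, _ = self.decoder(dec_in, z.unsqueeze(0))
        loss_pred = F.mse_loss(self.pred_head(out), future)
        # 2) Jigsaw puzzle: recover the permutation of temporal segments.
        loss_jig = F.cross_entropy(
            self.jigsaw_head(self.encode(shuffled)), perm_labels)
        # 3) Contrastive regularization: align two augmented views of the
        #    same sequence (NT-Xent-style logits over the batch).
        q = F.normalize(self.proj(z), dim=1)
        k = F.normalize(self.proj(self.encode(view2)), dim=1)
        logits = q @ k.t() / 0.07             # (B, B); diagonal = positives
        targets = torch.arange(q.size(0), device=q.device)
        loss_con = F.cross_entropy(logits, targets)
        # Equal weighting is an arbitrary choice for this sketch.
        return loss_pred + loss_jig + loss_con
```

In practice, the jigsaw labels index a fixed set of temporal segment permutations (e.g., splitting a sequence into 4 segments gives 4! = 24 classes), and the three losses would typically carry tuned weights rather than the equal weighting used here.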
Related papers
- Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method that effectively leverages the rich knowledge of vision-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z)
- Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond [19.074841631219233]
Self-supervised learning (SSL) has been proven effective for skeleton-based action understanding.
In this paper, we conduct a comprehensive survey on self-supervised skeleton-based action representation learning.
arXiv Detail & Related papers (2024-06-05T06:21:54Z)
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z)
- Vision-Language Meets the Skeleton: Progressively Distillation with Cross-Modal Knowledge for 3D Action Representation Learning [20.34477942813382]
Skeleton-based action representation learning aims to interpret and understand human behaviors by encoding the skeleton sequences.
We introduce a novel skeleton-based training framework built on cross-modal contrastive learning.
Our method outperforms the previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-05-31T03:40:15Z)
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the clip features in an online fashion for the final class prediction.
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework by adopting both self-supervised and supervised visual pre-text tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- Accelerating exploration and representation learning with offline pre-training [52.6912479800592]
We show that exploration and representation learning can be improved by separately learning two different models from a single offline dataset.
We show that learning a state representation using noise-contrastive estimation and a model of auxiliary reward can significantly improve the sample efficiency on the challenging NetHack benchmark.
arXiv Detail & Related papers (2023-03-31T18:03:30Z)
- Identifying Auxiliary or Adversarial Tasks Using Necessary Condition Analysis for Adversarial Multi-task Video Understanding [34.75145779372538]
We propose a generalized notion of multi-task learning by incorporating both auxiliary tasks that the model should perform well on and adversarial tasks that the model should not perform well on.
Our proposed framework, Adversarial Multi-Task Neural Networks (AMT), penalizes adversarial tasks, which Necessary Condition Analysis (NCA) identifies as scene recognition.
We show that our approach improves accuracy by 3% and encourages the model to attend to action features instead of correlation-biasing scene features.
arXiv Detail & Related papers (2022-08-22T06:26:11Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- MT-Opt: Continuous Multi-Task Robotic Reinforcement Learning at Scale [103.7609761511652]
We show how a large-scale collective robotic learning system can acquire a repertoire of behaviors simultaneously.
New tasks can be continuously instantiated from previously learned tasks.
We train and evaluate our system on a set of 12 real-world tasks with data collected from 7 robots.
arXiv Detail & Related papers (2021-04-16T16:38:02Z)
- Anomaly Detection in Video via Self-Supervised and Multi-Task Learning [113.81927544121625]
Anomaly detection in video is a challenging computer vision problem.
In this paper, we approach anomalous event detection in video through self-supervised and multi-task learning at the object level.
arXiv Detail & Related papers (2020-11-15T10:21:28Z)