Multi-dataset Training of Transformers for Robust Action Recognition
- URL: http://arxiv.org/abs/2209.12362v2
- Date: Tue, 27 Sep 2022 02:57:26 GMT
- Title: Multi-dataset Training of Transformers for Robust Action Recognition
- Authors: Junwei Liang, Enwei Zhang, Jun Zhang, Chunhua Shen
- Abstract summary: We study the task of learning robust feature representations, aiming to generalize well across multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2.
- Score: 75.5695991766902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study the task of learning robust feature representations that
generalize well across multiple datasets for action recognition. We build our
method on Transformers for their efficacy. Although we have witnessed great
progress in video action recognition over the past decade, it remains
challenging, yet valuable, to train a single model that performs well across multiple
datasets. Here, we propose a novel multi-dataset training paradigm, MultiTrain,
with the design of two new loss terms, namely informative loss and projection
loss, aiming to learn robust representations for action recognition. In
particular, the informative loss maximizes the expressiveness of the feature
embedding while the projection loss for each dataset mines the intrinsic
relations between classes across datasets. We verify the effectiveness of our
method on five challenging datasets: Kinetics-400, Kinetics-700,
Moments-in-Time, ActivityNet, and Something-Something-v2. Extensive
experimental results show that our method consistently improves
state-of-the-art performance.
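For intuition, here is a minimal, hypothetical PyTorch-style sketch of multi-dataset training with a shared backbone and one classifier head per dataset. The concrete forms of `informative_loss` and `projection_loss` below (a feature-decorrelation penalty and a cross-head agreement term) are illustrative stand-ins, not the paper's actual formulations; all names, shapes, and loss weights are assumptions.

```python
# Illustrative sketch only -- NOT the actual MultiTrain losses from the paper.
# A shared backbone produces one embedding; each dataset gets its own linear head.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiDatasetModel(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, classes_per_dataset: list):
        super().__init__()
        self.backbone = backbone                     # e.g. a video Transformer (assumed)
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, c) for c in classes_per_dataset]
        )

    def forward(self, clips: torch.Tensor, dataset_id: int):
        feats = self.backbone(clips)                 # (B, feat_dim)
        return feats, self.heads[dataset_id](feats)  # embedding + per-dataset logits


def informative_loss(feats: torch.Tensor) -> torch.Tensor:
    # Stand-in for "maximize expressiveness of the embedding": penalize
    # off-diagonal feature correlations so embedding dimensions are not redundant.
    z = (feats - feats.mean(0)) / (feats.std(0) + 1e-6)
    corr = (z.T @ z) / feats.shape[0]
    off_diag = corr - torch.diag(torch.diag(corr))
    return off_diag.pow(2).mean()


def projection_loss(feats: torch.Tensor, head_a: nn.Module, head_b: nn.Module) -> torch.Tensor:
    # Stand-in for "mine intrinsic relations between classes across datasets":
    # encourage two datasets' heads to induce similar sample-similarity structure.
    p_a = F.softmax(head_a(feats), dim=-1)
    p_b = F.softmax(head_b(feats), dim=-1)
    return F.mse_loss(p_a @ p_a.T, p_b @ p_b.T)


def training_step(model, clips, labels, dataset_id, other_id, lam_info=0.1, lam_proj=0.1):
    feats, logits = model(clips, dataset_id)
    loss = F.cross_entropy(logits, labels)           # standard supervised term
    loss = loss + lam_info * informative_loss(feats)
    loss = loss + lam_proj * projection_loss(feats, model.heads[dataset_id], model.heads[other_id])
    return loss
```

The sketch assumes each batch comes from a single dataset, with `other_id` naming a second dataset for the cross-dataset term; it only shows where such regularizers would plug into an ordinary cross-entropy training loop.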
Related papers
- DVANet: Disentangling View and Action Features for Multi-View Action Recognition [56.283944756315066]
We present a novel approach to multi-view action recognition where we guide learned action representations to be separated from view-relevant information in a video.
Our model and training method significantly outperform all other uni-modal models on four multi-view action recognition datasets.
arXiv Detail & Related papers (2023-12-10T01:19:48Z)
- Leveraging the Power of Data Augmentation for Transformer-based Tracking [64.46371987827312]
We propose two data augmentation methods customized for tracking.
First, we optimize existing random cropping via a dynamic search radius mechanism and simulation for boundary samples.
Second, we propose a token-level feature mixing augmentation strategy, which strengthens the model against challenges such as background interference.
arXiv Detail & Related papers (2023-09-15T09:18:54Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Weakly Supervised Two-Stage Training Scheme for Deep Video Fight Detection Model [0.0]
Fight detection in videos is an emerging deep learning application, driven by today's prevalence of surveillance systems and streaming media.
Previous work has largely relied on action recognition techniques to tackle this problem.
We design the fight detection model as a composition of an action-aware feature extractor and an anomaly score generator.
arXiv Detail & Related papers (2022-09-23T08:29:16Z)
- Self-Supervised Human Activity Recognition with Localized Time-Frequency Contrastive Representation Learning [16.457778420360537]
We propose a self-supervised learning solution for human activity recognition with smartphone accelerometer data.
We develop a model that learns strong representations from accelerometer signals, while reducing the model's reliance on class labels.
We evaluate the performance of the proposed solution on three datasets, namely MotionSense, HAPT, and HHAR.
arXiv Detail & Related papers (2022-08-26T22:47:18Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on a source dataset, but unavailable on a target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks; a toy version of the segment-permutation idea is sketched after this entry.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
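As a companion to the segment-permutation description above, here is a hypothetical sketch of a permutation-prediction pretext task for skeleton sequences. All details (three temporal segments, a GRU encoder, tensor shapes) are assumptions for illustration, not the cited paper's implementation.

```python
# Illustrative pretext task: shuffle temporal segments of a skeleton sequence
# and train a classifier to recognize which permutation was applied (no action labels).
import itertools
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SEGMENTS = 3
PERMS = list(itertools.permutations(range(NUM_SEGMENTS)))   # 3! = 6 permutation classes


def permute_segments(seq: torch.Tensor):
    """Split a (T, J, C) skeleton sequence into temporal segments, shuffle them,
    and return the shuffled sequence plus the permutation's class index."""
    segments = list(torch.chunk(seq, NUM_SEGMENTS, dim=0))
    label = random.randrange(len(PERMS))
    shuffled = torch.cat([segments[i] for i in PERMS[label]], dim=0)
    return shuffled, label


class PermutationClassifier(nn.Module):
    # Any sequence encoder works here; a small GRU keeps the sketch short.
    def __init__(self, joints: int, channels: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(joints * channels, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, len(PERMS))

    def forward(self, seq: torch.Tensor):            # (B, T, J, C)
        b, t, j, c = seq.shape
        _, h = self.encoder(seq.reshape(b, t, j * c))
        return self.fc(h[-1])                        # logits over permutation classes


def pretext_step(model: PermutationClassifier, batch: torch.Tensor) -> torch.Tensor:
    # Self-supervised step: the only label is the permutation index we created.
    shuffled, labels = zip(*(permute_segments(s) for s in batch))
    return F.cross_entropy(model(torch.stack(shuffled)), torch.tensor(labels))
```

Because the supervision signal is generated from the data itself, the same pretext task can be run on both source and target skeleton datasets, which is how such schemes help reduce domain shift.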
- Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers [20.343814813409537]
We study three important factors which can impact the quality of learned representations: pretraining data, the attention mechanism, and loss functions.
By pretraining models on six datasets, we observe that dataset noise and language similarity to our downstream task are important indicators of model performance.
We show that successful contrastive losses used in the self-supervised learning literature do not yield similar performance gains when used in multimodal transformers.
arXiv Detail & Related papers (2021-01-31T20:36:41Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate our approach on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.