Learning Human Action Recognition Representations Without Real Humans
- URL: http://arxiv.org/abs/2311.06231v1
- Date: Fri, 10 Nov 2023 18:38:14 GMT
- Title: Learning Human Action Recognition Representations Without Real Humans
- Authors: Howard Zhong, Samarth Mishra, Donghyun Kim, SouYoung Jin, Rameswar
Panda, Hilde Kuehne, Leonid Karlinsky, Venkatesh Saligrama, Aude Oliva,
Rogerio Feris
- Abstract summary: We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
- Score: 66.61527869763819
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training on massive video datasets has become essential to achieve high
action recognition performance on smaller downstream datasets. However, most
large-scale video datasets contain images of people and hence are accompanied
by issues related to privacy, ethics, and data protection, often preventing
them from being publicly shared for reproducible research. Existing work has
attempted to alleviate these problems by blurring faces, downsampling videos,
or training on synthetic data. On the other hand, analysis of the
transferability of privacy-preserving pre-trained models to downstream tasks
has been limited. In this work, we study this problem by first asking the
question: can we pre-train models for human action recognition with data that
does not include real humans? To this end, we present, for the first time, a
benchmark that leverages real-world videos with humans removed and synthetic
data containing virtual humans to pre-train a model. We then evaluate the
transferability of the representation learned on this data to a diverse set of
downstream action recognition benchmarks. Furthermore, we propose a novel
pre-training strategy, called Privacy-Preserving MAE-Align, to effectively
combine synthetic data and human-removed real data. Our approach outperforms
previous baselines by up to 5% and closes the performance gap between human and
no-human action recognition representations on downstream tasks, for both
linear probing and fine-tuning. Our benchmark, code, and models are available
at https://github.com/howardzh01/PPMA .
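The abstract names the two ingredients (MAE-style pre-training and supervised alignment) without spelling out the recipe. Below is a minimal PyTorch sketch of one plausible reading, in which a shared encoder is first pre-trained with masked reconstruction on human-removed real clips and then aligned with action labels on synthetic virtual-human clips. All class, function, and dimension choices here are illustrative assumptions, not the released PPMA code.

```python
# Minimal sketch (illustrative only) of a two-stage Privacy-Preserving MAE-Align.
import torch
import torch.nn as nn
import torch.nn.functional as F

IN_DIM = 3 * 4 * 32 * 32   # toy clip size: (C, T, H, W) = (3, 4, 32, 32)
FEAT_DIM = 128

class TinyVideoEncoder(nn.Module):
    """Stand-in for the video transformer encoder used by MAE-style methods."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(IN_DIM, FEAT_DIM), nn.ReLU())

    def forward(self, clips):                  # clips: (B, 3, T, H, W)
        return self.net(clips)                 # features: (B, FEAT_DIM)

def mae_pretrain_step(encoder, decoder, clips, mask_ratio=0.9):
    """Stage 1: masked reconstruction on human-removed real videos.
    A real MAE masks patch tokens; raw pixels are masked here for brevity."""
    mask = (torch.rand_like(clips) < mask_ratio).float()
    recon = decoder(encoder(clips * (1.0 - mask)))
    # Score the reconstruction only on the masked region, as in MAE.
    return F.mse_loss(recon * mask.flatten(1), (clips * mask).flatten(1))

def align_step(encoder, classifier, clips, labels):
    """Stage 2: supervised alignment on labeled synthetic virtual-human clips."""
    return F.cross_entropy(classifier(encoder(clips)), labels)

encoder = TinyVideoEncoder()
decoder = nn.Linear(FEAT_DIM, IN_DIM)          # toy pixel decoder
classifier = nn.Linear(FEAT_DIM, 10)           # 10 hypothetical action classes

real_no_humans = torch.randn(2, 3, 4, 32, 32)  # stage-1 batch (humans removed)
synthetic = torch.randn(2, 3, 4, 32, 32)       # stage-2 batch (virtual humans)
print(mae_pretrain_step(encoder, decoder, real_no_humans).item())
print(align_step(encoder, classifier, synthetic, torch.tensor([3, 7])).item())
```

Downstream transfer as evaluated in the paper would then either freeze the encoder and fit a linear probe, or fine-tune it end to end on the target action recognition benchmark.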
Related papers
- Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification [6.408114351192012]
In this paper, we first show that training privacy-preserving models in latent space is computationally more efficient and generalizes better.
However, models trained on synthetic data for specific downstream tasks still perform worse than those trained on real data; this discrepancy may be partly due to the sampling space being a subspace of the training videos.
arXiv Detail & Related papers (2024-11-07T18:32:00Z)
- Redefining Data Pairing for Motion Retargeting Leveraging a Human Body Prior [4.5409191511532505]
MR HuBo (Motion Retargeting leveraging a HUman BOdy prior) is a cost-effective and convenient method to collect high-quality upper-body paired <robot, human> pose data.
We also present a two-stage motion neural network that can be trained via supervised learning on a large amount of paired data.
arXiv Detail & Related papers (2024-09-20T04:32:54Z)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion model to combine generative pre-training on human videos with policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
- Training Robust Deep Physiological Measurement Models with Synthetic Video-based Data [11.31971398273479]
We propose measures to add real-world noise to synthetic physiological signals and corresponding facial videos.
Our results show that we were able to reduce the average MAE (mean absolute error) from 6.9 to 2.0.
arXiv Detail & Related papers (2023-11-09T13:55:45Z)
- Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
- Video-based Pose-Estimation Data as Source for Transfer Learning in Human Activity Recognition [71.91734471596433]
Human Activity Recognition (HAR) using on-body devices identifies specific human actions in unconstrained environments.
Previous works demonstrated that transfer learning is a good strategy for addressing scenarios with scarce data.
This paper proposes using datasets intended for human-pose estimation as a source for transfer learning.
arXiv Detail & Related papers (2022-12-02T18:19:36Z)
- PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision [3.5694949627557846]
We release PeopleSansPeople, a human-centric synthetic data generator.
It contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels.
arXiv Detail & Related papers (2021-12-17T02:33:31Z)
- Playing for 3D Human Recovery [88.91567909861442]
In this work, we obtain massive human sequences by playing a video game, with 3D ground truths annotated automatically.
Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine.
A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin.
arXiv Detail & Related papers (2021-10-14T17:49:42Z)
- Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization [0.0]
The proposed method takes as input real background images and populates them with human figures in various poses.
Benchmarking and evaluation on the corresponding tasks show that synthetic data can be effectively used as a supplement to real data.
arXiv Detail & Related papers (2021-06-28T08:07:31Z)
- Hidden Footprints: Learning Contextual Walkability from 3D Human Trails [70.01257397390361]
Current datasets only tell you where people are, not where they could be.
We first augment the set of valid, labeled walkable regions by propagating person observations between images, utilizing 3D information to create what we call hidden footprints.
We devise a training strategy designed for such sparse labels, combining a class-balanced classification loss with a contextual adversarial loss (a generic class-balanced loss is sketched after this list).
arXiv Detail & Related papers (2020-08-19T23:19:08Z)
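The Hidden Footprints entry above mentions a class-balanced classification loss for sparse labels. As a generic illustration (not that paper's exact formulation), the sketch below weights cross-entropy by inverse class frequency so that rare walkable-region labels are not drowned out by the dominant class; the contextual adversarial term is only indicated as a placeholder.

```python
# Generic class-balanced cross-entropy sketch (illustrative; the exact losses
# used in Hidden Footprints may differ).
import torch
import torch.nn.functional as F

def class_balanced_ce(logits, labels, num_classes):
    """Weight each class by its inverse frequency in the batch."""
    counts = torch.bincount(labels, minlength=num_classes).float()
    weights = counts.sum() / (num_classes * counts.clamp(min=1.0))
    return F.cross_entropy(logits, labels, weight=weights)

# Toy usage: binary walkable vs. non-walkable with a heavily imbalanced batch.
logits = torch.randn(8, 2)
labels = torch.tensor([0, 0, 0, 0, 0, 0, 0, 1])
cls_loss = class_balanced_ce(logits, labels, num_classes=2)
# The paper combines this with a contextual adversarial loss, conceptually:
# total_loss = cls_loss + lambda_adv * adversarial_loss  (not shown here)
```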