Related papers: Learning Human Action Recognition Representations Without Real Humans

URL: http://arxiv.org/abs/2311.06231v1
Date: Fri, 10 Nov 2023 18:38:14 GMT
Title: Learning Human Action Recognition Representations Without Real Humans
Authors: Howard Zhong, Samarth Mishra, Donghyun Kim, SouYoung Jin, Rameswar Panda, Hilde Kuehne, Leonid Karlinsky, Venkatesh Saligrama, Aude Oliva, Rogerio Feris
Abstract summary: We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model. We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks. Our approach outperforms previous baselines by up to 5%.
Score: 66.61527869763819
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Pre-training on massive video datasets has become essential to achieve high action recognition performance on smaller downstream datasets. However, most large-scale video datasets contain images of people and hence are accompanied by issues related to privacy, ethics, and data protection, often preventing them from being publicly shared for reproducible research. Existing work has attempted to alleviate these problems by blurring faces, downsampling videos, or training on synthetic data. On the other hand, analysis of the transferability of privacy-preserving pre-trained models to downstream tasks has been limited. In this work, we study this problem by first asking the question: can we pre-train models for human action recognition with data that does not include real humans? To this end, we present, for the first time, a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model. We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks. Furthermore, we propose a novel pre-training strategy, called Privacy-Preserving MAE-Align, to effectively combine synthetic data and human-removed real data. Our approach outperforms previous baselines by up to 5% and closes the performance gap between human and no-human action recognition representations on downstream tasks, for both linear probing and fine-tuning. Our benchmark, code, and models are available at https://github.com/howardzh01/PPMA .

Related papers

Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning [50.76723760768117]
Existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. We find that human appearance can provide a straightforward cue to address these obstacles. We propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearances, social proxemics, and physical laws.
arXiv Detail & Related papers (2025-07-03T12:19:26Z)
Synthetic Human Action Video Data Generation with Pose Transfer [0.7366405857677227]
This paper proposes a method for generating synthetic human action video data using pose transfer. We evaluate this method on the Toyota Smarthome and NTU RGB+D datasets and show that it improves performance in action recognition tasks.
arXiv Detail & Related papers (2025-06-11T05:52:39Z)
Human Body Restoration with One-Step Diffusion Model and A New Benchmark [74.66514054623669]
We propose a high-quality dataset automated cropping and filtering (HQ-ACF) pipeline. This pipeline leverages existing object detection datasets and other unlabeled images to automatically crop and filter high-quality human images. We also propose OSDHuman, a novel one-step diffusion model for human body restoration.
arXiv Detail & Related papers (2025-02-03T14:48:40Z)
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
Uncovering Hidden Subspaces in Video Diffusion Models Using Re-Identification [6.408114351192012]
We show that models trained on synthetic data for specific downstream tasks still perform worse than those trained on real data. This discrepancy may be partly due to the sampling space being a subspace of the training videos. In this paper, we first show that training privacy-preserving models in latent space is computationally more efficient and generalizes better.
arXiv Detail & Related papers (2024-11-07T18:32:00Z)
Redefining Data Pairing for Motion Retargeting Leveraging a Human Body Prior [4.5409191511532505]
MR HuBo (Motion Retargeting leveraging a HUman BOdy prior) is a cost-effective and convenient method to collect high-quality upper body paired <robot, human> pose data. We also present a two-stage motion neural network that can be trained via supervised learning on a large amount of paired data.
arXiv Detail & Related papers (2024-09-20T04:32:54Z)
Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos. Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
Training Robust Deep Physiological Measurement Models with Synthetic Video-based Data [11.31971398273479]
We propose measures to add real-world noise to synthetic physiological signals and corresponding facial videos. Our results show that we were able to reduce the average MAE from 6.9 to 2.0.
arXiv Detail & Related papers (2023-11-09T13:55:45Z)
Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data. It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z)
Video-based Pose-Estimation Data as Source for Transfer Learning in Human Activity Recognition [71.91734471596433]
Human Activity Recognition (HAR) using on-body devices identifies specific human actions in unconstrained environments. Previous works demonstrated that transfer learning is a good strategy for addressing scenarios with scarce data. This paper proposes using datasets intended for human-pose estimation as a source for transfer learning.
arXiv Detail & Related papers (2022-12-02T18:19:36Z)
PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision [3.5694949627557846]
We release a human-centric synthetic data generator PeopleSansPeople. It contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels.
arXiv Detail & Related papers (2021-12-17T02:33:31Z)
Playing for 3D Human Recovery [88.91567909861442]
In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin.
arXiv Detail & Related papers (2021-10-14T17:49:42Z)
Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization [0.0]
The proposed method takes as input real background images and populates them with human figures in various poses. A benchmarking and evaluation in the corresponding tasks shows that synthetic data can be effectively used as a supplement to real data.
arXiv Detail & Related papers (2021-06-28T08:07:31Z)
Hidden Footprints: Learning Contextual Walkability from 3D Human Trails [70.01257397390361]
Current datasets only tell you where people are, not where they could be. We first augment the set of valid, labeled walkable regions by propagating person observations between images, utilizing 3D information to create what we call hidden footprints. We devise a training strategy designed for such sparse labels, combining a class-balanced classification loss with a contextual adversarial loss.
arXiv Detail & Related papers (2020-08-19T23:19:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.