Creating a Large-scale Synthetic Dataset for Human Activity Recognition
- URL: http://arxiv.org/abs/2007.11118v1
- Date: Tue, 21 Jul 2020 22:20:21 GMT
- Title: Creating a Large-scale Synthetic Dataset for Human Activity Recognition
- Authors: Ollie Matthews, Koki Ryu, Tarun Srivastava
- Abstract summary: We use 3D rendering tools to generate a synthetic dataset of videos, and show that a classifier trained on these videos can generalise to real videos.
We fine-tune a pre-trained I3D model on our videos, and find that the model achieves 73% accuracy on the HMDB51 dataset over three classes.
- Score: 0.8250374560598496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Creating and labelling datasets of videos for use in training Human Activity Recognition models is an arduous task. In this paper, we approach this by using 3D rendering tools to generate a synthetic dataset of videos, and show that a classifier trained on these videos can generalise to real videos. We use five different augmentation techniques to generate the videos, leading to a wide variety of accurately labelled unique videos. We fine-tune a pre-trained I3D model on our videos, and find that the model is able to achieve a high accuracy of 73% on the HMDB51 dataset over three classes. We also find that augmenting the HMDB training set with our dataset provides a 2% improvement in the performance of the classifier. Finally, we discuss possible extensions to the dataset, including virtual try-on and modelling the motion of the people.
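
The fine-tuning setup described in the abstract can be sketched in PyTorch. This is a minimal sketch under stated assumptions, not the authors' code: torchvision's r3d_18 video backbone stands in for the pre-trained I3D model (an actual I3D checkpoint would be fine-tuned the same way), and the three-class head, dataset objects, and hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, ConcatDataset
from torchvision.models.video import r3d_18

NUM_CLASSES = 3  # the paper evaluates on three HMDB51 classes

# Load a video backbone pre-trained on Kinetics-400 and swap in a 3-class head.
# r3d_18 is a stand-in here; the paper fine-tunes a pre-trained I3D model.
model = r3d_18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def finetune(model, loader, epochs=5, device="cuda"):
    """Supervised fine-tuning over (clip, label) batches; each clip is a
    float tensor of shape (B, C, T, H, W)."""
    model.to(device).train()
    for _ in range(epochs):
        for clips, labels in loader:
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()

# Hypothetical datasets yielding (clip, label) pairs:
#   synthetic_ds  - clips rendered with the 3D pipeline described above
#   hmdb_train_ds - the three chosen HMDB51 training classes
# Fine-tune on synthetic clips only, or concatenate both sets to mirror
# the augmentation experiment (synthetic + HMDB51 training data):
# finetune(model, DataLoader(synthetic_ds, batch_size=8, shuffle=True))
# finetune(model, DataLoader(ConcatDataset([hmdb_train_ds, synthetic_ds]),
#                            batch_size=8, shuffle=True))
```

The commented ConcatDataset call corresponds to the experiment in which the synthetic videos are added to the HMDB training set; the reported 2% gain comes from that combined training, not from the sketch above.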
Related papers
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input.
We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z)
- OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset.
Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos.
Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
- 3D-VirtFusion: Synthetic 3D Data Augmentation through Generative Diffusion Models and Controllable Editing [52.68314936128752]
We propose a new paradigm to automatically generate 3D labeled training data by harnessing the power of pretrained large foundation models.
For each target semantic class, we first generate 2D images of a single object in various structures and appearances via diffusion models and ChatGPT-generated text prompts.
We transform these augmented images into 3D objects and construct virtual scenes by random composition.
arXiv Detail & Related papers (2024-08-25T09:31:22Z)
- Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks [0.0]
We propose to provide instructional datasets specific to the task of each modality within a distinct domain.
We use Video-LLaVA to generate recipes given cooking videos without transcripts.
Our approach to fine-tuning Video-LLaVA leads to gains over the baseline Video-LLaVA by 2% on the YouCook2 dataset.
arXiv Detail & Related papers (2024-06-24T06:39:02Z)
- Advancing Human Action Recognition with Foundation Models trained on Unlabeled Public Videos [2.3247413495885647]
We use 283,582 unique, unlabeled TikTok video clips, categorized into 386 hashtags, to train a domain-specific foundation model for action recognition.
Our model achieves state-of-the-art results: 99.05% on UCF101, 86.08% on HMDB51, 85.51% on Kinetics-400, and 74.27% on Something-Something V2 using the ViT-giant backbone.
arXiv Detail & Related papers (2024-02-14T00:41:10Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all listed content) and is not responsible for any consequences arising from its use.