OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
- URL: http://arxiv.org/abs/2502.01061v2
- Date: Thu, 13 Feb 2025 06:56:29 GMT
- Title: OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
- Authors: Gaojie Lin, Jianwen Jiang, Jiaqi Yang, Zerong Zheng, Chao Liang,
- Abstract summary: We propose a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase.
These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation.
Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs.
- Score: 25.45077656291886
- Abstract: End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up like large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).
Related papers
- DirectorLLM for Human-Centric Video Generation [46.37441947526771]
We introduce DirectorLLM, a novel video generation model that employs a large language model (LLM) to orchestrate human poses within videos.
Our model outperforms existing ones in generating videos with higher human motion fidelity, improved prompt faithfulness, and enhanced rendered subject naturalness.
arXiv Detail & Related papers (2024-12-19T03:10:26Z)
- Move-in-2D: 2D-Conditioned Human Motion Generation [54.067588636155115]
We propose Move-in-2D, a novel approach to generate human motion sequences conditioned on a scene image.
Our approach accepts both a scene image and text prompt as inputs, producing a motion sequence tailored to the scene.
arXiv Detail & Related papers (2024-12-17T18:58:07Z)
- Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions [31.134450087838673]
This work focuses on generating diverse whole-body motions for humanoid robots from language descriptions.
We leverage human motion priors from extensive human motion datasets to initialize humanoid motions.
Our approach demonstrates the capability to produce natural, expressive, and text-aligned humanoid motions.
arXiv Detail & Related papers (2024-10-16T17:48:50Z)
- Massively Multi-Person 3D Human Motion Forecasting with Scene Context [13.197408989895102]
We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion.
We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information.
Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study.
arXiv Detail & Related papers (2024-09-18T17:58:51Z)
- Body of Her: A Preliminary Study on End-to-End Humanoid Agent [0.8702432681310401]
We propose a real-time, duplex, interactive end-to-end network capable of modeling realistic agent behaviors.
This work performs a preliminary exploration of the end-to-end approach in this field, aiming to inspire further research towards scaling up.
arXiv Detail & Related papers (2024-08-06T01:13:09Z)
- HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation [64.37874983401221]
We present HumanVid, the first large-scale high-quality dataset tailored for human image animation.
For the real-world data, we compile a vast collection of real-world videos from the internet.
For the synthetic data, we collected 10K 3D avatar assets and leveraged existing assets of body shapes, skin textures and clothing.
arXiv Detail & Related papers (2024-07-24T17:15:58Z)
- Learning an Actionable Discrete Diffusion Policy via Large-Scale Actionless Video Pre-Training [69.54948297520612]
Learning a generalist embodied agent poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets.
We introduce a novel framework to tackle these challenges, which leverages a unified discrete diffusion to combine generative pre-training on human videos and policy fine-tuning on a small number of action-labeled robot videos.
Our method generates high-fidelity future videos for planning and enhances the fine-tuned policies compared to previous state-of-the-art approaches.
arXiv Detail & Related papers (2024-02-22T09:48:47Z)
- Image Comes Dancing with Collaborative Parsing-Flow Video Synthesis [124.48519390371636]
Transferring human motion from a source to a target person holds great potential for computer vision and graphics applications.
Previous work has either relied on crafted 3D human models or trained a separate model specifically for each target person.
This work studies a more general setting, in which we aim to learn a single model to parsimoniously transfer motion from a source video to any target person.
arXiv Detail & Related papers (2021-10-27T03:42:41Z)
- S3: Neural Shape, Skeleton, and Skinning Fields for 3D Human Modeling [103.65625425020129]
We represent the pedestrian's shape, pose and skinning weights as neural implicit functions that are directly learned from data.
We demonstrate the effectiveness of our approach on various datasets and show that our reconstructions outperform existing state-of-the-art methods.
arXiv Detail & Related papers (2021-01-17T02:16:56Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real-time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.