A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights
- URL: http://arxiv.org/abs/2407.08428v1
- Date: Thu, 11 Jul 2024 12:09:05 GMT
- Title: A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights
- Authors: Wentao Lei, Jinting Wang, Fengji Ma, Guanjie Huang, Li Liu
- Abstract summary: Human video generation aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose.
Recent advancements in generative models have laid a solid foundation for the growing interest in this area.
Despite the significant progress, the task of human video generation remains challenging due to the need for character consistency, the complexity of human motion, and the difficulty of modeling how humans interact with their environment.
- Score: 8.192172339127657
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human video generation is a dynamic and rapidly evolving task that aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose. With the potential for wide-ranging applications in film, gaming, and virtual communication, the ability to generate natural and realistic human video is critical. Recent advancements in generative models have laid a solid foundation for the growing interest in this area. Despite the significant progress, the task of human video generation remains challenging due to the consistency of characters, the complexity of human motion, and difficulties in their relationship with the environment. This survey provides a comprehensive review of the current state of human video generation, marking, to the best of our knowledge, the first extensive literature review in this domain. We start with an introduction to the fundamentals of human video generation and the evolution of generative models that have facilitated the field's growth. We then examine the main methods employed for three key sub-tasks within human video generation: text-driven, audio-driven, and pose-driven motion generation. These areas are explored concerning the conditions that guide the generation process. Furthermore, we offer a collection of the most commonly utilized datasets and the evaluation metrics that are crucial in assessing the quality and realism of generated videos. The survey concludes with a discussion of the current challenges in the field and suggests possible directions for future research. The goal of this survey is to offer the research community a clear and holistic view of the advancements in human video generation, highlighting the milestones achieved and the challenges that lie ahead.
Related papers
- Towards Generalist Robot Learning from Internet Video: A Survey [56.621902345314645]
Scaling deep learning to huge internet-scraped datasets has yielded remarkably general capabilities in natural language processing and visual understanding and generation.
In robotics, however, data is scarce and expensive to collect; as a result, robot learning has struggled to match the generality of capabilities observed in other domains.
Learning from Videos (LfV) methods seek to address this data bottleneck by augmenting traditional robot data with large internet-scraped video datasets.
arXiv Detail & Related papers (2024-04-30T15:57:41Z)
- Deepfake Generation and Detection: A Benchmark and Survey [134.19054491600832]
Deepfake is a technology dedicated to creating highly realistic facial images and videos under specific conditions.
This survey comprehensively reviews the latest developments in deepfake generation and detection.
We focus on researching four representative deepfake fields: face swapping, face reenactment, talking face generation, and facial attribute editing.
arXiv Detail & Related papers (2024-03-26T17:12:34Z)
- A Survey on Long Video Generation: Challenges, Methods, and Prospects [36.58662591921549]
This paper presents the first survey of recent advancements in long video generation.
We summarise them into two key paradigms: divide-and-conquer and temporal autoregressive generation.
We offer a comprehensive overview and classification of the datasets and evaluation metrics which are crucial for advancing long video generation research.
arXiv Detail & Related papers (2024-03-25T03:47:53Z)
- Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks.
It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection.
Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
arXiv Detail & Related papers (2024-03-13T16:05:18Z)
- Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [30.245348014602577]
We discuss the evolution of video generation from text, starting with animating MNIST numbers to simulating the physical world with Sora.
Our review of the shortcomings of Sora-generated videos highlights the need for more in-depth study of various enabling aspects of video generation.
We conclude that the study of text-to-video generation may still be in its infancy, requiring contributions from a cross-disciplinary research community.
arXiv Detail & Related papers (2024-03-08T07:58:13Z)
- Human Motion Generation: A Survey [67.38982546213371]
Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications.
Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts.
We present a comprehensive literature review of human motion generation, which is the first of its kind in this field.
arXiv Detail & Related papers (2023-07-20T14:15:20Z)
- A Comprehensive Review of Data-Driven Co-Speech Gesture Generation [11.948557523215316]
The automatic generation of co-speech gestures is a long-standing problem in computer animation.
Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion.
This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models.
arXiv Detail & Related papers (2023-01-13T00:20:05Z)
- StyleGAN-Human: A Data-Centric Odyssey of Human Generation [96.7080874757475]
This work takes a data-centric perspective and investigates multiple critical aspects of "data engineering".
We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures.
We rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
arXiv Detail & Related papers (2022-04-25T17:55:08Z)
- Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis [55.72674354651122]
We first summarize the scope of person generation, then systematically review recent progress and technical trends in deep person generation.
More than two hundred papers are covered for a thorough overview, and milestone works are highlighted to trace the major technical breakthroughs.
We hope this survey sheds light on the future prospects of deep person generation and provides a helpful foundation for applications toward digital humans.
arXiv Detail & Related papers (2021-09-05T14:15:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.