Related papers: A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights

URL: http://arxiv.org/abs/2407.08428v1
Date: Thu, 11 Jul 2024 12:09:05 GMT
Title: A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights
Authors: Wentao Lei, Jinting Wang, Fengji Ma, Guanjie Huang, Li Liu
Abstract summary: Human video generation aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose. Recent advancements in generative models have laid a solid foundation for the growing interest in this area. Despite the significant progress, the task of human video generation remains challenging due to the consistency of characters, the complexity of human motion, and difficulties in their relationship with the environment.
Score: 8.192172339127657
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Human video generation is a dynamic and rapidly evolving task that aims to synthesize 2D human body video sequences with generative models given control conditions such as text, audio, and pose. With the potential for wide-ranging applications in film, gaming, and virtual communication, the ability to generate natural and realistic human video is critical. Recent advancements in generative models have laid a solid foundation for the growing interest in this area. Despite the significant progress, the task of human video generation remains challenging due to the consistency of characters, the complexity of human motion, and difficulties in their relationship with the environment. This survey provides a comprehensive review of the current state of human video generation, marking, to the best of our knowledge, the first extensive literature review in this domain. We start with an introduction to the fundamentals of human video generation and the evolution of generative models that have facilitated the field's growth. We then examine the main methods employed for three key sub-tasks within human video generation: text-driven, audio-driven, and pose-driven motion generation. These areas are explored concerning the conditions that guide the generation process. Furthermore, we offer a collection of the most commonly utilized datasets and the evaluation metrics that are crucial in assessing the quality and realism of generated videos. The survey concludes with a discussion of the current challenges in the field and suggests possible directions for future research. The goal of this survey is to offer the research community a clear and holistic view of the advancements in human video generation, highlighting the milestones achieved and the challenges that lie ahead.

Related papers

3D Human Interaction Generation: A Survey [25.736432845850576]
3D human interaction generation focuses on producing dynamic and contextually relevant interactions between humans and interactive entities. Recent advancements in 3D model representation methods, motion capture technologies, and generative models have laid a solid foundation for the growing interest in this domain. Despite the rapid advancements in this area, challenges remain due to the need for naturalness in human motion generation and the accurate interaction between humans and interactive entities.
arXiv Detail & Related papers (2025-03-17T12:47:33Z)
What Are You Doing? A Closer Look at Controllable Human Video Generation [73.89117620413724]
'What Are You Doing?' is a new benchmark for the evaluation of controllable image-to-video generation of humans. It consists of 1,544 captioned videos that have been meticulously collected and annotated with 56 fine-grained categories. We perform in-depth analyses of seven state-of-the-art models in controllable image-to-video generation.
arXiv Detail & Related papers (2025-03-06T17:59:29Z)
ASurvey: Spatiotemporal Consistency in Video Generation [72.82267240482874]
Video generation, by leveraging a dynamic visual generation method, pushes the boundaries of Artificial Intelligence Generated Content (AIGC). Recent works have aimed at addressing the temporal consistency issue in video generation, while few literature reviews have been organized from this perspective. We systematically review recent advances in video generation, covering five key aspects: foundation models, information representations, generation schemes, post-processing techniques, and evaluation metrics.
arXiv Detail & Related papers (2025-02-25T05:20:51Z)
OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation [27.516068877910254]
We introduce OpenHumanVid, a large-scale and high-quality human-centric video dataset. Our findings yield two critical insights: First, the incorporation of a large-scale, high-quality dataset substantially enhances evaluation metrics for generated human videos. Second, the effective alignment of text with human appearance, human motion, and facial motion is essential for producing high-quality video outputs.
arXiv Detail & Related papers (2024-11-28T07:01:06Z)
Deepfake Generation and Detection: A Benchmark and Survey [134.19054491600832]
Deepfake is a technology dedicated to creating highly realistic facial images and videos under specific conditions. This survey comprehensively reviews the latest developments in deepfake generation and detection. We focus on researching four representative deepfake fields: face swapping, face reenactment, talking face generation, and facial attribute editing.
arXiv Detail & Related papers (2024-03-26T17:12:34Z)
A Survey on Long Video Generation: Challenges, Methods, and Prospects [36.58662591921549]
This paper presents the first survey of recent advancements in long video generation. We summarise them into two key paradigms: divide-and-conquer and temporal autoregressive. We offer a comprehensive overview and classification of the datasets and evaluation metrics which are crucial for advancing long video generation research.
arXiv Detail & Related papers (2024-03-25T03:47:53Z)
Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks. It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection. Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
arXiv Detail & Related papers (2024-03-13T16:05:18Z)
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [30.245348014602577]
We discuss the evolution of video generation from text, starting with animating MNIST numbers to simulating the physical world with Sora. Our review of the shortcomings of Sora-generated videos pinpoints the call for more in-depth studies in various enabling aspects of video generation. We conclude that the study of text-to-video generation may still be in its infancy, requiring contributions from the cross-discipline research community.
arXiv Detail & Related papers (2024-03-08T07:58:13Z)
Human Motion Generation: A Survey [67.38982546213371]
Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications. Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts. We present a comprehensive literature review of human motion generation, which is the first of its kind in this field.
arXiv Detail & Related papers (2023-07-20T14:15:20Z)
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation [11.948557523215316]
The automatic generation of co-speech gestures is a long-standing problem in computer animation. Gesture generation has seen surging interest recently, owing to the emergence of more and larger datasets of human gesture motion. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models.
arXiv Detail & Related papers (2023-01-13T00:20:05Z)
StyleGAN-Human: A Data-Centric Odyssey of Human Generation [96.7080874757475]
This work takes a data-centric perspective and investigates multiple critical aspects in "data engineering". We collect and annotate a large-scale human image dataset with over 230K samples capturing diverse poses and textures. We rigorously investigate three essential factors in data engineering for StyleGAN-based human generation, namely data size, data distribution, and data alignment.
arXiv Detail & Related papers (2022-04-25T17:55:08Z)
Deep Person Generation: A Survey from the Perspective of Face, Pose and Cloth Synthesis [55.72674354651122]
We first summarize the scope of person generation, then systematically review recent progress and technical trends in deep person generation. More than two hundred papers are covered for a thorough overview, and the milestone works are highlighted to witness the major technical breakthrough. We hope this survey could shed some light on the future prospects of deep person generation, and provide a helpful foundation for full applications towards digital human.
arXiv Detail & Related papers (2021-09-05T14:15:24Z)
Human Motion Transfer from Poses in the Wild [61.6016458288803]
We tackle the problem of human motion transfer, where we synthesize novel motion video for a target person that imitates the movement from a reference video. It is a video-to-video translation task in which the estimated poses are used to bridge two domains. We introduce a novel pose-to-video translation framework for generating high-quality videos that are temporally coherent even for in-the-wild pose sequences unseen during training.
arXiv Detail & Related papers (2020-04-07T05:59:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.