A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing
- URL: http://arxiv.org/abs/2406.10553v2
- Date: Tue, 18 Jun 2024 07:39:51 GMT
- Title: A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing
- Authors: Ming Meng, Yufei Zhao, Bo Zhang, Yonggui Zhu, Weimin Shi, Maxwell Wen, Zhaoxin Fan
- Abstract summary: Talking head synthesis is an advanced method for generating portrait videos from a still image driven by specific content.
This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques.
- Score: 8.171572460041823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality, and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
Related papers
- A Survey on Vision Autoregressive Model [15.042485771127346]
Autoregressive models have demonstrated impressive performance in natural language processing (NLP).
Inspired by their notable success in the NLP field, autoregressive models have recently been intensively investigated for computer vision.
This paper provides a systematic review of vision autoregressive models, including the development of a taxonomy of existing methods.
arXiv Detail & Related papers (2024-11-13T14:59:41Z)
- Video Diffusion Models: A Survey [3.7985353171858045]
Diffusion generative models have recently become a powerful technique for creating and modifying high-quality, coherent video content.
This survey provides an overview of the critical components of diffusion models for video generation, including their applications, architectural design, and temporal dynamics modeling.
arXiv Detail & Related papers (2024-05-06T04:01:42Z)
- Deepfake Generation and Detection: A Benchmark and Survey [134.19054491600832]
Deepfake is a technology dedicated to creating highly realistic facial images and videos under specific conditions.
This survey comprehensively reviews the latest developments in deepfake generation and detection.
We focus on four representative deepfake fields: face swapping, face reenactment, talking face generation, and facial attribute editing.
arXiv Detail & Related papers (2024-03-26T17:12:34Z)
- Recent Trends in 3D Reconstruction of General Non-Rigid Scenes [104.07781871008186]
Reconstructing models of the real world, including 3D geometry, appearance, and motion of real scenes, is essential for computer graphics and computer vision.
It enables the synthesis of photorealistic novel views, which is useful for the movie industry and AR/VR applications.
This state-of-the-art report (STAR) offers the reader a comprehensive summary of state-of-the-art techniques with monocular and multi-view inputs.
arXiv Detail & Related papers (2024-03-22T09:46:11Z)
- A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
- Transformers and Language Models in Form Understanding: A Comprehensive Review of Scanned Document Analysis [16.86139440201837]
We focus on the topic of form understanding in the context of scanned documents.
Our research methodology involves an in-depth analysis of popular document and form understanding trends over the last decade.
We showcase how transformers have propelled the field forward, revolutionizing form-understanding techniques.
arXiv Detail & Related papers (2024-03-06T22:22:02Z)
- Controllable Topic-Focused Abstractive Summarization [57.8015120583044]
Controlled abstractive summarization focuses on producing condensed versions of a source article to cover specific aspects.
This paper presents a new Transformer-based architecture capable of producing topic-focused summaries.
arXiv Detail & Related papers (2023-11-12T03:51:38Z)
- State of the Art on Diffusion Models for Visual Computing [191.6168813012954]
This report introduces the basic mathematical concepts of diffusion models, as well as the implementation details and design choices of the popular Stable Diffusion model.
We also give a comprehensive overview of the rapidly growing literature on diffusion-based generation and editing.
We discuss available datasets, metrics, open challenges, and social implications.
arXiv Detail & Related papers (2023-10-11T05:32:29Z)
- From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications [3.8301843990331887]
Recent advancements in deep learning and computer vision have led to a surge of interest in generating realistic talking heads.
We systematically categorise them into four main approaches: image-driven, audio-driven, video-driven and others.
We provide an in-depth analysis of each method, highlighting their unique contributions, strengths, and limitations.
arXiv Detail & Related papers (2023-08-30T14:00:48Z)
- Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z)
- Adversarial Text-to-Image Synthesis: A Review [7.593633267653624]
We contextualize the state of the art of adversarial text-to-image synthesis models, their development since their inception five years ago, and propose a taxonomy based on the level of supervision.
We critically examine current strategies to evaluate text-to-image synthesis models, highlight shortcomings, and identify new areas of research, ranging from the development of better datasets and evaluation metrics to possible improvements in architectural design and model training.
This review complements previous surveys on generative adversarial networks with a focus on text-to-image synthesis, which we believe will help researchers further advance the field.
arXiv Detail & Related papers (2021-01-25T09:58:36Z)