Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions
- URL: http://arxiv.org/abs/2507.02900v1
- Date: Mon, 23 Jun 2025 06:49:42 GMT
- Title: Advancing Talking Head Generation: A Comprehensive Survey of Multi-Modal Methodologies, Datasets, Evaluation Metrics, and Loss Functions
- Authors: Vineet Kumar Rakesh, Soumya Mazumdar, Research Pratim Maity, Sarbajit Pal, Amitabha Das, Tapas Samanta
- Abstract summary: Talking Head Generation (THG) has emerged as a transformative technology in computer vision. This paper provides a comprehensive review of methodologies and frameworks for talking head generation. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Talking Head Generation (THG) has emerged as a transformative technology in computer vision, enabling the synthesis of realistic human faces synchronized with image, audio, text, or video inputs. This paper provides a comprehensive review of methodologies and frameworks for talking head generation, categorizing approaches into 2D-based, 3D-based, Neural Radiance Fields (NeRF)-based, diffusion-based, parameter-driven, and other techniques. It evaluates algorithms, datasets, and evaluation metrics while highlighting advancements in perceptual realism and technical efficiency critical for applications such as digital avatars, video dubbing, ultra-low bitrate video conferencing, and online education. The study identifies challenges such as reliance on pre-trained models, extreme pose handling, multilingual synthesis, and temporal consistency. Future directions include modular architectures, multilingual datasets, hybrid models blending pre-trained and task-specific layers, and innovative loss functions. By synthesizing existing research and exploring emerging trends, this paper aims to provide actionable insights for researchers and practitioners in the field of talking head generation. For the complete survey, code, and curated resource list, visit our GitHub repository: https://github.com/VineetKumarRakesh/thg.
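To make the survey's emphasis on loss functions and temporal consistency concrete, the following is a minimal sketch of a composite training loss of the kind common in THG work (pixel reconstruction plus a SyncNet-style lip-sync term plus a temporal term). The weights, tensor shapes, and the assumption of precomputed sync embeddings are illustrative choices of ours, not the survey's (or any specific paper's) exact formulation.

```python
# Illustrative composite talking-head-generation loss; all weights and the
# SyncNet-style embeddings are assumptions made for demonstration.
import torch
import torch.nn.functional as F

def composite_thg_loss(pred_frames, real_frames, audio_emb, lip_emb,
                       w_rec=1.0, w_sync=0.3, w_temp=0.1):
    """pred_frames, real_frames: (B, T, C, H, W) video tensors.
    audio_emb, lip_emb: (B, D) embeddings from a pretrained sync network,
    assumed precomputed here."""
    # Pixel-level reconstruction (L1 is common; perceptual losses are too).
    rec = F.l1_loss(pred_frames, real_frames)
    # Lip-sync term: pull audio and lip embeddings together (cosine distance).
    sync = 1.0 - F.cosine_similarity(audio_emb, lip_emb, dim=-1).mean()
    # Temporal consistency: penalize frame-to-frame changes that the
    # ground-truth video does not exhibit.
    pred_diff = pred_frames[:, 1:] - pred_frames[:, :-1]
    real_diff = real_frames[:, 1:] - real_frames[:, :-1]
    temp = F.l1_loss(pred_diff, real_diff)
    return w_rec * rec + w_sync * sync + w_temp * temp

# Smoke test with random tensors.
if __name__ == "__main__":
    B, T, C, H, W, D = 2, 8, 3, 64, 64, 128
    pred = torch.rand(B, T, C, H, W, requires_grad=True)
    real = torch.rand(B, T, C, H, W)
    loss = composite_thg_loss(pred, real, torch.randn(B, D), torch.randn(B, D))
    loss.backward()
    print(f"composite loss: {loss.item():.4f}")
```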
Related papers
- Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey [154.50661618628433]
3D reconstruction and view synthesis are foundational problems in computer vision, graphics, and immersive technologies such as augmented reality (AR), virtual reality (VR), and digital twins. Recent advances in feed-forward approaches, driven by deep learning, have revolutionized this field by enabling fast and generalizable 3D reconstruction and view synthesis.
arXiv Detail & Related papers (2025-07-19T06:13:25Z)
- Nexus: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision [50.23246260804145]
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities. Our pipeline consists of three main components: First, a modular framework enabling flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy that pre-trains audio-language alignment on the state-of-the-art vision-language model Qwen2.5-VL. Third, an audio synthesis pipeline that generates high-quality audio-text data from diverse real-world scenarios.
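As a rough illustration of the modular encoder-LLM-decoder idea this abstract describes, here is a hedged sketch in which each modality encoder, the language model, and the decoder are swappable components. All class names, dimensions, and wiring are illustrative assumptions, not the Nexus paper's actual code.

```python
# Minimal sketch of a modular encoder-LLM-decoder pipeline; toy components
# stand in for real audio/vision encoders and an LLM.
import torch
import torch.nn as nn

class OmniModalPipeline(nn.Module):
    def __init__(self, encoders: dict, llm: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)  # e.g., {"audio": ..., "vision": ...}
        self.llm = llm
        self.decoder = decoder

    def forward(self, inputs: dict) -> torch.Tensor:
        # Encode each available modality into a shared token space,
        # concatenate along the sequence axis, fuse with the LLM, decode.
        tokens = [self.encoders[m](x) for m, x in inputs.items()]
        fused = self.llm(torch.cat(tokens, dim=1))
        return self.decoder(fused)

d = 64
pipeline = OmniModalPipeline(
    encoders={"audio": nn.Linear(40, d), "vision": nn.Linear(196, d)},
    llm=nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d, 4, batch_first=True), 2),
    decoder=nn.Linear(d, 1000),  # e.g., vocabulary logits
)
out = pipeline({"audio": torch.randn(1, 10, 40),
                "vision": torch.randn(1, 5, 196)})
print(out.shape)  # torch.Size([1, 15, 1000])
```

Swapping an encoder or the decoder requires no change to the rest of the pipeline, which is the point of such a modular configuration.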
arXiv Detail & Related papers (2025-02-26T17:26:36Z)
- Automatic Scene Generation: State-of-the-Art Techniques, Models, Datasets, Challenges, and Future Prospects [0.94371657253557]
This survey focuses on techniques that leverage machine learning, deep learning, embedded systems, and natural language processing (NLP).
We categorize the models into four main types: Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), Transformers, and Diffusion Models.
We also review the most commonly used datasets, such as COCO-Stuff, Visual Genome, and MS-COCO, which are critical for training and evaluating these models.
arXiv Detail & Related papers (2024-09-14T19:09:10Z)
- A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing [8.171572460041823]
Talking head synthesis is an advanced method for generating portrait videos from a still image driven by specific content.
This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques.
arXiv Detail & Related papers (2024-06-15T08:14:59Z)
- A survey of synthetic data augmentation methods in computer vision [0.0]
This paper presents an extensive review of synthetic data augmentation techniques.
We focus on the important data generation and augmentation techniques, general scope of application and specific use-cases.
We provide a summary of common synthetic datasets for training computer vision models.
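For readers unfamiliar with the augmentation techniques such surveys catalogue, a minimal example using standard torchvision transforms follows; the specific transforms and parameters are common defaults chosen for illustration, not ones taken from the paper.

```python
# A typical image-augmentation pipeline built from standard torchvision
# transforms; parameters are common defaults, not the paper's.
from torchvision import transforms
from PIL import Image

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.RandomRotation(degrees=10),
    transforms.ToTensor(),
])

img = Image.new("RGB", (224, 224))  # stand-in for a real training image
x = augment(img)
print(x.shape)  # torch.Size([3, 224, 224])
```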
arXiv Detail & Related papers (2024-03-15T07:34:08Z)
- Data Augmentation in Human-Centric Vision [54.97327269866757]
This survey presents a comprehensive analysis of data augmentation techniques in human-centric vision tasks.
It delves into a wide range of research areas including person ReID, human parsing, human pose estimation, and pedestrian detection.
Our work categorizes data augmentation methods into two main types: data generation and data perturbation.
arXiv Detail & Related papers (2024-03-13T16:05:18Z)
- A Comprehensive Survey of 3D Dense Captioning: Localizing and Describing Objects in 3D Scenes [80.20670062509723]
3D dense captioning is an emerging vision-language bridging task that aims to generate detailed descriptions for 3D scenes.
It presents significant potential and challenges due to its closer representation of the real world compared to 2D visual captioning.
Despite the popularity and success of existing methods, there is a lack of comprehensive surveys summarizing the advancements in this field.
arXiv Detail & Related papers (2024-03-12T10:04:08Z)
- A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos [81.54357891748087]
We collect talking head videos generated from four generative methods.
We conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness.
Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures.
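Agreement between objective metric predictions and human annotations of this kind is conventionally quantified with rank (SRCC) and linear (PLCC) correlation. A minimal sketch follows; the scores are made-up placeholders, not the paper's data.

```python
# Quantifying metric-human agreement with Spearman (SRCC) and Pearson (PLCC)
# correlation; all numbers below are hypothetical placeholders.
import numpy as np
from scipy import stats

def metric_human_agreement(metric_scores, human_mos):
    """Correlations between a metric's predictions and human MOS ratings."""
    srcc, _ = stats.spearmanr(metric_scores, human_mos)
    plcc, _ = stats.pearsonr(metric_scores, human_mos)
    return srcc, plcc

metric = np.array([0.71, 0.55, 0.82, 0.40, 0.63])  # objective metric output
mos = np.array([3.9, 3.1, 4.4, 2.5, 3.6])          # human ratings (1-5 scale)
srcc, plcc = metric_human_agreement(metric, mos)
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```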
arXiv Detail & Related papers (2024-03-11T04:13:38Z)
- Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation [30.245348014602577]
We discuss the evolution of video generation from text, starting with animating MNIST digits and arriving at simulating the physical world with Sora.
Our review of the shortcomings of Sora-generated videos points to the need for more in-depth studies of the various enabling aspects of video generation.
We conclude that the study of text-to-video generation is still in its infancy, requiring contributions from the cross-disciplinary research community.
arXiv Detail & Related papers (2024-03-08T07:58:13Z)
- From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications [3.8301843990331887]
Recent advancements in deep learning and computer vision have led to a surge of interest in generating realistic talking heads.
We systematically categorize them into four main approaches: image-driven, audio-driven, video-driven, and others.
We provide an in-depth analysis of each method, highlighting their unique contributions, strengths, and limitations.
arXiv Detail & Related papers (2023-08-30T14:00:48Z)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook [151.49434496615427]
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompting capabilities at test time.
The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, holding interactive dialogues by asking questions about an image or video scene, or steering a robot's behavior through language instructions.
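As one concrete instance of such prompt-driven behavior, promptable segmentation with a bounding box can be sketched with the segment-anything (SAM) library. The checkpoint path, input image, and box coordinates below are placeholders, and this is our illustration rather than an example taken from the survey.

```python
# Box-prompted segmentation with segment-anything (SAM); the same model
# segments whatever object the prompt selects, with no retraining.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB image
predictor.set_image(image)

# The bounding-box prompt picks the object of interest at test time.
masks, scores, _ = predictor.predict(
    box=np.array([100, 100, 300, 300]),  # x0, y0, x1, y1
    multimask_output=False,
)
print(masks.shape, scores)
```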
arXiv Detail & Related papers (2023-07-25T17:59:18Z)
- Human Image Generation: A Comprehensive Survey [44.204029557298476]
In this paper, we divide human image generation techniques into three paradigms, i.e., data-driven methods, knowledge-guided methods and hybrid methods.
The advantages and characteristics of different methods are summarized in terms of model architectures.
Due to the wide application potentials, the typical downstream usages of synthesized human images are covered.
arXiv Detail & Related papers (2022-12-17T15:19:45Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonalities and unique characteristics of each data format, mainly spanning vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.