From Pixels to Portraits: A Comprehensive Survey of Talking Head
Generation Techniques and Applications
- URL: http://arxiv.org/abs/2308.16041v1
- Date: Wed, 30 Aug 2023 14:00:48 GMT
- Title: From Pixels to Portraits: A Comprehensive Survey of Talking Head
Generation Techniques and Applications
- Authors: Shreyank N Gowda, Dheeraj Pandey, Shashank Narayana Gowda
- Abstract summary: Recent advancements in deep learning and computer vision have led to a surge of interest in generating realistic talking heads.
We systematically categorise talking head generation methods into four main approaches: image-driven, audio-driven, video-driven and others.
We provide an in-depth analysis of each method, highlighting their unique contributions, strengths, and limitations.
- Score: 3.8301843990331887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in deep learning and computer vision have led to a surge
of interest in generating realistic talking heads. This paper presents a
comprehensive survey of state-of-the-art methods for talking head generation.
We systematically categorise them into four main approaches: image-driven,
audio-driven, video-driven and others (including neural radiance fields (NeRF)
and 3D-based methods). We provide an in-depth analysis of each method,
highlighting their unique contributions, strengths, and limitations.
Furthermore, we thoroughly compare publicly available models, evaluating them
on key aspects such as inference time and human-rated quality of the generated
outputs. Our aim is to provide a clear and concise overview of the current
landscape in talking head generation, elucidating the relationships between
different approaches and identifying promising directions for future research.
This survey will serve as a valuable reference for researchers and
practitioners interested in this rapidly evolving field.
Related papers
- Event-based Stereo Depth Estimation: A Survey [12.711235562366898]
Stereopsis has widespread appeal in robotics as it is the predominant way by which living beings perceive depth to navigate our 3D world.
Event cameras are novel bio-inspired sensors that detect per-pixel brightness changes asynchronously, with very high temporal resolution and high dynamic range.
The high temporal precision also benefits stereo matching, making disparity (depth) estimation a popular research area for event cameras ever since its inception.
arXiv Detail & Related papers (2024-09-26T09:43:50Z)
- A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing [8.171572460041823]
Talking head synthesis is an advanced method for generating portrait videos from a still image driven by specific content.
This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driving mechanisms, and editing techniques.
arXiv Detail & Related papers (2024-06-15T08:14:59Z)
- Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation.
Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
- Deepfake Generation and Detection: A Benchmark and Survey [134.19054491600832]
Deepfake is a technology dedicated to creating highly realistic facial images and videos under specific conditions.
This survey comprehensively reviews the latest developments in deepfake generation and detection.
We focus on researching four representative deepfake fields: face swapping, face reenactment, talking face generation, and facial attribute editing.
arXiv Detail & Related papers (2024-03-26T17:12:34Z)
- A Comparative Study of Perceptual Quality Metrics for Audio-driven Talking Head Videos [81.54357891748087]
We collect talking head videos generated from four generative methods.
We conduct controlled psychophysical experiments on visual quality, lip-audio synchronization, and head movement naturalness.
Our experiments validate consistency between model predictions and human annotations, identifying metrics that align better with human opinions than widely-used measures.
arXiv Detail & Related papers (2024-03-11T04:13:38Z)
- Vision+X: A Survey on Multimodal Learning in the Light of Data [64.03266872103835]
Multimodal machine learning that incorporates data from various sources has become an increasingly popular research area.
We analyze the commonness and uniqueness of each data format, mainly covering vision, audio, text, and motion.
We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels.
arXiv Detail & Related papers (2022-10-05T13:14:57Z)
- Neural Fields in Visual Computing and Beyond [54.950885364735804]
Recent advances in machine learning have created increasing interest in solving visual computing problems using coordinate-based neural networks.
Neural fields have seen successful application in the synthesis of 3D shapes and images, animation of human bodies, 3D reconstruction, and pose estimation.
This report provides context, mathematical grounding, and an extensive review of literature on neural fields.
arXiv Detail & Related papers (2021-11-22T18:57:51Z)
- Recent Advances in Monocular 2D and 3D Human Pose Estimation: A Deep Learning Perspective [69.44384540002358]
We provide a comprehensive and holistic 2D-to-3D perspective to tackle this problem.
We categorize the mainstream and milestone approaches since the year 2014 under unified frameworks.
We also summarize the pose representation styles, benchmarks, evaluation metrics, and the quantitative performance of popular approaches.
arXiv Detail & Related papers (2021-04-23T11:07:07Z)
- Deep Gait Recognition: A Survey [15.47582611826366]
Gait recognition is an appealing biometric modality which aims to identify individuals based on the way they walk.
Deep learning has reshaped the research landscape in this area since 2015 through the ability to automatically learn discriminative representations.
We present a comprehensive overview of breakthroughs and recent developments in gait recognition with deep learning.
arXiv Detail & Related papers (2021-02-18T18:49:28Z)
- What comprises a good talking-head video generation?: A Survey and Benchmark [40.26689818789428]
We present a benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies.
We propose new metrics or select the most appropriate ones to evaluate results in what we consider as desired properties for a good talking-head video.
arXiv Detail & Related papers (2020-05-07T01:58:05Z)