Related papers: GAIA: Zero-shot Talking Avatar Generation

URL: http://arxiv.org/abs/2311.15230v2
Date: Thu, 14 Mar 2024 11:49:40 GMT
Title: GAIA: Zero-shot Talking Avatar Generation
Authors: Tianyu He, Junliang Guo, Runyi Yu, Yuchi Wang, Jialiang Zhu, Kaikai An, Leyi Li, Xu Tan, Chunyu Wang, Han Hu, HsiangTao Wu, Sheng Zhao, Jiang Bian
Abstract summary: We introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. GAIA beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality. It is general and enables different applications like controllable talking avatar generation and text-instructed avatar generation.
Score: 64.78978434650416
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. Previous methods have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models, which limit the naturalness and diversity of the generated avatars. In this work, we introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. In light of the observation that the speech only drives the motion of the avatar while the appearance of the avatar and the background typically remain the same throughout the entire video, we divide our approach into two stages: 1) disentangling each frame into motion and appearance representations; 2) generating motion sequences conditioned on the speech and reference portrait image. We collect a large-scale high-quality talking avatar dataset and train the model on it with different scales (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA as 1) the resulting model beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable since larger models yield better results; 3) it is general and enables different applications like controllable talking avatar generation and text-instructed avatar generation.

Related papers

JoyAvatar: Unlocking Highly Expressive Avatars via Harmonized Text-Audio Conditioning [18.72712280434528]
JoyAvatar is a framework capable of generating long-duration avatar videos. We introduce a twin-teacher enhanced training algorithm that enables the model to transfer inherent text-controllability. During training, we dynamically modulate the strength of multi-modal conditions.
arXiv Detail & Related papers (2026-01-31T13:00:57Z)
Audio-Driven Universal Gaussian Head Avatars [66.56656075831954]
We introduce the first method for audio-driven universal photorealistic avatar synthesis. It combines a person-agnostic speech model with our novel Universal Head Avatar Prior. Our method is the first general audio-driven avatar model that can account for detailed appearance modeling and rendering.
arXiv Detail & Related papers (2025-09-23T12:46:43Z)
SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents [91.26239311240873]
SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance.
arXiv Detail & Related papers (2025-06-05T03:49:01Z)
EVA: Expressive Virtual Avatars from Multi-view Videos [51.33851869426057]
We introduce Expressive Virtual Avatars (EVA), an actor-specific, fully controllable, and expressive human avatar framework. EVA achieves high-fidelity, lifelike renderings in real time while enabling independent control of facial expressions, body movements, and hand gestures. This work represents a significant advancement towards fully drivable digital human models.
arXiv Detail & Related papers (2025-05-21T11:22:52Z)
Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis [44.503709089687014]
Speech-driven 3D head avatars must articulate their lips in accordance with speech. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions. We propose THUNDER, a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production.
arXiv Detail & Related papers (2025-04-18T00:24:52Z)
GenEAva: Generating Cartoon Avatars with Fine-Grained Facial Expressions from Realistic Diffusion-based Faces [15.26953477181137]
We propose a novel framework, GenEAva, for generating high-quality cartoon avatars with fine-grained facial expressions. Our approach fine-tunes a state-of-the-art text-to-image diffusion model to synthesize highly detailed and expressive facial expressions. We introduce the first expressive cartoon avatar dataset, GenEAva 1.0, specifically designed to capture 135 fine-grained facial expressions.
arXiv Detail & Related papers (2025-04-10T17:54:02Z)
Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior [31.780579293685797]
We present Vid2Avatar-Pro, a method to create photorealistic and animatable 3D human avatars from monocular in-the-wild videos.
arXiv Detail & Related papers (2025-03-03T14:45:35Z)
One Shot, One Talk: Whole-body Talking Avatar from a Single Image [28.932709370417232]
Building realistic and animatable avatars still requires minutes of multi-view or monocular self-rotating videos. We propose a novel pipeline that tackles two critical issues: 1) complex dynamic modeling and 2) generalization to novel gestures and expressions. Our method enables the creation of a photorealistic, precisely animatable, and expressive whole-body talking avatar from just a single image.
arXiv Detail & Related papers (2024-12-02T04:27:41Z)
GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio. We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z)
GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations [54.94362657501809]
We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real time. At the core of our method is a hierarchical representation of head models that captures the complex dynamics of facial expressions and head movements. We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework.
arXiv Detail & Related papers (2024-09-18T13:05:43Z)
DEGAS: Detailed Expressions on Full-Body Gaussian Avatars [13.683836322899953]
We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. We propose to adopt the expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars.
arXiv Detail & Related papers (2024-08-20T06:52:03Z)
NPGA: Neural Parametric Gaussian Avatars [46.52887358194364]
We propose a data-driven approach to create high-fidelity controllable avatars from multi-view video recordings. We build our method around 3D Gaussian splatting for its highly efficient rendering and to inherit the topological flexibility of point clouds. We evaluate our method on the public NeRSemble dataset, demonstrating that NPGA significantly outperforms the previous state-of-the-art avatars on the self-reenactment task by 2.6 PSNR.
arXiv Detail & Related papers (2024-05-29T17:58:09Z)
GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image [89.70322127648349]
We propose a generic avatar editing approach that can be universally applied to various 3DMM driving volumetric head avatars. To achieve this goal, we design a novel expression-aware modification generative model, which lifts 2D editing from a single image to a consistent 3D modification field.
arXiv Detail & Related papers (2024-04-02T17:58:35Z)
DivAvatar: Diverse 3D Avatar Generation with a Single Prompt [95.9978722953278]
DivAvatar is a framework that generates diverse avatars from a single text prompt. It has two key designs that help achieve generation diversity and visual quality. Extensive experiments show that DivAvatar is highly versatile in generating avatars of diverse appearances.
arXiv Detail & Related papers (2024-02-27T08:10:31Z)
Reality's Canvas, Language's Brush: Crafting 3D Avatars from Monocular Video [14.140380599168628]
ReCaLaB is a pipeline that learns high-fidelity 3D human avatars from just a single RGB video. A pose-conditioned NeRF is optimized to volumetrically represent a human subject in canonical T-pose. An image-conditioned diffusion model thereby helps to animate appearance and pose of the 3D avatar to create video sequences with previously unseen human motion.
arXiv Detail & Related papers (2023-12-08T01:53:06Z)
AvatarFusion: Zero-shot Generation of Clothing-Decoupled 3D Avatars Using 2D Diffusion [34.609403685504944]
We present AvatarFusion, a framework for zero-shot text-to-avatar generation. We use a latent diffusion model to provide pixel-level guidance for generating human-realistic avatars. We also introduce a novel optimization method, called Pixel-Semantics Difference-Sampling (PS-DS), which semantically separates the generation of body and clothes.
arXiv Detail & Related papers (2023-07-13T02:19:56Z)
AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation [14.062402203105712]
AvatarBooth is a novel method for generating high-quality 3D avatars using text prompts or specific images. Our key contribution is the precise avatar generation control by using dual fine-tuned diffusion models. We present a multi-resolution rendering strategy that facilitates coarse-to-fine supervision of 3D avatar generation.
arXiv Detail & Related papers (2023-06-16T14:18:51Z)
OTAvatar: One-shot Talking Face Avatar with Controllable Tri-plane Rendering [81.55960827071661]
Controllability, generalizability and efficiency are the major objectives of constructing face avatars represented by a neural implicit field. We propose One-shot Talking face Avatar (OTAvatar), which constructs face avatars by a generalized controllable tri-plane rendering solution.
arXiv Detail & Related papers (2023-03-26T09:12:03Z)
AvatarGen: a 3D Generative Model for Animatable Human Avatars [108.11137221845352]
AvatarGen is the first method that enables not only non-rigid human generation with diverse appearance but also full control over poses and viewpoints. To model non-rigid dynamics, it introduces a deformation network to learn pose-dependent deformations in the canonical space. Our method can generate animatable human avatars with high-quality appearance and geometry modeling, significantly outperforming previous 3D GANs.
arXiv Detail & Related papers (2022-08-01T01:27:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.