InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
- URL: http://arxiv.org/abs/2405.15758v1
- Date: Fri, 24 May 2024 17:53:54 GMT
- Title: InstructAvatar: Text-Guided Emotion and Motion Control for Avatar Generation
- Authors: Yuchi Wang, Junliang Guo, Jianhong Bai, Runyi Yu, Tianyu He, Xu Tan, Xu Sun, Jiang Bian
- Abstract summary: In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars.
Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars.
Experimental results demonstrate that InstructAvatar produces results that align well with both conditions.
- Score: 39.235962838952624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent talking avatar generation models have made strides in achieving realistic and accurate lip synchronization with the audio, but often fall short in controlling and conveying detailed expressions and emotions of the avatar, making the generated video less vivid and controllable. In this paper, we propose a novel text-guided approach for generating emotionally expressive 2D avatars, offering fine-grained control, improved interactivity, and generalizability to the resulting video. Our framework, named InstructAvatar, leverages a natural language interface to control the emotion as well as the facial motion of avatars. Technically, we design an automatic annotation pipeline to construct an instruction-video paired training dataset, equipped with a novel two-branch diffusion-based generator to predict avatars with audio and text instructions at the same time. Experimental results demonstrate that InstructAvatar produces results that align well with both conditions, and outperforms existing methods in fine-grained emotion control, lip-sync quality, and naturalness. Our project page is https://wangyuchi369.github.io/InstructAvatar/.
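The abstract describes a two-branch generator that conditions denoising on audio and text instructions simultaneously. As a purely illustrative toy (the linear "branches", dimensions, and update rule below are assumptions for exposition, not the paper's actual architecture), the dual-conditioning idea can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

def branch(latent, cond, w):
    # Toy conditioning branch: project the condition into latent space and add it.
    # (The real model would use learned cross-attention, not a fixed linear map.)
    return latent + cond @ w

def two_branch_denoise_step(noisy_latent, audio_emb, text_emb, w_audio, w_text):
    """One illustrative refinement step conditioned on both signals."""
    h = branch(noisy_latent, audio_emb, w_audio)  # audio branch (lip sync)
    h = branch(h, text_emb, w_text)               # text branch (emotion/motion)
    return 0.9 * h  # stand-in for the learned denoising update

# Hypothetical dimensions for the sketch
d_lat, d_aud, d_txt = 8, 4, 6
w_a = rng.normal(size=(d_aud, d_lat)) * 0.1
w_t = rng.normal(size=(d_txt, d_lat)) * 0.1

x = rng.normal(size=d_lat)       # noisy avatar-frame latent
audio = rng.normal(size=d_aud)   # per-frame audio feature
text = rng.normal(size=d_txt)    # embedded instruction, e.g. "smile warmly"

for _ in range(10):              # iterative refinement, as in diffusion sampling
    x = two_branch_denoise_step(x, audio, text, w_a, w_t)
```

The point of the sketch is only that both conditions enter every refinement step, so lip motion (audio) and expression (text) are controlled jointly rather than in separate passes.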
Related papers
- EmoFace: Audio-driven Emotional 3D Face Animation [3.573880705052592]
EmoFace is a novel audio-driven methodology for creating facial animations with vivid emotional dynamics.
Our approach can generate facial expressions with multiple emotions, and has the ability to generate random yet natural blinks and eye movements.
Our proposed methodology can be applied in producing dialogues animations of non-playable characters in video games, and driving avatars in virtual reality environments.
arXiv Detail & Related papers (2024-07-17T11:32:16Z)
- Disentangled Clothed Avatar Generation from Text Descriptions [39.5476255730693]
We introduce a novel text-to-avatar generation method that separately generates the human body and the clothes.
Our approach achieves higher texture and geometry quality and better semantic alignment with text prompts.
arXiv Detail & Related papers (2023-12-08T18:43:12Z)
- AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text [71.09533176800707]
AvatarStudio is a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars.
By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars.
arXiv Detail & Related papers (2023-11-29T18:59:32Z)
- GAIA: Zero-shot Talking Avatar Generation [64.78978434650416]
We introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation.
GAIA beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality.
It is general and enables different applications like controllable talking avatar generation and text-instructed avatar generation.
arXiv Detail & Related papers (2023-11-26T08:04:43Z)
- MagicAvatar: Multimodal Avatar Generation and Animation [70.55750617502696]
MagicAvatar is a framework for multimodal video generation and animation of human avatars.
It disentangles avatar video generation into two stages: multimodal-to-motion and motion-to-video generation.
We demonstrate the flexibility of MagicAvatar through various applications, including text-guided and video-guided avatar generation.
arXiv Detail & Related papers (2023-08-28T17:56:18Z)
- AvatarBooth: High-Quality and Customizable 3D Human Avatar Generation [14.062402203105712]
AvatarBooth is a novel method for generating high-quality 3D avatars using text prompts or specific images.
Our key contribution is the precise avatar generation control by using dual fine-tuned diffusion models.
We present a multi-resolution rendering strategy that facilitates coarse-to-fine supervision of 3D avatar generation.
arXiv Detail & Related papers (2023-06-16T14:18:51Z)
- Emotional Speech-Driven Animation with Content-Emotion Disentanglement [51.34635009347183]
We propose EMOTE, which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion.
EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data.
arXiv Detail & Related papers (2023-06-15T09:31:31Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has yet to be fully realized technically.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- READ Avatars: Realistic Emotion-controllable Audio Driven Avatars [11.98034899127065]
We present READ Avatars, a 3D-based approach for generating 2D avatars driven by audio input with direct and granular control over the emotion.
Previous methods are unable to achieve realistic animation due to the many-to-many nature of audio to expression mappings.
This removes the smoothing effect of regression-based models and helps to improve the realism and expressiveness of the generated avatars.
arXiv Detail & Related papers (2023-03-01T18:56:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.