ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations
- URL: http://arxiv.org/abs/2411.13089v2
- Date: Mon, 25 Nov 2024 21:12:25 GMT
- Title: ESARM: 3D Emotional Speech-to-Animation via Reward Model from Automatically-Ranked Demonstrations
- Authors: Xulong Zhang, Xiaoyang Qu, Haoxiang Shi, Chunguang Xiao, Jianzong Wang
- Abstract summary: This paper proposes a novel 3D speech-to-animation (STA) generation framework to address the shortcomings of existing models.
We introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions.
We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework.
- Score: 22.85503397110192
- License:
- Abstract: This paper proposes a novel 3D speech-to-animation (STA) generation framework designed to address the shortcomings of existing models in producing diverse and emotionally resonant animations. Current STA models often generate animations that lack emotional depth and variety, failing to align with human expectations. To overcome these limitations, we introduce a novel STA model coupled with a reward model. This combination enables the decoupling of emotion and content under audio conditions through a cross-coupling training approach. Additionally, we develop a training methodology that leverages automatic quality evaluation of generated facial animations to guide the reinforcement learning process. This methodology encourages the STA model to explore a broader range of possibilities, resulting in the generation of diverse and emotionally expressive facial animations of superior quality. We conduct extensive empirical experiments on a benchmark dataset, and the results validate the effectiveness of our proposed framework in generating high-quality, emotionally rich 3D animations that are better aligned with human preferences.
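The abstract gives no implementation details; as a rough illustration of the two components it names, the sketch below pairs a reward model trained with a pairwise ranking loss on automatically-ranked animation pairs with a REINFORCE-style update that uses the learned reward to fine-tune the STA generator. All module names, tensor shapes, and hyperparameters are assumptions, not the authors' code.

```python
# Hedged sketch: (1) a reward model fit to automatically-ranked animation pairs,
# (2) a REINFORCE-style update that uses that reward to fine-tune the STA generator.
# Names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AnimationRewardModel(nn.Module):
    """Maps an animation clip (frames x blendshape coefficients) to a scalar reward."""

    def __init__(self, coeff_dim: int = 52, hidden: int = 256):
        super().__init__()
        self.encoder = nn.GRU(coeff_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, anim: torch.Tensor) -> torch.Tensor:
        # anim: (batch, frames, coeff_dim)
        _, h = self.encoder(anim)
        return self.head(h[-1]).squeeze(-1)  # (batch,)


def ranking_loss(rm: AnimationRewardModel, better: torch.Tensor, worse: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: the automatically preferred clip should score higher."""
    return -F.logsigmoid(rm(better) - rm(worse)).mean()


def rl_finetune_step(generator, rm, audio, emotion, opt):
    """One policy-gradient step: sample an animation, score it, reinforce high-reward samples."""
    anim, log_prob = generator.sample(audio, emotion)  # hypothetical stochastic STA decoder
    with torch.no_grad():
        reward = rm(anim)
    loss = -(reward * log_prob).mean()                 # REINFORCE surrogate objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In practice a baseline (advantage) term and a penalty keeping the fine-tuned generator close to its pretrained behavior would usually be added; the abstract does not specify these details.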
Related papers
- MagicArticulate: Make Your 3D Models Articulation-Ready [109.35703811628045]
We present MagicArticulate, an effective framework that automatically transforms static 3D models into articulation-ready assets.
Our key contributions are threefold. First, we introduce Articulation-XL, a large-scale benchmark containing over 33k 3D models with high-quality articulation annotations, carefully curated from Objaverse-XL.
Extensive experiments demonstrate that MagicArticulate significantly outperforms existing methods across diverse object categories.
arXiv Detail & Related papers (2025-02-17T18:53:27Z)
- Make-It-Animatable: An Efficient Framework for Authoring Animation-Ready 3D Characters [86.13319549186959]
We present Make-It-Animatable, a novel data-driven method to make any 3D humanoid model ready for character animation in less than one second.
Our framework generates high-quality blend weights, bones, and pose transformations.
Compared to existing methods, our approach demonstrates significant improvements in both quality and speed.
arXiv Detail & Related papers (2024-11-27T10:18:06Z)
- ProbTalk3D: Non-Deterministic Emotion Controllable Speech-Driven 3D Facial Animation Synthesis Using VQ-VAE [0.0]
We argue that emotions and non-determinism are crucial to generate diverse and emotionally-rich facial animations.
We propose ProbTalk3D, a non-deterministic neural network approach for emotion controllable speech-driven 3D facial animation synthesis.
arXiv Detail & Related papers (2024-09-12T11:53:05Z)
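For context on the VQ-VAE backbone named in the ProbTalk3D entry above, the following is a generic vector-quantization step (nearest-codebook lookup with a straight-through estimator); it is an illustration under assumptions, not ProbTalk3D's implementation.

```python
# Generic VQ-VAE quantization step (nearest-code lookup + straight-through
# estimator), the kind of discrete motion codebook such methods build on.
# Illustrative only; sizes and names are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 256, code_dim: int = 128, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-loss weight

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, frames, code_dim) continuous encoder outputs for a motion clip
        flat = z_e.reshape(-1, z_e.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z_e)
        # Codebook and commitment terms of the standard VQ-VAE objective
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        z_q = z_e + (z_q - z_e).detach()   # straight-through gradient to the encoder
        return z_q, idx.view(z_e.shape[:-1]), vq_loss
```

Sampling different code indices at inference is one common way such models obtain non-deterministic, diverse animations for the same speech input.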
- EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars [36.96390906514729]
The MegaPortraits model has demonstrated state-of-the-art results in this domain.
We introduce our EMOPortraits model, which enhances the model's capability to faithfully support intense, asymmetric facial expressions.
We propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions.
arXiv Detail & Related papers (2024-04-29T21:23:29Z)
- CSTalk: Correlation Supervised Speech-driven 3D Emotional Facial Animation Generation [13.27632316528572]
Speech-driven 3D facial animation technology has been developed for years, but its practical applications still fall short of expectations.
The main challenges lie in data limitations, lip alignment, and the naturalness of facial expressions.
This paper proposes a method called CSTalk that models the correlations among different regions of facial movements and supervises the training of the generative model to generate realistic expressions.
arXiv Detail & Related papers (2024-04-29T11:19:15Z)
- Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance [48.986552871497]
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation.
By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals.
Our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE.
arXiv Detail & Related papers (2024-03-26T18:41:07Z)
- AnimateMe: 4D Facial Expressions via Diffusion Models [72.63383191654357]
Recent advances in diffusion models have enhanced the capabilities of generative models in 2D animation.
We employ Graph Neural Networks (GNNs) as denoising diffusion models in a novel approach, formulating the diffusion process directly on the mesh space.
This facilitates the generation of facial deformations through a mesh-diffusion-based model.
arXiv Detail & Related papers (2024-03-25T21:40:44Z)
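To make the mesh-space diffusion idea in the AnimateMe entry above concrete, here is a generic DDPM-style reverse step on per-vertex displacements, using a simple mean-aggregation graph convolution as the denoiser; it is a sketch under assumptions, not AnimateMe's architecture.

```python
# Generic DDPM-style reverse step on per-vertex displacements, with a simple
# mean-aggregation graph convolution over the mesh adjacency as the denoiser.
# Illustrative sketch only; this is not AnimateMe's architecture.
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin_self = nn.Linear(in_dim, out_dim)
        self.lin_nbr = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (V, in_dim) vertex features; adj: (V, V) row-normalized mesh adjacency
        return self.lin_self(x) + self.lin_nbr(adj @ x)


class MeshDenoiser(nn.Module):
    """Predicts the noise added to vertex displacements at diffusion step t."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.conv1 = GraphConv(3 + 1, hidden)   # xyz displacement + timestep feature
        self.conv2 = GraphConv(hidden, 3)

    def forward(self, x_t: torch.Tensor, t: float, adj: torch.Tensor) -> torch.Tensor:
        t_feat = torch.full((x_t.size(0), 1), t)
        h = torch.relu(self.conv1(torch.cat([x_t, t_feat], dim=-1), adj))
        return self.conv2(h, adj)               # predicted noise, (V, 3)


@torch.no_grad()
def reverse_step(model, x_t, t, adj, alphas, alpha_bars):
    """Standard DDPM update x_t -> x_{t-1}; alphas/alpha_bars are 1-D schedule tensors."""
    eps = model(x_t, float(t) / len(alphas), adj)
    mean = (x_t - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + (1 - alphas[t]).sqrt() * noise   # sigma_t^2 = beta_t choice
```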
- EmoVOCA: Speech-Driven Emotional 3D Talking Heads [12.161006152509653]
We propose an innovative data-driven technique for creating a synthetic dataset, called EmoVOCA.
We then designed and trained an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the face with audio-synchronized lip movements and expressive traits.
arXiv Detail & Related papers (2024-03-19T16:33:26Z)
- Real-time Animation Generation and Control on Rigged Models via Large Language Models [50.034712575541434]
We introduce a novel method for real-time animation control and generation on rigged models using natural language input.
We embed a large language model (LLM) in Unity to output structured texts that can be parsed into diverse and realistic animations.
arXiv Detail & Related papers (2023-10-27T01:36:35Z)
- Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
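The entry above only names the mechanism; as a rough reference, the block below shows the basic cross-attention operation such a design builds on, with mesh-vertex queries attending to per-frame audio features. The hierarchy, dimensions, and output head are assumptions, not the paper's model.

```python
# Basic cross-attention from mesh-vertex queries to per-frame audio features --
# the core operation an audio-vertex attention design builds on. The hierarchy,
# dimensions, and output head here are assumptions, not the paper's model.
import torch
import torch.nn as nn


class AudioVertexAttention(nn.Module):
    def __init__(self, vert_dim: int = 64, audio_dim: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(
            embed_dim=vert_dim, num_heads=n_heads,
            kdim=audio_dim, vdim=audio_dim, batch_first=True,
        )
        self.to_offset = nn.Linear(vert_dim, 3)  # per-vertex displacement head (assumed)

    def forward(self, vert_feats: torch.Tensor, audio_feats: torch.Tensor) -> torch.Tensor:
        # vert_feats: (B, V, vert_dim) vertex embeddings used as queries
        # audio_feats: (B, T, audio_dim) audio frame features used as keys/values
        attended, _ = self.attn(vert_feats, audio_feats, audio_feats)
        return self.to_offset(attended)          # (B, V, 3) predicted vertex offsets


# Smoke test with random tensors: 2 clips, 468 vertices, 50 audio frames.
offsets = AudioVertexAttention()(torch.randn(2, 468, 64), torch.randn(2, 50, 128))
```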