Related papers: VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS

VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS

URL: http://arxiv.org/abs/2502.10729v2
Date: Tue, 18 Feb 2025 08:07:39 GMT
Title: VarGes: Improving Variation in Co-Speech 3D Gesture Generation via StyleCLIPS
Authors: Ming Meng, Ke Mu, Yonggui Zhu, Zhe Zhu, Haoyu Sun, Heyang Yan, Zhaoxin Fan,
Abstract summary: VarGes is a novel variation-driven framework designed to enhance co-speech gesture generation.<n>Our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness.
Score: 4.996271098355553
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating expressive and diverse human gestures from audio is crucial in fields like human-computer interaction, virtual reality, and animation. Though existing methods have achieved remarkable performance, they often exhibit limitations due to constrained dataset diversity and the restricted amount of information derived from audio inputs. To address these challenges, we present VarGes, a novel variation-driven framework designed to enhance co-speech gesture generation by integrating visual stylistic cues while maintaining naturalness. Our approach begins with the Variation-Enhanced Feature Extraction (VEFE) module, which seamlessly incorporates \textcolor{blue}{style-reference} video data into a 3D human pose estimation network to extract StyleCLIPS, thereby enriching the input with stylistic information. Subsequently, we employ the Variation-Compensation Style Encoder (VCSE), a transformer-style encoder equipped with an additive attention mechanism pooling layer, to robustly encode diverse StyleCLIPS representations and effectively manage stylistic variations. Finally, the Variation-Driven Gesture Predictor (VDGP) module fuses MFCC audio features with StyleCLIPS encodings via cross-attention, injecting this fused data into a cross-conditional autoregressive model to modulate 3D human gesture generation based on audio input and stylistic clues. The efficacy of our approach is validated on benchmark datasets, where it outperforms existing methods in terms of gesture diversity and naturalness. The code and video results will be made publicly available upon acceptance:https://github.com/mookerr/VarGES/ .

Related papers

AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts [41.08576055846111]
Stereo-Talker is a novel one-shot audio-driven human video synthesis system. It generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control.
arXiv Detail & Related papers (2024-10-31T11:32:33Z)
SVP: Style-Enhanced Vivid Portrait Talking Head Diffusion Model [64.28263381647628]
Talking Head Generation (THG) is an important task with broad application prospects in various fields such as digital humans, film production, and virtual reality.<n>We propose a novel framework named Style-Enhanced Vivid Portrait (SVP) which fully leverages style-related information in THG.<n>Our model generates diverse, vivid, and high-quality videos with flexible control over intrinsic styles, outperforming existing state-of-the-art methods.
arXiv Detail & Related papers (2024-09-05T06:27:32Z)
Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation [29.87407471246318]
This research delves into the complexities of synchronizing facial movements and creating visually appealing, temporally consistent animations. Our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module. The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.
arXiv Detail & Related papers (2024-06-13T04:33:20Z)
Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference [5.711221299998126]
Persona-Gestor is a novel end-to-end generative model designed to generate highly personalized 3D full-body gestures. The model combines a fuzzy feature extractor and a non-autoregressive Adaptive Layer Normalization (AdaLN) transformer diffusion architecture. Persona-Gestor improves the system's usability and generalization capabilities.
arXiv Detail & Related papers (2024-03-16T04:40:10Z)
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling [57.08286593059137]
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures. We first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T02:25:41Z)
SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets. We propose a speech-driven 3D facial animation with a diffusion model (SAiD), a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models. APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
DiffPoseTalk: Speech-Driven Stylistic 3D Facial Animation and Head Pose Generation via Diffusion Models [24.401443462720135]
We propose DiffPoseTalk, a generative framework based on the diffusion model combined with a style encoder. In particular, our style includes the generation of head poses, thereby enhancing user perception. We address the shortage of scanned 3D talking face data by training our model on reconstructed 3DMM parameters from a high-quality, in-the-wild audio-visual dataset.
arXiv Detail & Related papers (2023-09-30T17:01:18Z)
End-to-end Audio-visual Speech Recognition with Conformers [65.30276363777514]
We present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer) In particular, the audio and visual encoders learn to extract features directly from raw pixels and audio waveforms. We show that our proposed models raise the state-of-the-art performance by a large margin in audio-only, visual-only, and audio-visual experiments.
arXiv Detail & Related papers (2021-02-12T18:00:08Z)
Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task. We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing-semantic graph representations for every frame. Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.