Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation
- URL: http://arxiv.org/abs/2409.17674v1
- Date: Thu, 26 Sep 2024 09:33:20 GMT
- Title: Self-Supervised Learning of Deviation in Latent Representation for Co-speech Gesture Video Generation
- Authors: Huan Yang, Jiahui Chen, Chaofan Ding, Runhua Shi, Siyu Xiong, Qingqi Hong, Xiaoqi Mo, Xinhan Di
- Abstract summary: We explore the representation of gestures in co-speech with a focus on self-supervised representation and pixel-level motion deviation.
Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation.
Results of our first experiment demonstrate that our method enhances the quality of generated videos.
- Score: 8.84657964527764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gestures are pivotal in enhancing co-speech communication. While recent works have mostly focused on point-level motion transformation or fully supervised motion representations through data-driven approaches, we explore the representation of gestures in co-speech, with a focus on self-supervised representation and pixel-level motion deviation, utilizing a diffusion model that incorporates latent motion features. Our approach leverages self-supervised deviation in latent representation to facilitate hand gesture generation, which is crucial for producing realistic gesture videos. Results of our first experiment demonstrate that our method enhances the quality of generated videos, with improvements of 2.7% to 4.5% in FGD, DIV, and FVD, 8.1% in PSNR, and 2.5% in SSIM over current state-of-the-art methods.
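As a rough illustration of the core idea, the sketch below conditions a toy latent diffusion step on a self-supervised deviation between a reference frame and a target frame. The module names, shapes, and simplified noise schedule are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of self-supervised latent deviation as a diffusion condition.
# Module names (FrameEncoder, Denoiser) and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Maps an RGB frame (B, 3, H, W) to a latent vector (B, D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim),
        )
    def forward(self, x):
        return self.net(x)

class Denoiser(nn.Module):
    """Predicts the noise added to a latent, conditioned on audio and deviation."""
    def __init__(self, dim=256, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + dim + audio_dim + 1, 512), nn.SiLU(),
            nn.Linear(512, dim),
        )
    def forward(self, noisy_latent, deviation, audio, t):
        return self.net(torch.cat([noisy_latent, deviation, audio, t], dim=-1))

encoder, denoiser = FrameEncoder(), Denoiser()
ref_frame = torch.randn(2, 3, 128, 128)    # reference (resting) frame
tgt_frame = torch.randn(2, 3, 128, 128)    # target frame containing the gesture
audio_feat = torch.randn(2, 128)           # per-frame audio feature (assumed given)

# Self-supervised signal: the deviation between target and reference latents
# comes from the video itself, with no gesture labels.
z_ref, z_tgt = encoder(ref_frame), encoder(tgt_frame)
deviation = z_tgt - z_ref

# DDPM-style noising of the target latent, then noise prediction
t = torch.rand(2, 1)                       # diffusion timestep in [0, 1]
noise = torch.randn_like(z_tgt)
noisy = torch.sqrt(1 - t) * z_tgt + torch.sqrt(t) * noise   # simplified schedule
loss = nn.functional.mse_loss(denoiser(noisy, deviation, audio_feat, t), noise)
loss.backward()
```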
Related papers
- Audio-driven Gesture Generation via Deviation Feature in the Latent Space [2.8952735126314733]
We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation.
Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation.
Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
arXiv Detail & Related papers (2025-03-27T15:37:16Z)
- High Quality Human Image Animation using Regional Supervision and Motion Blur Condition [97.97432499053966]
First, we leverage regional supervision on detailed regions to enhance face and hand faithfulness.
Second, we model the motion blur explicitly to further improve the appearance quality.
Third, we explore novel training strategies for high-resolution human animation to improve the overall fidelity.
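Regional supervision of this kind is often implemented as a reconstruction loss that upweights face and hand regions; the sketch below is a generic illustration with assumed mask sources and weights, not this paper's implementation.

```python
# Rough illustration of region-weighted supervision: pixels inside face/hand masks
# contribute more to the reconstruction loss. Weights and mask source are assumed.
import torch

def region_weighted_l1(pred, target, face_mask, hand_mask,
                       base_w=1.0, face_w=5.0, hand_w=10.0):
    """pred, target: (B, 3, H, W); masks: (B, 1, H, W) with values in {0, 1}."""
    weight = base_w + face_w * face_mask + hand_w * hand_mask
    return (weight * (pred - target).abs()).mean()

pred = torch.rand(2, 3, 256, 256)
target = torch.rand(2, 3, 256, 256)
face_mask = torch.zeros(2, 1, 256, 256)
hand_mask = torch.zeros(2, 1, 256, 256)
face_mask[:, :, 40:90, 100:160] = 1.0   # e.g. from an off-the-shelf face/hand detector
hand_mask[:, :, 180:230, 60:120] = 1.0
loss = region_weighted_l1(pred, target, face_mask, hand_mask)
```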
arXiv Detail & Related papers (2024-09-29T06:46:31Z)
- UniLearn: Enhancing Dynamic Facial Expression Recognition through Unified Pre-Training and Fine-Tuning on Images and Videos [83.48170683672427]
UniLearn is a unified learning paradigm that integrates static facial expression recognition data to enhance the dynamic facial expression recognition (DFER) task.
UniLearn consistently achieves state-of-the-art performance on the FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively.
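A common recipe for unifying image and video training is to replicate a static image along the time axis so both data types share one video backbone. The sketch below illustrates that idea under assumed shapes; it is not necessarily UniLearn's exact procedure.

```python
# Sketch of feeding both static images and video clips through one video model by
# treating an image as a clip of T identical frames. This is an assumption about
# the general recipe, not UniLearn's exact implementation.
import torch

T = 16  # clip length used by the video backbone

def image_to_clip(image_batch, num_frames=T):
    """(B, 3, H, W) -> (B, 3, T, H, W) by repeating the frame along time."""
    return image_batch.unsqueeze(2).expand(-1, -1, num_frames, -1, -1).contiguous()

images = torch.rand(4, 3, 112, 112)        # static facial expression images
clips = torch.rand(4, 3, T, 112, 112)      # dynamic facial expression clips

batch = torch.cat([image_to_clip(images), clips], dim=0)  # one mixed batch
# batch now has shape (8, 3, 16, 112, 112) and can be fed to any video backbone.
```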
arXiv Detail & Related papers (2024-09-10T01:57:57Z)
- ViTGaze: Gaze Following with Interaction Features in Vision Transformers [42.08842391756614]
We introduce a novel single-modality gaze following framework called ViTGaze.
In contrast to previous methods, its framework is based mainly on powerful encoders.
Our method achieves state-of-the-art (SOTA) performance among all single-modality methods.
arXiv Detail & Related papers (2024-03-19T14:45:17Z)
- Continuous Sign Language Recognition Based on Motor attention mechanism and frame-level Self-distillation [17.518587972114567]
We propose a novel motor attention mechanism to capture the distorted changes in local motion regions during sign language expression.
For the first time, we apply the self-distillation method to frame-level feature extraction for continuous sign language.
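Frame-level self-distillation can be illustrated as pulling per-frame student features toward a stop-gradient teacher. The EMA teacher and MSE objective below are assumptions about the general technique rather than this paper's exact formulation.

```python
# Hypothetical frame-level self-distillation: per-frame student features are pulled
# toward a stop-gradient teacher (here an EMA copy of the student). Details assumed.
import copy
import torch
import torch.nn as nn

student = nn.Linear(512, 256)              # stands in for a frame-level feature head
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def update_teacher(momentum=0.99):
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1 - momentum)

frames = torch.randn(2, 32, 512)           # (batch, num_frames, feature_dim)
s_feat = student(frames)                   # (2, 32, 256)
with torch.no_grad():
    t_feat = teacher(frames)

distill_loss = nn.functional.mse_loss(s_feat, t_feat)
distill_loss.backward()
update_teacher()
```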
arXiv Detail & Related papers (2024-02-29T12:52:50Z)
- AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising [25.839194626743126]
AnaMoDiff is a novel diffusion-based method for 2D motion analogies.
Our goal is to accurately transfer motions from a 2D driving video onto a source character while preserving its identity, in terms of appearance and natural movement.
arXiv Detail & Related papers (2024-02-05T22:10:54Z)
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation of arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z)
- Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
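A toy reading of a significance-determined noise schedule is that less important tokens are corrupted earlier in the forward process. The masking rule, scores, and codebook size below are purely illustrative assumptions, not M2DM's actual schedule.

```python
# Toy sketch of a significance-aware forward corruption for discrete motion tokens:
# at diffusion step t (0 = clean, 1 = fully corrupted), the least significant
# fraction t of tokens is replaced by a [MASK] token. Scores and rule are assumed.
import torch

MASK_ID = 1024  # codebook size assumed to be 1024; id 1024 acts as [MASK]

def corrupt(tokens, significance, t):
    """tokens, significance: (B, L); t in [0, 1]. Returns corrupted tokens."""
    B, L = tokens.shape
    k = int(round(t * L))                          # how many tokens to mask
    if k == 0:
        return tokens.clone()
    idx = significance.argsort(dim=1)[:, :k]       # k least significant tokens
    out = tokens.clone()
    out.scatter_(1, idx, MASK_ID)
    return out

tokens = torch.randint(0, 1024, (2, 8))
significance = torch.rand(2, 8)
print(corrupt(tokens, significance, t=0.5))        # half the tokens masked
```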
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
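Taken literally, the two modules can be sketched as a channel gate computed from pooled inter-frame differences and a spatial emphasis map from per-location similarity between adjacent feature maps. The exact formulation below is an assumption, not the paper's code.

```python
# Rough sketch of the two ideas as described in the abstract: a channel-wise gate from
# inter-frame feature differences (CME-like) and a spatial emphasis map from
# point-to-point similarity between adjacent frames (SME-like). Details assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionEnhance(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.gate_fc = nn.Linear(channels, channels)

    def forward(self, feat_t, feat_t1):
        """feat_t, feat_t1: adjacent frame features of shape (B, C, H, W)."""
        diff = feat_t1 - feat_t
        # Channel-wise gate: pooled motion difference -> sigmoid gate per channel
        gate = torch.sigmoid(self.gate_fc(diff.mean(dim=(2, 3))))        # (B, C)
        feat = feat_t * gate[:, :, None, None]
        # Spatial emphasis: low cosine similarity between adjacent frames = motion
        sim = F.cosine_similarity(feat_t, feat_t1, dim=1, eps=1e-6)      # (B, H, W)
        spatial = (1.0 - sim).unsqueeze(1)                               # (B, 1, H, W)
        return feat * (1.0 + spatial)

module = MotionEnhance(64)
out = module(torch.randn(2, 64, 28, 28), torch.randn(2, 64, 28, 28))
print(out.shape)  # torch.Size([2, 64, 28, 28])
```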
arXiv Detail & Related papers (2021-03-23T03:06:26Z)
- Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation that leverage adversarial learning to unify the source and target video representations are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
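A generic way to model source-target interactions is an affinity (bipartite edge) matrix between source and target clip features with attention-style aggregation. The sketch below illustrates that reading and is not ABG's actual implementation.

```python
# Minimal sketch of bipartite source-target modeling: an affinity matrix between
# source and target clip features, used to aggregate target context into source
# features (and vice versa). This is a generic reading of the idea, not ABG's code.
import torch
import torch.nn.functional as F

def bipartite_message_passing(src, tgt, temperature=0.1):
    """src: (Ns, D) source features; tgt: (Nt, D) target features."""
    src_n = F.normalize(src, dim=1)
    tgt_n = F.normalize(tgt, dim=1)
    affinity = src_n @ tgt_n.T / temperature          # (Ns, Nt) bipartite edges
    src_agg = F.softmax(affinity, dim=1) @ tgt        # target context for each source
    tgt_agg = F.softmax(affinity.T, dim=1) @ src      # source context for each target
    return src_agg, tgt_agg

src_feats = torch.randn(8, 256)    # labeled source-domain video features
tgt_feats = torch.randn(8, 256)    # unlabeled target-domain video features
src_ctx, tgt_ctx = bipartite_message_passing(src_feats, tgt_feats)
print(src_ctx.shape, tgt_ctx.shape)  # torch.Size([8, 256]) torch.Size([8, 256])
```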
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
- Moving fast and slow: Analysis of representations and post-processing in speech-driven automatic gesture generation [7.6857153840014165]
We extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning.
Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.
We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
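The post-processing discussed in this line of work typically includes temporal smoothing of the predicted joint trajectories. The snippet below shows one common choice, a Savitzky-Golay filter, with assumed parameters; it is an illustration rather than this paper's exact pipeline.

```python
# Illustration of a typical gesture post-processing step: temporal smoothing of the
# predicted 3D joint coordinates. Filter choice and window are assumptions.
import numpy as np
from scipy.signal import savgol_filter

T, J = 120, 15                       # 120 frames, 15 joints
motion = np.cumsum(np.random.randn(T, J, 3) * 0.01, axis=0)  # fake jittery motion

# Smooth each coordinate trajectory along the time axis.
smoothed = savgol_filter(motion, window_length=9, polyorder=3, axis=0)
print(smoothed.shape)                # (120, 15, 3)
```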
arXiv Detail & Related papers (2020-07-16T07:32:00Z)