When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning
- URL: http://arxiv.org/abs/2504.05748v1
- Date: Tue, 08 Apr 2025 07:25:12 GMT
- Title: When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning
- Authors: Tri Tung Nguyen Nguyen, Quang Tien Dam, Dinh Tuan Tran, Joo-Ho Lee
- Abstract summary: This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process.
- Score: 1.2974519529978974
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective human behavior modeling is critical for successful human-robot interaction. Current state-of-the-art approaches for predicting listening-head behavior during dyadic conversations employ continuous-to-discrete representations, where continuous facial motion sequences are converted into discrete latent tokens. However, non-verbal facial motion presents unique challenges owing to its temporal variance and multi-modal nature. State-of-the-art discrete motion token representations struggle to capture the underlying non-verbal facial patterns, making listening-head training inefficient and the generated motion low-fidelity. This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of keyframes and transition frames. By identifying crucial motion steps and interpolating the intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process. Additionally, we apply this sparse representation to the task of listening-head prediction, demonstrating its contribution to better explaining facial motion patterns.
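The sparse representation can be pictured with a minimal sketch: pick the frames where the motion changes most as keyframes, then rebuild the in-between (transition) frames by interpolation. The snippet below only illustrates that idea under stated assumptions; the velocity-based selection rule, the plain linear interpolation, and the function names `select_keyframes` / `reconstruct` are not from the paper, whose keyframe identification and transition-frame generation are learned.

```python
import numpy as np

def select_keyframes(motion: np.ndarray, num_keys: int) -> np.ndarray:
    """Pick the frames with the largest frame-to-frame change as keyframes.

    motion: (T, D) array of facial motion coefficients per frame.
    Returns sorted keyframe indices; the first and last frames are always
    kept so the full sequence can be reconstructed.
    """
    velocity = np.linalg.norm(np.diff(motion, axis=0), axis=1)   # (T-1,)
    # Treat the highest-velocity steps as the "crucial motion steps".
    candidates = np.argsort(velocity)[::-1][: max(num_keys - 2, 0)] + 1
    return np.unique(np.concatenate([[0], candidates, [len(motion) - 1]]))

def reconstruct(motion: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Rebuild the dense sequence by linearly interpolating between keyframes."""
    T, D = motion.shape
    dense = np.empty_like(motion)
    for d in range(D):
        dense[:, d] = np.interp(np.arange(T), keys, motion[keys, d])
    return dense

# Toy usage: 100 frames of 5-D motion compressed to roughly 10 keyframes.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    motion = np.cumsum(rng.normal(size=(100, 5)), axis=0)
    keys = select_keyframes(motion, num_keys=10)
    approx = reconstruct(motion, keys)
    print(f"{len(keys)} keyframes, reconstruction MSE: "
          f"{np.mean((motion - approx) ** 2):.4f}")
```

Compressing a dense sequence this way keeps the temporal skeleton (the keyframes) while leaving the transitions free to vary, which is the intuition behind the claimed gain in instance-wise diversity.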
Related papers
- KeyFace: Expressive Audio-Driven Facial Animation for Long Sequences via KeyFrame Interpolation [37.27908280809964]
KeyFace is a novel two-stage diffusion-based framework for facial animation.
The first stage generates keyframes, and a second-stage interpolation model fills in the gaps between them, ensuring smooth transitions and temporal coherence.
To further enhance realism, we incorporate continuous emotion representations and handle a wide range of non-speech vocalizations (NSVs).
Experimental results show that KeyFace outperforms state-of-the-art methods in generating natural, coherent facial animations over extended durations.
arXiv Detail & Related papers (2025-03-03T16:31:55Z) - KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding [19.15471840100407]
We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings.
Our method integrates linguistic and data-driven priors through two modules: the linguistic-based key motion acquisition and the cross-modal motion completion.
The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency.
arXiv Detail & Related papers (2024-09-02T09:41:24Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish the less ambiguous mapping from audio to landmark motion of lip and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - Dyadic Interaction Modeling for Social Behavior Generation [6.626277726145613]
We present an effective framework for creating 3D facial motions in dyadic interactions.
The heart of our framework is Dyadic Interaction Modeling (DIM), a pre-training approach.
Experiments demonstrate the superiority of our framework in generating listener motions.
arXiv Detail & Related papers (2024-03-14T03:21:33Z) - Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a novel spatio-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - Persistent-Transient Duality: A Multi-mechanism Approach for Modeling Human-Object Interaction [58.67761673662716]
Humans are highly adaptable, swiftly switching between different modes to handle different tasks, situations and contexts.
In human-object interaction (HOI) activities, these modes can be attributed to two mechanisms: (1) the large-scale consistent plan for the whole activity and (2) the small-scale child interactive actions that start and end along the timeline.
This work proposes to model two concurrent mechanisms that jointly control human motion.
arXiv Detail & Related papers (2023-07-24T12:21:33Z) - Pose-Controllable 3D Facial Animation Synthesis using Hierarchical Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z) - CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior [27.989344587876964]
Speech-driven 3D facial animation has been widely studied, yet there is still a gap to achieving realism and vividness.
We propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook.
We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-01-06T05:04:32Z) - Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis [15.700918566471277]
We present a one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze and blink, head pose, and emotional expression.
We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them.
arXiv Detail & Related papers (2022-11-26T07:52:46Z) - Dyadic Human Motion Prediction [119.3376964777803]
We introduce a motion prediction framework that explicitly reasons about the interactions of two observed subjects.
Specifically, we achieve this by introducing a pairwise attention mechanism that models the mutual dependencies in the motion history of the two subjects (see the sketch after this list).
This allows us to preserve the long-term motion dynamics in a more realistic way and more robustly predict unusual and fast-paced movements.
arXiv Detail & Related papers (2021-12-01T10:30:40Z)
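As referenced above, the pairwise attention idea in "Dyadic Human Motion Prediction" can be illustrated with a short, hedged sketch: one subject's motion history attends to the other's, and applying the block in both directions captures their mutual dependencies. The module name `PairwiseAttention`, the single-head design, and the tensor shapes below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PairwiseAttention(nn.Module):
    """Illustrative cross-attention: subject A's motion history attends to B's.

    Assumed input shape for each subject: (batch, frames, feature_dim).
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, motion_a: torch.Tensor, motion_b: torch.Tensor) -> torch.Tensor:
        # Queries come from subject A, keys/values from subject B, so A's
        # features are updated with B's motion history; mutual dependency is
        # obtained by also applying the module in the opposite direction.
        q = self.query(motion_a)                       # (B, Ta, D)
        k = self.key(motion_b)                         # (B, Tb, D)
        v = self.value(motion_b)                       # (B, Tb, D)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return motion_a + attn @ v                     # residual update of A

# Toy usage: two subjects, 30 observed frames each, 128-D pose features.
if __name__ == "__main__":
    pairwise = PairwiseAttention(dim=128)
    a = torch.randn(2, 30, 128)
    b = torch.randn(2, 30, 128)
    a_updated = pairwise(a, b)   # A conditioned on B's history
    b_updated = pairwise(b, a)   # B conditioned on A's history
    print(a_updated.shape, b_updated.shape)
```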