GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
- URL: http://arxiv.org/abs/2501.18898v2
- Date: Sat, 08 Mar 2025 01:34:25 GMT
- Title: GestureLSM: Latent Shortcut based Co-Speech Gesture Generation with Spatial-Temporal Modeling
- Authors: Pinxin Liu, Luchuan Song, Junhua Huang, Haiyang Liu, Chenliang Xu,
- Abstract summary: GestureLSM is a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling.<n>It achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods.
- Score: 32.47567372398872
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating full-body human gestures based on speech signals remains challenges on quality and speed. Existing approaches model different body regions such as body, legs and hands separately, which fail to capture the spatial interactions between them and result in unnatural and disjointed movements. Additionally, their autoregressive/diffusion-based pipelines show slow generation speed due to dozens of inference steps. To address these two challenges, we propose GestureLSM, a flow-matching-based approach for Co-Speech Gesture Generation with spatial-temporal modeling. Our method i) explicitly model the interaction of tokenized body regions through spatial and temporal attention, for generating coherent full-body gestures. ii) introduce the flow matching to enable more efficient sampling by explicitly modeling the latent velocity space. To overcome the suboptimal performance of flow matching baseline, we propose latent shortcut learning and beta distribution time stamp sampling during training to enhance gesture synthesis quality and accelerate inference. Combining the spatial-temporal modeling and improved flow matching-based framework, GestureLSM achieves state-of-the-art performance on BEAT2 while significantly reducing inference time compared to existing methods, highlighting its potential for enhancing digital humans and embodied agents in real-world applications. Project Page: https://andypinxinliu.github.io/GestureLSM
Related papers
- Electromyography-Based Gesture Recognition: Hierarchical Feature Extraction for Enhanced Spatial-Temporal Dynamics [0.7083699704958353]
We propose a lightweight squeeze-excitation deep learning-based multi stream spatial temporal dynamics time-varying feature extraction approach.
The proposed model was tested on the Ninapro DB2, DB4, and DB5 datasets, achieving accuracy rates of 96.41%, 92.40%, and 93.34%, respectively.
arXiv Detail & Related papers (2025-04-04T07:11:12Z) - ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.
The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.
To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z) - DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech [42.663766380488205]
DIDiffGes can synthesize high-quality, expressive gestures from speech using only a few sampling steps.
Our method outperforms state-of-the-art approaches in human likeness, appropriateness, and style correctness.
arXiv Detail & Related papers (2025-03-21T11:23:39Z) - Streaming Generation of Co-Speech Gestures via Accelerated Rolling Diffusion [0.881371061335494]
We introduce Accelerated Rolling Diffusion, a novel framework for streaming gesture generation.
RDLA restructures the noise schedule into a stepwise ladder, allowing multiple frames to be denoised simultaneously.
This significantly improves sampling efficiency while maintaining motion consistency, achieving up to a 2x speedup.
arXiv Detail & Related papers (2025-03-13T15:54:45Z) - Multi-Resolution Generative Modeling of Human Motion from Limited Data [3.5229503563299915]
We present a generative model that learns to synthesize human motion from limited training sequences.
The model adeptly captures human motion patterns by integrating skeletal convolution layers and a multi-scale architecture.
arXiv Detail & Related papers (2024-11-25T15:36:29Z) - Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting [16.782154479264126]
Predicting backbone-temporal traffic flow presents challenges due to complex interactions between temporal factors.
Existing approaches address these dimensions in isolation, neglecting their critical interdependencies.
In this paper, we introduce Sanonymous-Temporal Unitized Unitized Cell (ASTUC), a unified framework designed to capture both spatial and temporal dependencies.
arXiv Detail & Related papers (2024-11-14T07:34:31Z) - Radio Frequency Signal based Human Silhouette Segmentation: A Sequential Diffusion Approach [26.987963024941635]
We propose a two-stage Sequential Diffusion Model (SDM) to synthesize high-quality segmentation jointly.
Cross-view blocks are devised to guide the diffusion model in a multi-scale manner.
temporal blocks are devised to fine-tune the frame-level model to incorporate frequency-temporal contexts and motion dynamics.
arXiv Detail & Related papers (2024-07-27T12:44:21Z) - Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals [5.283718601431859]
Invasive brain-computer interfaces with Electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications.
We developed the Du-IN model, which extracts contextual embeddings based on region-level tokens through discrete codex-guided mask modeling.
Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines.
arXiv Detail & Related papers (2024-05-19T06:00:36Z) - Stragglers-Aware Low-Latency Synchronous Federated Learning via Layer-Wise Model Updates [71.81037644563217]
Synchronous federated learning (FL) is a popular paradigm for collaborative edge learning.
As some of the devices may have limited computational resources and varying availability, FL latency is highly sensitive to stragglers.
We propose straggler-aware layer-wise federated learning (SALF) that leverages the optimization procedure of NNs via backpropagation to update the global model in a layer-wise fashion.
arXiv Detail & Related papers (2024-03-27T09:14:36Z) - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z) - A Decoupled Spatio-Temporal Framework for Skeleton-based Action
Segmentation [89.86345494602642]
Existing methods are limited in weak-temporal modeling capability.
We propose a Decoupled Scoupled Framework (DeST) to address the issues.
DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z) - DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for
Accelerated Seq2Seq Diffusion Models [58.450152413700586]
We introduce a soft absorbing state that facilitates the diffusion model in learning to reconstruct discrete mutations based on the underlying Gaussian space.
We employ state-of-the-art ODE solvers within the continuous space to expedite the sampling process.
Our proposed method effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster.
arXiv Detail & Related papers (2023-10-09T15:29:10Z) - Spatio-Temporal Branching for Motion Prediction using Motion Increments [55.68088298632865]
Human motion prediction (HMP) has emerged as a popular research topic due to its diverse applications.
Traditional methods rely on hand-crafted features and machine learning techniques.
We propose a noveltemporal-temporal branching network using incremental information for HMP.
arXiv Detail & Related papers (2023-08-02T12:04:28Z) - SLGTformer: An Attention-Based Approach to Sign Language Recognition [19.786769414376323]
Sign language is difficult to learn and represents a significant barrier for those who are hard of hearing or unable to speak.
We propose a novel, attention-based approach to Sign Language Recognition built upon deconstructing temporal graph self-attention.
We demonstrate the effectiveness of SLformer on the World-Level American Sign Language (WLASL) dataset.
arXiv Detail & Related papers (2022-12-21T03:30:43Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods in a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.