Text-Driven 3D Hand Motion Generation from Sign Language Data
- URL: http://arxiv.org/abs/2508.15902v1
- Date: Thu, 21 Aug 2025 18:02:47 GMT
- Title: Text-Driven 3D Hand Motion Generation from Sign Language Data
- Authors: Léore Bensabath, Mathis Petrovich, Gül Varol
- Abstract summary: Our goal is to train a generative model of 3D hand motions conditioned on natural language descriptions. We leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories. This data enables training a text-conditioned hand motion diffusion model, HandMDM.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Our goal is to train a generative model of 3D hand motions, conditioned on natural language descriptions specifying motion characteristics such as handshapes, locations, finger/hand/arm movements. To this end, we automatically build pairs of 3D hand motions and their associated textual labels with unprecedented scale. Specifically, we leverage a large-scale sign language video dataset, along with noisy pseudo-annotated sign categories, which we translate into hand motion descriptions via an LLM that utilizes a dictionary of sign attributes, as well as our complementary motion-script cues. This data enables training a text-conditioned hand motion diffusion model HandMDM, that is robust across domains such as unseen sign categories from the same sign language, but also signs from another sign language and non-sign hand movements. We contribute extensive experimental investigation of these scenarios and will make our trained models and data publicly available to support future research in this relatively new field.
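The abstract describes a text-conditioned hand motion diffusion model. As an illustration only (the paper's actual architecture and noise schedule are not specified here), the core sampling idea of such a model can be sketched as a denoising loop with classifier-free guidance; the `denoise` stand-in, array shapes, and guidance value below are all hypothetical:

```python
import numpy as np

def denoise(x_t, t, text_emb):
    # Stand-in for the learned denoiser; a real model would be a
    # transformer predicting noise from (noisy motion, timestep, text embedding).
    return 0.1 * x_t + 0.01 * text_emb  # illustrative placeholder

def sample_motion(text_emb, n_frames=60, n_joints=21, steps=50, guidance=2.5, seed=0):
    """Toy DDPM-style sampling loop with classifier-free guidance."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_frames, n_joints * 3))  # start from pure noise
    null_emb = np.zeros_like(text_emb)                 # unconditional branch
    for t in range(steps, 0, -1):
        eps_cond = denoise(x, t, text_emb)
        eps_uncond = denoise(x, t, null_emb)
        # Guided noise estimate: push samples toward the text condition.
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)
        x = x - eps / steps  # simplified update (no real noise schedule)
    return x  # (n_frames, n_joints * 3) motion sequence
```

This is a minimal sketch of conditional diffusion sampling in general, not the paper's implementation.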
Related papers
- CLUTCH: Contextualized Language model for Unlocking Text-Conditioned Hand motion modelling in the wild [41.0322780136795]
We introduce '3D Hands in the Wild' (3D-HIW), a dataset of 32K 3D hand-motion sequences and aligned text. We then propose CLUTCH, an LLM-based hand animation system with two critical innovations: (a) SHIFT, a novel VQ-VAE architecture to tokenize hand motion, and (b) a geometric refinement stage to finetune the LLM. Experiments demonstrate state-of-the-art performance on text-to-motion and motion-to-text tasks, establishing the first benchmark for scalable in-the-wild hand motion modelling.
arXiv Detail & Related papers (2026-02-19T19:02:22Z) - Hierarchical Motion Captioning Utilizing External Text Data Source [1.2072875728151518]
This paper introduces a novel approach to enhance existing motion captioning methods. We use large language models to create detailed descriptions corresponding to each high-level caption that appears in the motion-text datasets. It aligns the detailed low-level captions with candidate high-level captions from additional text data sources, and combines them with motion features to produce precise high-level captions.
arXiv Detail & Related papers (2025-09-01T13:39:14Z) - ReAlign: Bilingual Text-to-Motion Generation via Step-Aware Reward-Guided Alignment [48.894439350114396]
We propose a novel bilingual human motion dataset, BiHumanML3D, which establishes a crucial benchmark for bilingual text-to-motion generation models. We also propose a Bilingual Motion Diffusion model (BiMD), which leverages cross-lingual aligned representations to capture semantics. We show that our approach significantly improves text-motion alignment and motion quality compared to existing state-of-the-art methods.
arXiv Detail & Related papers (2025-05-08T06:19:18Z) - Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues [56.36041287155606]
We investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks.
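The quantization step mentioned above (encoding continuous motion into discrete tokens via a VQ-VAE) amounts to a nearest-codebook lookup at inference time. A minimal sketch, with a random stand-in for the learned codebook and invented shapes:

```python
import numpy as np

def tokenize_motion(frames, codebook):
    """Quantize per-frame motion features to discrete token ids by
    nearest-neighbour lookup in a codebook (VQ-VAE style inference)."""
    # frames: (T, D) motion features; codebook: (K, D) code vectors.
    # Squared L2 distance from every frame to every codebook entry.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return d.argmin(axis=1)  # (T,) token ids, usable as LM input

# Toy usage with a random "codebook"; a real one is learned end-to-end
# jointly with the encoder and decoder.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 32))
frames = rng.standard_normal((120, 32))
tokens = tokenize_motion(frames, codebook)
```

The resulting token sequence can be fed to a language model alongside text tokens, which is the general mechanism these papers build on.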
arXiv Detail & Related papers (2025-03-05T13:10:07Z) - Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z) - MotionFix: Text-Driven 3D Human Motion Editing [52.11745508960547]
Key challenges include the scarcity of training data and the need to design a model that accurately edits the source motion.
We propose a methodology to semi-automatically collect a dataset of triplets comprising (i) a source motion, (ii) a target motion, and (iii) an edit text.
Access to this data allows us to train a conditional diffusion model, TMED, that takes both the source motion and the edit text as input.
arXiv Detail & Related papers (2024-08-01T16:58:50Z) - EvSign: Sign Language Recognition and Translation with Streaming Events [59.51655336911345]
Event camera could naturally perceive dynamic hand movements, providing rich manual clues for sign language tasks.
We propose an efficient transformer-based framework for event-based SLR and SLT tasks.
Our method performs favorably against existing state-of-the-art approaches with only 0.34% of the computational cost.
arXiv Detail & Related papers (2024-07-17T14:16:35Z) - SignAvatar: Sign Language 3D Motion Reconstruction and Generation [10.342253593687781]
SignAvatar is a framework capable of both word-level sign language reconstruction and generation. We contribute the ASL3DWord dataset, composed of 3D joint rotation data for the body, hands, and face.
arXiv Detail & Related papers (2024-05-13T17:48:22Z) - Motion Generation from Fine-grained Textual Descriptions [29.033358642532722]
We build a large-scale language-motion dataset specializing in fine-grained textual descriptions, FineHumanML3D.
We design a new text2motion model, FineMotionDiffuse, making full use of fine-grained textual information.
Our evaluation shows that FineMotionDiffuse trained on FineHumanML3D improves FID by a large margin of 0.38, compared with competitive baselines.
arXiv Detail & Related papers (2024-03-20T11:38:30Z) - MotionScript: Natural Language Descriptions for Expressive 3D Human Motions [8.050271017133076]
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models.
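Rule-based motion-to-text description of this kind can be illustrated with a toy captioner that maps joint positions to short text cues; the joint names, thresholds, and phrases below are purely illustrative assumptions, not MotionScript's actual rules:

```python
import numpy as np

def describe_frame(joints):
    """Toy rule-based captioner: derive short text cues from 3D joint
    positions (x, y, z with y up). All rules here are hypothetical."""
    cues = []
    # Vertical comparison: is the wrist higher than the shoulder?
    if joints["wrist"][1] > joints["shoulder"][1]:
        cues.append("hand raised above shoulder")
    # Lateral comparison: is the wrist close to the body midline?
    if abs(joints["wrist"][0] - joints["chest"][0]) < 0.1:
        cues.append("hand near body midline")
    return ", ".join(cues) or "neutral pose"

pose = {"wrist": np.array([0.05, 1.6, 0.2]),
        "shoulder": np.array([0.2, 1.5, 0.0]),
        "chest": np.array([0.0, 1.3, 0.0])}
caption = describe_frame(pose)  # -> "hand raised above shoulder, hand near body midline"
```

Systems in this space apply many such structured rules per frame and per joint, then aggregate them over time into fluent descriptions.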
arXiv Detail & Related papers (2023-12-19T22:33:17Z) - BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation.
Our dataset includes accurate motion tracking for the human body and hands.
We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z) - MotionGPT: Human Motion as a Foreign Language [47.21648303282788]
Human motion displays a semantic coupling akin to human language, often perceived as a form of body language.
By fusing language data with large-scale motion models, motion-language pre-training can enhance the performance of motion-related tasks.
We propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks.
arXiv Detail & Related papers (2023-06-26T15:53:02Z)