Related papers: The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion

URL: http://arxiv.org/abs/2412.10523v1
Date: Fri, 13 Dec 2024 19:33:48 GMT
Title: The Language of Motion: Unifying Verbal and Non-verbal Language of 3D Human Motion
Authors: Changan Chen, Juze Zhang, Shrinidhi K. Lakshmikanth, Yusu Fang, Ruizhi Shao, Gordon Wetzstein, Li Fei-Fei, Ehsan Adeli
Abstract summary: We propose a novel framework that unifies verbal and non-verbal language using multimodal language models. Our model achieves state-of-the-art performance on co-speech gesture generation. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications.
Score: 46.01825432018138
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human communication is inherently multimodal, involving a combination of verbal and non-verbal cues such as speech, facial expressions, and body gestures. Modeling these behaviors is essential for understanding human interaction and for creating virtual characters that can communicate naturally in applications like games, films, and virtual reality. However, existing motion generation models are typically limited to specific input modalities -- either speech, text, or motion data -- and cannot fully leverage the diversity of available data. In this paper, we propose a novel framework that unifies verbal and non-verbal language using multimodal language models for human motion understanding and generation. This model is flexible in taking text, speech, and motion or any combination of them as input. Coupled with our novel pre-training strategy, our model not only achieves state-of-the-art performance on co-speech gesture generation but also requires much less data for training. Our model also unlocks an array of novel tasks such as editable gesture generation and emotion prediction from motion. We believe unifying the verbal and non-verbal language of human motion is essential for real-world applications, and language models offer a powerful approach to achieving this goal. Project page: languageofmotion.github.io.

Related papers

MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction. We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body [25.087113475162592]
ViBES (Voice in Behavioral Expression and Synchrony) is a conversational 3D agent that jointly plans language and movement. ViBES goes beyond "speech-conditioned motion generation" toward agentic virtual bodies where language, prosody, and movement are jointly generated.
arXiv Detail & Related papers (2025-12-16T09:41:21Z)
ImaGGen: Zero-Shot Generation of Co-Speech Semantic Gestures Grounded in Language and Image Input [0.0]
This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. We introduce a zero-shot system that generates gestures from a given language input and additionally is informed by imagistic input, without manual annotation or human intervention. Our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars.
arXiv Detail & Related papers (2025-10-20T15:01:56Z)
Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z)
Motion-Agent: A Conversational Framework for Human Motion Generation with LLMs [67.59291068131438]
Motion-Agent is a conversational framework designed for general human motion generation, editing, and understanding. Motion-Agent employs an open-source pre-trained language model to develop a generative agent, MotionLLM, that bridges the gap between motion and text.
arXiv Detail & Related papers (2024-05-27T09:57:51Z)
Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models. Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model. To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
MotionChain: Conversational Motion Controllers via Multimodal Prompts [25.181069337771127]
We present MotionChain, a conversational human motion controller to generate continuous and long-term human motion through multimodal prompts. By leveraging large-scale language, vision-language, and vision-motion data, MotionChain comprehends each instruction in multi-turn conversation and generates human motions followed by these prompts.
arXiv Detail & Related papers (2024-04-02T07:09:29Z)
ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis. Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities. Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
MotionScript: Natural Language Descriptions for Expressive 3D Human Motions [8.050271017133076]
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models.
arXiv Detail & Related papers (2023-12-19T22:33:17Z)
MotionGPT: Human Motion as a Foreign Language [47.21648303282788]
Human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training can enhance the performance of motion-related tasks. We propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks.
arXiv Detail & Related papers (2023-06-26T15:53:02Z)
Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations. We autoregressively output multiple possibilities of corresponding listener motion. Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
Passing a Non-verbal Turing Test: Evaluating Gesture Animations Generated from Speech [6.445605125467574]
In this paper, we propose a novel, data-driven technique for generating gestures directly from speech. Our approach is based on the application of Generative Adversarial Neural Networks (GANs) to model the correlation rather than causation between speech and gestures. For the study, we animate the generated gestures on a virtual character. We find that users are not able to distinguish between the generated and the recorded gestures.
arXiv Detail & Related papers (2021-07-01T19:38:43Z)
Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures. Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.