Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
- URL: http://arxiv.org/abs/2403.00691v1
- Date: Fri, 1 Mar 2024 17:23:30 GMT
- Title: Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
- Authors: Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian
- Abstract summary: LAVIMO is a framework for three-modality learning integrating human-centric videos as an additional modality.
Our results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance in various motion-related cross-modal retrieval tasks.
- Score: 4.550873593248722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Information retrieval is an ever-evolving and crucial research domain. The
substantial demand for high-quality human motion data, especially for online
acquisition, has led to a surge in human motion research. Prior works have
mainly concentrated on dual-modality learning, such as text and motion tasks,
but three-modality learning has been rarely explored. Intuitively, an extra
introduced modality can enrich a model's application scenario, and more
importantly, an adequate choice of the extra modality can also act as an
intermediary and enhance the alignment between the other two disparate
modalities. In this work, we introduce LAVIMO (LAnguage-VIdeo-MOtion
alignment), a novel framework for three-modality learning integrating
human-centric videos as an additional modality, thereby effectively bridging
the gap between text and motion. Moreover, our approach leverages a specially
designed attention mechanism to foster enhanced alignment and synergistic
effects among text, video, and motion modalities. Empirically, our results on
the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art
performance in various motion-related cross-modal retrieval tasks, including
text-to-motion, motion-to-text, video-to-motion and motion-to-video.
Related papers
- Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space [15.146062492621265]
Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation.
Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality.
We propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space.
arXiv Detail & Related papers (2025-07-31T01:59:38Z) - MotionGPT3: Human Motion as a Second Modality [20.804747077748953]
We propose MotionGPT3, a bimodal motion-language model that treats human motion as a second modality.
To preserve language intelligence, the text branch retains the original structure and parameters of the pretrained language model.
Our approach achieves competitive performance on both motion understanding and generation tasks.
arXiv Detail & Related papers (2025-06-30T17:42:22Z) - UniHM: Universal Human Motion Generation with Object Interactions in Indoor Scenes [26.71077287710599]
We propose UniHM, a unified motion language model that leverages diffusion-based generation for scene-aware human motion.
UniHM is the first framework to support both Text-to-Motion and Text-to-Human-Object Interaction (HOI) in complex 3D scenes.
Our approach introduces three key contributions: (1) a mixed-motion representation that fuses continuous 6DoF motion with discrete local motion tokens to improve motion realism; (2) a novel Look-Up-Free Quantization VAE that surpasses traditional VQ-VAEs in both reconstruction accuracy and generative performance; and (3) an enriched version of
arXiv Detail & Related papers (2025-05-19T07:02:12Z) - MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLM).
It supports multimodal control conditions through pre-trained Large Language Models (LLMs).
It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z) - Versatile Motion Language Models for Multi-Turn Interactive Agents [28.736843383405603]
We introduce the Versatile Interactive Motion language model, which integrates both language and motion modalities.
We evaluate the versatility of our method across motion-related tasks: motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences.
arXiv Detail & Related papers (2024-10-08T02:23:53Z) - Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for motion sequences that are most relevant to a natural language description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z) - MotionLLM: Understanding Human Behaviors from Human Motions and Videos [40.132643319573205]
This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding.
We present MotionLLM, a framework for human motion understanding, captioning, and reasoning.
arXiv Detail & Related papers (2024-05-30T17:59:50Z) - DualTalker: A Cross-Modal Dual Learning Approach for Speech-Driven 3D
Facial Animation [10.73030153404956]
We propose a cross-modal dual-learning framework, termed DualTalker, to improve data usage efficiency.
The framework is trained jointly with the primary task (audio-driven facial animation) and its dual task (lip reading) and shares common audio/motion encoder components.
Our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-08T15:39:56Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete
Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z) - Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z) - MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z) - Relational Graph Learning on Visual and Kinematics Embeddings for
Accurate Gesture Recognition in Robotic Surgery [84.73764603474413]
We propose a novel online approach of multi-modal graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information.
The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset.
arXiv Detail & Related papers (2020-11-03T11:00:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.