Related papers: Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

URL: http://arxiv.org/abs/2507.23188v1
Date: Thu, 31 Jul 2025 01:59:38 GMT
Title: Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space
Authors: Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang, Weixin Si, Shihao Zou
Abstract summary: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. We propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space.
Score: 15.146062492621265
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modality. However, these methods lack a more intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities for improved retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio for the first time in motion retrieval to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.

Related papers

M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.08520614570288]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z)
ExGes: Expressive Human Motion Retrieval and Modulation for Audio-Driven Gesture Synthesis [20.38933807616264]
ExGes is a novel retrieval-enhanced diffusion framework for gesture synthesis. We show that ExGes reduces Fréchet Distance by 6.2% and improves motion diversity by 5.3% over EMAGE. User studies also reveal a 71.3% preference for its naturalness and semantic relevance.
arXiv Detail & Related papers (2025-03-09T07:59:39Z)
InterDance: Reactive 3D Dance Generation with Realistic Duet Interactions [67.37790144477503]
We propose InterDance, a large-scale duet dance dataset that significantly enhances motion quality, data scale, and the variety of dance genres. We introduce a diffusion-based framework with an interaction refinement guidance strategy to optimize the realism of interactions progressively.
arXiv Detail & Related papers (2024-12-22T11:53:51Z)
MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion [8.94802080815133]
MoRAG is a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. We create diverse samples through the spatial composition of the retrieved motions. Our framework can serve as a plug-and-play module, improving the performance of motion diffusion models.
arXiv Detail & Related papers (2024-09-18T17:03:30Z)
Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced task of text-motion retrieval, which aims to search for the sequences most relevant to a natural motion description. Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models. We propose to investigate joint-dataset learning, where we train on multiple text-motion datasets simultaneously. We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z)
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space [4.550873593248722]
LAVIMO is a framework for three-modality learning integrating human-centric videos as an additional modality. Our results on the HumanML3D and KIT-ML datasets show that LAVIMO achieves state-of-the-art performance in various motion-related cross-modal retrieval tasks.
arXiv Detail & Related papers (2024-03-01T17:23:30Z)
SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD). The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences. Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
DiverseMotion: Towards Diverse Human Motion Generation via Discrete Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions. We show that DiverseMotion achieves state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z)
ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [33.64263969970544]
3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation. We propose ReMoDiffuse, a diffusion-model-based motion generation framework.
arXiv Detail & Related papers (2023-04-03T16:29:00Z)
MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis. We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework. We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)
Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field. It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations. Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)

This list is automatically generated from the titles and abstracts of the papers on this site.