MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction
- URL: http://arxiv.org/abs/2507.06590v1
- Date: Wed, 09 Jul 2025 06:51:36 GMT
- Title: MOST: Motion Diffusion Model for Rare Text via Temporal Clip Banzhaf Interaction
- Authors: Yin Wang, Mu Li, Zhiying Leng, Frederick W. B. Li, Xiaohui Liang
- Abstract summary: We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction. MOST achieves state-of-the-art text-to-motion retrieval and generation performance.
- Score: 17.056288109274327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce MOST, a novel motion diffusion model via temporal clip Banzhaf interaction, aimed at addressing the persistent challenge of generating human motion from rare language prompts. While previous approaches struggle with coarse-grained matching and overlook important semantic cues due to motion redundancy, our key insight lies in leveraging fine-grained clip relationships to mitigate these issues. MOST's retrieval stage presents the first formulation of its kind - temporal clip Banzhaf interaction - which precisely quantifies textual-motion coherence at the clip level. This facilitates direct, fine-grained text-to-motion clip matching and eliminates prevalent redundancy. In the generation stage, a motion prompt module effectively utilizes retrieved motion clips to produce semantically consistent movements. Extensive evaluations confirm that MOST achieves state-of-the-art text-to-motion retrieval and generation performance by comprehensively addressing previous challenges, as demonstrated through quantitative and qualitative results highlighting its effectiveness, especially for rare prompts.
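The abstract does not give the exact form of the temporal clip Banzhaf interaction, but the underlying game-theoretic quantity can be illustrated. Below is a minimal, hypothetical sketch in Python: motion clips are treated as players, a cosine-similarity value function scores how well a coalition of clips matches a text embedding, and the Banzhaf interaction of a clip pair averages the pair's joint marginal contribution over all coalitions of the remaining clips. The embeddings, the value function, and the clip segmentation are placeholders, not MOST's actual formulation.

```python
# Illustrative sketch of a Banzhaf-style interaction between a text embedding
# and temporal motion clips. The value function, embeddings, and clip split
# are assumptions for illustration; MOST's actual design is not specified here.
from itertools import combinations
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def value(text_emb, clip_embs, coalition):
    """v(S): coherence of the text with the mean-pooled clips in coalition S."""
    if not coalition:
        return 0.0
    pooled = np.mean([clip_embs[k] for k in coalition], axis=0)
    return cosine(text_emb, pooled)

def banzhaf_interaction(text_emb, clip_embs, i, j):
    """Average joint marginal contribution of clips i and j over all
    coalitions S drawn from the remaining clips (exact, exponential cost)."""
    others = [k for k in range(len(clip_embs)) if k not in (i, j)]
    total, count = 0.0, 0
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            S = set(subset)
            total += (value(text_emb, clip_embs, S | {i, j})
                      - value(text_emb, clip_embs, S | {i})
                      - value(text_emb, clip_embs, S | {j})
                      + value(text_emb, clip_embs, S))
            count += 1
    return total / count

# toy example: 6 motion clips, 32-d embeddings
rng = np.random.default_rng(0)
text = rng.normal(size=32)
clips = rng.normal(size=(6, 32))
print(banzhaf_interaction(text, clips, 0, 1))
```

The exact sum is exponential in the number of clips, so a practical implementation would approximate it by sampling coalitions rather than enumerating them.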
Related papers
- When Less Is More: A Sparse Facial Motion Structure For Listening Motion Learning [1.2974519529978974]
This study proposes a novel method for representing and predicting non-verbal facial motion by encoding long sequences into a sparse sequence of listening and transition frames. By identifying crucial motion steps and interpolating intermediate frames, our method preserves the temporal structure of motion while enhancing instance-wise diversity during the learning process.
arXiv Detail & Related papers (2025-04-08T07:25:12Z)
- Text2Story: Advancing Video Storytelling with Text Guidance [20.51001299249891]
We introduce a novel storytelling approach to enable seamless video generation with natural action transitions and structured narratives. Our approach bridges the gap between short clips and extended video to establish a new paradigm in GenAI-driven video synthesis from text.
arXiv Detail & Related papers (2025-03-08T19:04:36Z)
- Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model [64.11605839142348]
We introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages.
arXiv Detail & Related papers (2025-02-13T17:50:23Z)
- Chronologically Accurate Retrieval for Temporal Grounding of Motion-Language Models [12.221087476416056]
We propose Chronologically Accurate Retrieval to evaluate the chronological understanding of motion-language models.
We decompose textual descriptions into events, and prepare negative text samples by shuffling the order of events in compound action descriptions.
We then design a simple task for motion-language models to retrieve the more likely text from the ground truth and its chronologically shuffled version (a minimal sketch of this negative construction appears after this list).
arXiv Detail & Related papers (2024-07-22T06:25:21Z)
- Joint-Dataset Learning and Cross-Consistent Regularization for Text-to-Motion Retrieval [4.454835029368504]
We focus on the recently introduced text-motion retrieval task, which aims to search for the sequences that are most relevant to a natural motion description.
Despite recent efforts to explore these promising avenues, a primary challenge remains the insufficient data available to train robust text-motion models.
We propose to investigate joint-dataset learning - where we train on multiple text-motion datasets simultaneously.
We also introduce a transformer-based motion encoder, called MoT++, which employs spatio-temporal attention to process sequences of skeleton data.
arXiv Detail & Related papers (2024-07-02T09:43:47Z)
- Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present InstructMotion, which incorporates the trial-and-error paradigm of reinforcement learning for generalizable human motion generation.
arXiv Detail & Related papers (2024-05-24T13:29:12Z)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
- Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization [67.88493779080882]
Video moment localization aims to retrieve the target segment of an untrimmed video according to the natural language query.
Recent works contrast the cross-modality similarities driven by reconstructing masked queries.
We propose a novel counterfactual cross-modality reasoning method.
arXiv Detail & Related papers (2023-08-10T15:45:45Z)
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
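As a companion to the Chronologically Accurate Retrieval entry above, the sketch below shows one way to build a chronologically shuffled negative from a compound action description. The splitting rule on commas and "then" is an assumption made for illustration; the paper's actual event decomposition may differ.

```python
# Minimal sketch of building chronologically shuffled negatives for a
# compound action description. The "then"/comma splitting rule is assumed
# for illustration only.
import random
import re

def split_events(description: str) -> list[str]:
    """Split a compound description into ordered event phrases."""
    parts = re.split(r",\s*then\s+|\s+then\s+|,\s*", description)
    return [p.strip() for p in parts if p.strip()]

def shuffled_negative(description: str, seed: int = 0) -> str:
    """Return the same events joined in a chronologically wrong order."""
    events = split_events(description)
    if len(events) < 2:
        return description  # nothing to shuffle
    rng = random.Random(seed)
    shuffled = events[:]
    rng.shuffle(shuffled)
    if shuffled == events:  # unlucky shuffle: fall back to a rotation
        shuffled = events[1:] + events[:1]
    return ", then ".join(shuffled)

text = "a person walks forward, then turns left, then sits down"
print(shuffled_negative(text))
# Retrieval task: given the matching motion, the model should score `text`
# above this shuffled negative.
```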