No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
- URL: http://arxiv.org/abs/2510.06988v1
- Date: Wed, 08 Oct 2025 13:12:10 GMT
- Title: No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts
- Authors: Girolamo Macaluso, Lorenzo Mandelli, Mirko Bicchierai, Stefano Berretti, Andrew D. Bagdanov
- Abstract summary: We propose a post-training framework that fine-tunes pretrained motion diffusion models using only textual prompts. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.
- Score: 16.05508249584636
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model's generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.
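As a rough illustration of the pipeline the abstract describes, the sketch below shows what a DDPO-style REINFORCE update with a retrieval-based reward could look like. It is a minimal sketch, not the authors' code: the `policy.sample_noise`, `policy.denoise_with_logprob`, `retriever.encode_text`, and `retriever.encode_motion` interfaces are hypothetical placeholders standing in for a pretrained motion diffusion model and a frozen text-motion retrieval network.

```python
# Hedged sketch of text-only RL post-training for a motion diffusion model.
# All interfaces below are hypothetical placeholders, not the authors' implementation.
import torch
import torch.nn.functional as F


def retrieval_reward(retriever, prompts, motions):
    """Reward: cosine similarity between prompt and generated-motion embeddings,
    both produced by a frozen pretrained text-motion retrieval network."""
    with torch.no_grad():
        t_emb = F.normalize(retriever.encode_text(prompts), dim=-1)    # (B, D)
        m_emb = F.normalize(retriever.encode_motion(motions), dim=-1)  # (B, D)
    return (t_emb * m_emb).sum(dim=-1)                                 # (B,)


def ddpo_update(policy, retriever, prompts, optimizer, num_denoise_steps=50):
    """One REINFORCE-style DDPO update: roll out the reverse diffusion chain,
    score the final motions with the retrieval reward, and push the policy
    toward denoising trajectories that earn higher reward."""
    device = next(policy.parameters()).device
    x_t = policy.sample_noise(len(prompts), device)        # x_T ~ N(0, I)

    log_probs = []
    for t in reversed(range(num_denoise_steps)):
        # Each reverse step returns the next latent/pose and its log-probability
        # under the current (trainable) diffusion policy.
        x_t, log_p = policy.denoise_with_logprob(x_t, t, prompts)
        log_probs.append(log_p)                            # (B,) per step

    motions = x_t                                          # fully denoised samples
    rewards = retrieval_reward(retriever, prompts, motions)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # Policy-gradient loss: no motion ground truth is used anywhere.
    loss = -(advantages.detach() * torch.stack(log_probs).sum(dim=0)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), rewards.mean().item()
```

The key point, as stated in the abstract, is that the retrieval similarity is the only supervision signal, so the generative distribution can be shifted toward a target domain without any paired or captured motion data.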
Related papers
- Fints: Efficient Inference-Time Personalization for LLMs with Fine-Grained Instance-Tailored Steering [49.212940215720884]
We propose a steering framework that generates sample-level interference from user data and injects it into the model's forward pass for personalized adaptation. Our method significantly enhances personalization performance in fast-shifting environments while maintaining robustness across varying interaction modes and context lengths.
arXiv Detail & Related papers (2025-10-31T06:01:04Z) - Reinforcement Learning with Inverse Rewards for World Model Post-training [29.19830208692156]
We propose Reinforcement Learning with Inverse Rewards to improve action-following in video world models. RLIR derives verifiable reward signals by recovering input actions from generated videos using an Inverse Dynamics Model.
arXiv Detail & Related papers (2025-09-28T16:27:47Z) - FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation [51.110607281391154]
FlowMo is a training-free guidance method for enhancing motion coherence in text-to-video models. It estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling.
arXiv Detail & Related papers (2025-06-01T19:55:33Z) - Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose a robust motion generation framework, MoMADiff, to generate 3D human motion from text descriptions. Our model supports flexible user-provided specification, enabling precise control over both spatial and temporal aspects of motion synthesis. Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence.
arXiv Detail & Related papers (2025-05-16T09:06:15Z) - Pre-Trained Video Generative Models as World Simulators [59.546627730477454]
We propose Dynamic World Simulation (DWS) to transform pre-trained video generative models into controllable world simulators. To achieve precise alignment between conditioned actions and generated visual changes, we introduce a lightweight, universal action-conditioned module. Experiments demonstrate that DWS can be applied to both diffusion and autoregressive transformer models.
arXiv Detail & Related papers (2025-02-10T14:49:09Z) - Text-driven Human Motion Generation with Motion Masked Diffusion Model [23.637853270123045]
Text-driven human motion generation is a task that synthesizes human motion sequences conditioned on natural language.
Current diffusion model-based approaches have outstanding performance in the diversity and multimodality of generation.
We propose the Motion Masked Diffusion Model (MMDM), a novel masking mechanism for human motion diffusion models.
arXiv Detail & Related papers (2024-09-29T12:26:24Z) - MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion [8.94802080815133]
MoRAG is a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. We create diverse samples through the spatial composition of the retrieved motions. Our framework can serve as a plug-and-play module, improving the performance of motion diffusion models.
arXiv Detail & Related papers (2024-09-18T17:03:30Z) - Adversarial Robustification via Text-to-Image Diffusion Models [56.37291240867549]
Adversarial robustness has conventionally been considered a challenging property to encode into neural networks.
We develop a scalable and model-agnostic solution to achieve adversarial robustness without using any data.
arXiv Detail & Related papers (2024-07-26T10:49:14Z) - Video Diffusion Models are Training-free Motion Interpreter and Controller [20.361790608772157]
This paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models.
We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels.
arXiv Detail & Related papers (2024-05-23T17:59:40Z) - Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from a thousand steps in previous diffusion models to just ten steps, while achieving comparable performance on text-to-motion and action-to-motion generation benchmarks.
arXiv Detail & Related papers (2023-12-14T12:57:35Z) - Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain.
With only a single representative text feature instead of real images, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.