MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization
- URL: http://arxiv.org/abs/2405.03803v1
- Date: Mon, 6 May 2024 19:19:20 GMT
- Title: MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization
- Authors: Massimiliano Pappa, Luca Collorone, Giovanni Ficarra, Indro Spinelli, Fabio Galasso
- Abstract summary: We propose MoDiPO (Motion Diffusion DPO) to align text-to-motion models.
We streamline the laborious and expensive process of gathering human preferences needed in DPO by leveraging AI feedback.
We demonstrate, both qualitatively and quantitatively, that our proposed method yields significantly more realistic motions.
- Score: 6.147750347011554
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion Models have revolutionized the field of human motion generation by offering exceptional generation quality and fine-grained controllability through natural language conditioning. Their inherent stochasticity, that is, the ability to generate various outputs from a single input, is key to their success. However, this diversity should not be unrestricted, as it may lead to unlikely generations. Instead, it should be confined within the boundaries of text-aligned and realistic generations. To address this issue, we propose MoDiPO (Motion Diffusion DPO), a novel methodology that leverages Direct Preference Optimization (DPO) to align text-to-motion models. We streamline the laborious and expensive process of gathering human preferences needed in DPO by leveraging AI feedback instead. This enables us to experiment with novel DPO strategies, using both online and offline generated motion-preference pairs. To foster future research, we contribute a motion-preference dataset, which we dub Pick-a-Move. We demonstrate, both qualitatively and quantitatively, that our proposed method yields significantly more realistic motions. In particular, MoDiPO substantially improves Fréchet Inception Distance (FID) while retaining the same R-Precision and Multi-Modality performance.
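For readers unfamiliar with DPO, the following is a minimal sketch of the standard pairwise DPO loss for a single preference pair, written against plain log-probabilities. It is an illustration only: the abstract does not spell out MoDiPO's exact objective, and diffusion models generally need a surrogate for the log-likelihood terms. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are assumptions made for this example.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) pair.

    All arguments are log-probabilities of a sample under either the
    model being trained (policy) or the frozen reference model.
    `_w` marks the preferred sample, `_l` the rejected one.
    These names are hypothetical, not taken from the paper.
    """
    # Implicit reward margin: how much more the policy favors the
    # preferred sample over the rejected one, relative to the reference.
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    z = beta * margin
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy already favors the preferred sample: the loss is lower.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

In this formulation the preference labels can come from any source; the abstract states that MoDiPO obtains them from AI feedback rather than from human annotators.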
Related papers
- SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization [82.83603957387442]
We focus on fine-tuning text-to-motion models to consistently favor high-quality, human-preferred motions.
In this work, we theoretically investigate the DPO under both online and offline settings.
We introduce Semi-online Preference Optimization (SoPo), a DPO-based method for training text-to-motion models.
arXiv Detail & Related papers (2024-12-06T14:50:38Z)
- GDPO: Learning to Directly Align Language Models with Diversity Using GFlowNets [19.485572131953937]
We propose a practical application of a diversity-seeking RL algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting.
Empirical results show GDPO can generate far more diverse responses than the baseline methods.
arXiv Detail & Related papers (2024-10-19T13:07:52Z)
- Diffusion-RPO: Aligning Diffusion Models through Relative Preference Optimization [68.69203905664524]
We introduce Diffusion-RPO, a new method designed to align diffusion-based T2I models with human preferences more effectively.
We have developed a new evaluation metric, style alignment, aimed at overcoming the challenges of high cost and low interpretability.
Our findings demonstrate that Diffusion-RPO outperforms established methods such as Supervised Fine-Tuning and Diffusion-DPO in tuning Stable Diffusion versions 1.5 and XL-1.0.
arXiv Detail & Related papers (2024-06-10T15:42:03Z)
- Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF [82.7679132059169]
Reinforcement learning from human feedback has emerged as a central tool for language model alignment.
We propose a new algorithm for online exploration in RLHF, Exploratory Preference Optimization (XPO).
XPO enjoys the strongest known provable guarantees and promising empirical performance.
arXiv Detail & Related papers (2024-05-31T17:39:06Z)
- Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Direct Preference Optimization With Unobserved Preference Heterogeneity [16.91835461818937]
This paper presents a new method to align generative models with varied human preferences.
We propose an Expectation-Maximization adaptation to DPO, generating a mixture of models based on latent preference types of the annotators.
Our algorithms leverage the simplicity of DPO while accommodating diverse preferences.
arXiv Detail & Related papers (2024-05-23T21:25:20Z)
- Aligning Large Language Models with Counterfactual DPO [1.8130068086063336]
This paper explores the utilization of counterfactual prompting to align the model's style without relying on human intervention.
We demonstrate that this method effectively instils desirable behaviour, mitigates undesirable ones, and encourages the model to disregard inappropriate instructions.
arXiv Detail & Related papers (2024-01-17T19:43:43Z)