SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization
- URL: http://arxiv.org/abs/2412.05095v1
- Date: Fri, 06 Dec 2024 14:50:38 GMT
- Title: SoPo: Text-to-Motion Generation Using Semi-Online Preference Optimization
- Authors: Xiaofeng Tan, Hongsong Wang, Xin Geng, Pan Zhou
- Abstract summary: We focus on fine-tuning text-to-motion models to consistently favor high-quality, human-preferred motions.
In this work, we theoretically investigate DPO under both online and offline settings.
We introduce Semi-online Preference Optimization (SoPo), a DPO-based method for training text-to-motion models.
- Score: 82.83603957387442
- Abstract: Text-to-motion generation is essential for advancing the creative industry but often presents challenges in producing consistent, realistic motions. To address this, we focus on fine-tuning text-to-motion models to consistently favor high-quality, human-preferred motions, a critical yet largely unexplored problem. In this work, we theoretically investigate DPO under both online and offline settings, and reveal their respective limitations: overfitting in offline DPO, and biased sampling in online DPO. Building on our theoretical insights, we introduce Semi-online Preference Optimization (SoPo), a DPO-based method for training text-to-motion models on "semi-online" data pairs, each consisting of an unpreferred motion drawn from the online distribution and a preferred motion taken from an offline dataset. This method leverages both online and offline DPO, allowing each to compensate for the other's limitations. Extensive experiments demonstrate that SoPo outperforms other preference alignment methods, with an MM-Dist improvement of 3.25% (vs., e.g., 0.76% for MoDiPO) on the MLD model and 2.91% (vs., e.g., 0.66% for MoDiPO) on the MDM model. Additionally, the MLD model fine-tuned with our SoPo surpasses the SoTA model in terms of R-precision and MM-Dist. Visualization results also show the efficacy of SoPo in preference alignment. Our project page is https://sopo-motion.github.io.
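To make the "semi-online" pairing concrete, the following is a minimal PyTorch-style sketch of one possible training step. Everything here is hypothetical: the abstract does not state how the unpreferred motion is selected, the policy API (sample, score, log_prob) is a placeholder, and actual SoPo training targets diffusion models (MLD, MDM), where exact motion log-likelihoods are typically replaced by per-step denoising surrogates (as in Diffusion-DPO).

```python
import torch
import torch.nn.functional as F

def sopo_step(policy, reference, text, preferred_motion, beta=0.1, num_candidates=4):
    """One semi-online DPO step: the preferred motion comes from an offline,
    human-annotated dataset, while the unpreferred motion is sampled online
    from the current policy (all model methods here are hypothetical)."""
    # Online side: draw candidate motions from the current policy and keep the
    # lowest-scoring one as the unpreferred sample. This selection rule is a
    # stand-in; the paper's actual choice of unpreferred motions may differ.
    with torch.no_grad():
        candidates = [policy.sample(text) for _ in range(num_candidates)]
        scores = torch.stack([policy.score(text, m) for m in candidates])
        unpreferred_motion = candidates[scores.argmin().item()]

    # Policy-vs-reference log-ratios for the offline preferred motion and the
    # online unpreferred motion (exact log-probs are a simplification for
    # diffusion-based motion models).
    ratio_w = policy.log_prob(preferred_motion, text) - reference.log_prob(preferred_motion, text)
    ratio_l = policy.log_prob(unpreferred_motion, text) - reference.log_prob(unpreferred_motion, text)

    # Standard DPO objective applied to the semi-online pair.
    return -F.logsigmoid(beta * (ratio_w - ratio_l))
```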
Related papers
- CHiP: Cross-modal Hierarchical Direct Preference Optimization for Multimodal LLMs [107.21334626890713]
Multimodal Large Language Models (MLLMs) still struggle with hallucinations despite their impressive capabilities.
We propose a Cross-modal Hierarchical Direct Preference Optimization (CHiP) to address these limitations.
We evaluate CHiP through both quantitative and qualitative analyses, with results across multiple benchmarks demonstrating its effectiveness in reducing hallucinations.
arXiv Detail & Related papers (2025-01-28T02:05:38Z) - Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC.
We increase the consistency and informativeness of the pairwise preference signals through targeted modifications.
We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - MoDiPO: text-to-motion alignment via AI-feedback-driven Direct Preference Optimization [6.147750347011554]
We propose MoDiPO (Motion Diffusion DPO) to align text-to-motion models.
We streamline the laborious and expensive process of gathering human preferences needed in DPO by leveraging AI feedback.
We demonstrate, both qualitatively and quantitatively, that our proposed method yields significantly more realistic motions.
arXiv Detail & Related papers (2024-05-06T19:19:20Z) - Weak-to-Strong Extrapolation Expedites Alignment [135.12769233630362]
We propose a method called ExPO to boost models' alignment with human preferences.
We demonstrate that ExPO consistently improves off-the-shelf DPO/RLHF models.
We shed light on the essence of ExPO: amplifying the reward signal learned during alignment training (a hedged sketch follows this list).
arXiv Detail & Related papers (2024-04-25T17:39:50Z) - Human Alignment of Large Language Models through Online Preference Optimisation [50.52545798589968]
We show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD).
This equivalence can be proven when we consider the online version of IPO, that is, when both generations are sampled by the online policy and annotated by a trained preference model.
We introduce the IPO-MD algorithm, which generates data with a mixture policy (between the online and reference policy), similarly to the general Nash-MD algorithm.
arXiv Detail & Related papers (2024-03-13T15:47:26Z) - Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive [15.066029556877721]
We show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples.
We design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode.
Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks (a hedged sketch of the DPOP loss also follows this list).
arXiv Detail & Related papers (2024-02-20T18:42:34Z)
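For the ExPO entry above, here is a minimal sketch assuming, per the paper's title ("Weak-to-Strong Extrapolation"), that the method extrapolates model weights beyond an aligned checkpoint along the direction from a weaker (e.g., SFT) checkpoint; the function name, parameter handling, and default alpha are illustrative only.

```python
import torch

def expo_extrapolate(weak_state_dict, aligned_state_dict, alpha=0.5):
    """Weak-to-strong extrapolation: move past the aligned (DPO/RLHF) checkpoint
    along the direction from the weaker model to the aligned model, which can be
    read as amplifying the reward signal learned during alignment training."""
    return {
        # theta_expo = theta_aligned + alpha * (theta_aligned - theta_weak)
        name: aligned + alpha * (aligned - weak_state_dict[name])
        for name, aligned in aligned_state_dict.items()
    }
```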
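For the DPO-Positive entry above, here is a minimal sketch of the described fix, under the assumption that the penalty is a clamped gap between the reference and policy log-likelihoods of the preferred example; the function name, the placement of the lambda term, and the default hyperparameters are paraphrased for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.3, lam=50.0):
    """logp_* are the policy's log-probs of the preferred (w) and unpreferred (l)
    responses; ref_logp_* are the same quantities under the frozen reference model."""
    # Standard DPO margin between the preferred and unpreferred responses.
    dpo_margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    # Penalty that is positive only when the policy assigns the preferred response
    # less probability than the reference does -- the failure mode DPOP targets.
    penalty = torch.clamp(ref_logp_w - logp_w, min=0.0)
    return -F.logsigmoid(beta * (dpo_margin - lam * penalty))
```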