Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
- URL: http://arxiv.org/abs/2502.16825v1
- Date: Mon, 24 Feb 2025 04:22:57 GMT
- Title: Finding the Sweet Spot: Preference Data Construction for Scaling Preference Optimization
- Authors: Yao Xiao, Hai Ye, Linyao Chen, Hwee Tou Ng, Lidong Bing, Xiaoli Li, Roy Ka-wei Lee
- Abstract summary: We aim to scale up the number of on-policy samples via repeated random sampling to improve alignment performance. Our experiments reveal that this strategy leads to a decline in performance as the sample size increases. We introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
- Score: 66.67988187816185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Iterative data generation and model retraining are widely used to align large language models (LLMs). It typically involves a policy model to generate on-policy responses and a reward model to guide training data selection. Direct Preference Optimization (DPO) further enhances this process by constructing preference pairs of chosen and rejected responses. In this work, we aim to \emph{scale up} the number of on-policy samples via repeated random sampling to improve alignment performance. Conventional practice selects the sample with the highest reward as chosen and the lowest as rejected for DPO. However, our experiments reveal that this strategy leads to a \emph{decline} in performance as the sample size increases. To address this, we investigate preference data construction through the lens of the underlying normal distribution of sample rewards. We categorize the reward space into seven representative points and systematically explore all 21 ($C_7^2$) pairwise combinations. Through evaluations on four models using AlpacaEval 2, we find that selecting the rejected response at reward position $\mu - 2\sigma$, rather than at the minimum reward, is crucial for optimal performance. Finally, we introduce a scalable preference data construction strategy that consistently enhances model performance as the sample scale increases.
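As a rough illustration of this construction strategy, the sketch below picks the rejected response at the reward position closest to $\mu - 2\sigma$ of the empirical reward distribution rather than at the minimum reward. The chosen side defaults to the highest-reward sample (the conventional practice; the abstract does not state the paper's final choice here), and all function and variable names are hypothetical.

```python
import numpy as np

def build_preference_pair(responses, rewards, rejected_pos=-2.0):
    """Select a (chosen, rejected) DPO pair from repeated on-policy samples.

    Chosen:   the highest-reward sample (assumption; conventional practice).
    Rejected: the sample whose reward is closest to mu + rejected_pos * sigma,
              i.e. mu - 2*sigma by default, instead of the minimum-reward sample.
    """
    rewards = np.asarray(rewards, dtype=float)
    mu, sigma = rewards.mean(), rewards.std()
    chosen_idx = int(np.argmax(rewards))
    target = mu + rejected_pos * sigma
    rejected_idx = int(np.argmin(np.abs(rewards - target)))
    return responses[chosen_idx], responses[rejected_idx]

# Example: 32 on-policy samples for one prompt, scored by a reward model.
rng = np.random.default_rng(0)
responses = [f"response_{i}" for i in range(32)]
rewards = rng.normal(loc=0.0, scale=1.0, size=32)  # stand-in reward scores
chosen, rejected = build_preference_pair(responses, rewards)
```

One intuition consistent with the abstract's finding: as the number of samples grows, the minimum reward drifts further into the distribution's tail, whereas the $\mu - 2\sigma$ anchor keeps the rejected response at a stable position.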
Related papers
- Preference Optimization via Contrastive Divergence: Your Reward Model is Secretly an NLL Estimator [32.05337749590184]
We develop a novel PO framework that provides theoretical guidance to effectively sample dispreferred completions. We then select contrastive divergence (CD) as the sampling strategy and propose a novel MC-PO algorithm. MC-PO outperforms existing SOTA baselines, and its online variant OnMC-PO leads to further improvement.
arXiv Detail & Related papers (2025-02-06T23:45:08Z) - Calibrated Multi-Preference Optimization for Aligning Diffusion Models [92.90660301195396]
Calibrated Preference Optimization (CaPO) is a novel method to align text-to-image (T2I) diffusion models. CaPO incorporates the general preference from multiple reward models without human-annotated data. Experimental results show that CaPO consistently outperforms prior methods.
arXiv Detail & Related papers (2025-02-04T18:59:23Z) - SePPO: Semi-Policy Preference Optimization for Diffusion Alignment [67.8738082040299]
We propose a preference optimization method that aligns diffusion models (DMs) with preferences without relying on reward models or paired human-annotated data.
We validate SePPO across both text-to-image and text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z) - Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment [51.14207112118503]
We introduce preference embedding, an approach that embeds responses into a latent space to capture preferences efficiently. We also propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Our method may enhance the alignment of foundation models with nuanced human values.
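A minimal sketch of one way such a preference score could be computed from response embeddings is shown below, assuming a skew-symmetric bilinear form so that the score is antisymmetric by construction; the functional form, shapes, and names are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def preference_score(v1: torch.Tensor, v2: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Score that response 1 is preferred over response 2.

    v1, v2: preference embeddings of the two responses, shape [d] (assumed).
    W:      an unconstrained learned matrix; W - W.T is skew-symmetric, so
            score(y1, y2) = -score(y2, y1) holds by construction.
    """
    R = W - W.T  # enforce skew-symmetry
    return v1 @ R @ v2

d = 8
W = torch.randn(d, d)
v_a, v_b = torch.randn(d), torch.randn(d)
# Antisymmetry check: preferring a over b is the negative of preferring b over a.
assert torch.allclose(preference_score(v_a, v_b, W), -preference_score(v_b, v_a, W), atol=1e-5)
```

Because the score is antisymmetric rather than derived from a single scalar reward per response, such a model can in principle represent intransitive (cyclic) preferences that Bradley-Terry-style reward models cannot.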
arXiv Detail & Related papers (2024-10-03T04:22:55Z) - The Crucial Role of Samplers in Online Direct Preference Optimization [36.68862142959827]
We provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting.
Our proposed online sampler achieves $\textbf{quadratic}$ convergence, while uniform sampling achieves $\textbf{linear}$ convergence.
For example, it outperforms vanilla DPO by over $7.4\%$ on the Safe-RLHF dataset.
arXiv Detail & Related papers (2024-09-29T07:53:50Z) - Preference-Guided Reflective Sampling for Aligning Language Models [27.69410513313001]
Iterative data generation and model re-training can effectively align large language models (LLMs) to human preferences.
In this work, we propose Preference-Guided Reflective Sampling (PRS).
Unlike random sampling, PRS employs a tree-based generation framework to enable more efficient sampling.
PRS shows strong performance when applied in iterative offline RL training.
arXiv Detail & Related papers (2024-08-22T07:18:46Z) - SelectAugment: Hierarchical Deterministic Sample Selection for Data Augmentation [72.58308581812149]
We propose an effective approach, dubbed SelectAugment, to select samples to be augmented in a deterministic and online manner.
Specifically, in each batch, we first determine the augmentation ratio, and then decide whether to augment each training sample under this ratio.
In this way, the negative effects of the randomness in selecting samples to augment can be effectively alleviated and the effectiveness of data augmentation (DA) is improved.
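A minimal sketch of this per-batch, deterministic selection step is given below, assuming a simple loss-based scoring rule; the real method learns the ratio and the per-sample decisions hierarchically, so the scoring rule and all names here are illustrative assumptions.

```python
import torch

def select_for_augmentation(per_sample_loss: torch.Tensor, ratio: float) -> torch.Tensor:
    """Return a boolean mask of the samples in the batch to augment.

    Deterministic stand-in: augment the `ratio` fraction of samples with the
    highest current loss, instead of sampling them at random.
    """
    batch_size = per_sample_loss.numel()
    k = max(1, int(round(ratio * batch_size)))
    mask = torch.zeros(batch_size, dtype=torch.bool)
    mask[torch.topk(per_sample_loss, k).indices] = True
    return mask

# Example: a batch of 8 samples; augment the top 25% by loss.
losses = torch.tensor([0.2, 1.3, 0.7, 0.1, 2.0, 0.5, 0.9, 0.4])
augment_mask = select_for_augmentation(losses, ratio=0.25)
```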
arXiv Detail & Related papers (2021-12-06T08:38:38Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.