Semi-Supervised Preference Optimization with Limited Feedback
- URL: http://arxiv.org/abs/2511.00040v1
- Date: Tue, 28 Oct 2025 01:33:43 GMT
- Title: Semi-Supervised Preference Optimization with Limited Feedback
- Authors: Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song,
- Abstract summary: We study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously.<n>Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability.<n>By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs.
- Score: 17.112054023380647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Llama3-8B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
Related papers
- When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets [29.94723846950853]
We present the first comprehensive, data-centric analysis of popular open-source DPO corpora.<n>We leverage the Magpie framework to annotate each sample for task category, input quality, and preference reward.<n>This enables a scalable, fine-grained inspection of preference quality across datasets, revealing structural and qualitative discrepancies in reward margins.
arXiv Detail & Related papers (2025-11-14T06:12:16Z) - Beyond Single: A Data Selection Principle for LLM Alignment via Fine-Grained Preference Signals [46.58760908162995]
We propose a novel, theoretically-grounded data selection principle for large language models.<n>We prove the optimality of this strategy by analyzing the loss bounds of the Direct Preference Optimization objective.<n>Our strategy achieves over 10% relative improvement against both the standard holistic preference and a stronger oracle.
arXiv Detail & Related papers (2025-08-11T05:43:02Z) - Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data [30.407699113696076]
Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning.<n>Existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity.<n>We propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant.
arXiv Detail & Related papers (2025-08-02T17:50:35Z) - A Systematic Examination of Preference Learning through the Lens of Instruction-Following [83.71180850955679]
We use a novel synthetic data generation pipeline to generate 48,000 instruction unique-following prompts.<n>With our synthetic prompts, we use two preference dataset curation methods - rejection sampling (RS) and Monte Carlo Tree Search (MCTS)<n>Experiments reveal that shared prefixes in preference pairs, as generated by MCTS, provide marginal but consistent improvements.<n>High-contrast preference pairs generally outperform low-contrast pairs; however, combining both often yields the best performance.
arXiv Detail & Related papers (2024-12-18T15:38:39Z) - Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.<n>We show that our approach consistently boosts DPO by a considerable margin.<n>Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights [73.9088920210495]
We propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward.<n>TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks.
arXiv Detail & Related papers (2024-10-06T04:03:00Z) - Self-supervised Preference Optimization: Enhance Your Language Model with Preference Degree Awareness [27.43137305486112]
We propose a novel Self-supervised Preference Optimization (SPO) framework, which constructs a self-supervised preference degree loss combined with the alignment loss.
The results demonstrate that SPO can be seamlessly integrated with existing preference optimization methods to achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-09-26T12:37:26Z) - Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences.<n>Our key idea is leveraging the human prior knowledge within the small (seed) data.<n>We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z) - SURF: Semi-supervised Reward Learning with Data Augmentation for
Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.