Related papers: Self-Boosting Large Language Models with Synthetic Preference Data

Self-Boosting Large Language Models with Synthetic Preference Data

URL: http://arxiv.org/abs/2410.06961v1
Date: Wed, 9 Oct 2024 14:57:31 GMT
Title: Self-Boosting Large Language Models with Synthetic Preference Data
Authors: Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei,
Abstract summary: We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities. SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.
Score: 97.94185115047999
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.

Related papers

Aligning Large Language Models via Fully Self-Synthetic Data [20.05693955243206]
Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets.<n>In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment.<n>Experiments demonstrate that SAO effectively enhances the model's chat capabilities on standard benchmarks like AlpacaEval2.0.
arXiv Detail & Related papers (2025-10-08T05:07:45Z)
Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning [14.254037571895404]
Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning.<n>This paper introduces Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality.<n>We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation.
arXiv Detail & Related papers (2025-08-03T01:56:03Z)
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning [80.27561080938747]
We propose a systematic framework, CANOE, to improve the faithfulness of large language models (LLMs) in both short-form and long-form generation tasks without human annotations.<n>Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data.<n> Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs.
arXiv Detail & Related papers (2025-05-22T10:10:07Z)
Improving Model Alignment Through Collective Intelligence of Open-Source LLMS [34.23134719050941]
We introduce MoAA, that leverages the collective strengths of various language models to provide high-quality data for model alignment.<n> evaluation results show that our approach can improve win rate of LLaMA-3.1-8B-Instruct from 19.5 to 48.3 on Arena-Hard and from 22.33 to 57.23 on AlpacaEval2.
arXiv Detail & Related papers (2025-05-05T22:40:23Z)
FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users [111.56469697145519]
We propose Few-Shot Preference Optimization, which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. We generate over 1M synthetic personalized preferences using publicly available LLMs. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study.
arXiv Detail & Related papers (2025-02-26T17:08:46Z)
Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data [51.62162460809116]
We introduce Dynamic Noise Preference Optimization (DNPO) to ensure consistent improvements across iterations. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6%. DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations.
arXiv Detail & Related papers (2025-02-08T01:20:09Z)
Multimodal Preference Data Synthetic Alignment with Reward Model [23.978820500281213]
We propose a new framework in generating synthetic data using a reward model as a proxy of human preference for effective multimodal alignment with DPO training. Experiment results indicate that integrating selected synthetic data, such as from generative and rewards models can effectively reduce reliance on human-annotated data.
arXiv Detail & Related papers (2024-12-23T09:29:40Z)
Advancing Large Language Model Attribution through Self-Improving [32.77250400438304]
We present START, a framework for improving the attribution capability of large language models (LLMs) START iteratively utilizes fine-grained preference supervision signals constructed from its sampled responses to encourage robust, comprehensive, and attributable generation. Experiments on three open-domain question-answering datasets, covering long-form QA and multi-step reasoning, demonstrate significant performance gains of 25.13% on average.
arXiv Detail & Related papers (2024-10-17T07:55:33Z)
Boosting Reward Model with Preference-Conditional Multi-Aspect Synthetic Data Generation [62.9933120822879]
RMBoost is a novel synthetic preference data generation paradigm. It reduces labeling noise since preference pairs are constructed intentionally. It significantly boosts the performance of four distinct reward models.
arXiv Detail & Related papers (2024-07-22T19:21:55Z)
UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback [21.858896845159208]
Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset.
arXiv Detail & Related papers (2024-06-11T21:53:46Z)
Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models. The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models. Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts [95.09994361995389]
Relative Preference Optimization (RPO) is designed to discern between more and less preferred responses derived from both identical and related prompts. RPO has demonstrated a superior ability to align large language models with user preferences and to improve their adaptability during the training process.
arXiv Detail & Related papers (2024-02-12T22:47:57Z)
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN) At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself. This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
Aligning Large Language Models through Synthetic Feedback [43.84431341195111]
We propose a novel alignment learning framework with synthetic feedback not dependent on extensive human annotations. In human evaluation, our model is preferred to Alpaca and Dolly-v2, 55.0% and 58.5% of the time, respectively.
arXiv Detail & Related papers (2023-05-23T06:41:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.