Aligner: Efficient Alignment by Learning to Correct
- URL: http://arxiv.org/abs/2402.02416v5
- Date: Sat, 02 Nov 2024 10:01:38 GMT
- Title: Aligner: Efficient Alignment by Learning to Correct
- Authors: Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, Tianyi Qiu, Yaodong Yang
- Abstract summary: We introduce Aligner, a model-agnostic, plug-and-play module that learns the correctional residuals between preferred and dispreferred answers.
It can be applied to various open-source and API-based models with only one-off training, making it suitable for rapid iteration.
Our experiments demonstrate performance improvements by deploying the same Aligner model across 11 different language models.
- Score: 10.056049435141645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the rapid development of large language models (LLMs) and ever-evolving practical requirements, finding an efficient and effective alignment method has never been more critical. However, the tension between the complexity of current alignment methods and the need for rapid iteration in deployment scenarios necessitates the development of a model-agnostic alignment approach that can operate under these constraints. In this paper, we introduce Aligner, a novel and simple alignment paradigm that learns the correctional residuals between preferred and dispreferred answers using a small model. Designed as a model-agnostic, plug-and-play module, Aligner can be directly applied to various open-source and API-based models with only one-off training, making it suitable for rapid iteration. Notably, Aligner can be applied to any powerful, large-scale upstream model. Moreover, it can even iteratively bootstrap the upstream models using corrected responses as synthetic human preference data, breaking through their performance ceiling. Our experiments demonstrate performance improvements by deploying the same Aligner model across 11 different LLMs, evaluated on the 3H dimensions (helpfulness, harmlessness, and honesty). Specifically, Aligner-7B has achieved an average improvement of 68.9% in helpfulness and 23.8% in harmlessness across the tested LLMs while also effectively reducing hallucination. On the AlpacaEval leaderboard, stacking Aligner-2B on GPT-4 Turbo improved its LC Win Rate from 55.0% to 58.3%, surpassing GPT-4 Omni's 57.5% Win Rate (community report).
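At inference time, the plug-and-play usage described in the abstract amounts to stacking a small correction model on top of any upstream generator: the Aligner conditions on the user query together with the upstream model's draft answer and emits a corrected answer. The sketch below illustrates that flow with the Hugging Face transformers API; the model names and the correction prompt template are illustrative assumptions, not the paper's released artifacts.

```python
# Minimal sketch of stacking a correction module on an upstream model.
# The model names and the correction prompt are illustrative placeholders,
# not the exact artifacts or template released with the paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

def load(name):
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    return tok, model

def generate(tok, model, prompt, max_new_tokens=512):
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# 1) Any upstream model (open-source or API-based) produces a draft answer.
up_tok, up_model = load("upstream-llm-7b")            # hypothetical name
question = "How should I dispose of expired medication?"
draft = generate(up_tok, up_model, question)

# 2) The small aligner conditions on (question, draft) and emits a corrected
#    answer, i.e. it models the residual from a dispreferred response to a
#    preferred one instead of re-answering from scratch.
al_tok, al_model = load("aligner-2b")                 # hypothetical name
correction_prompt = (
    f"Question: {question}\n"
    f"Original answer: {draft}\n"
    "Rewrite the answer to be more helpful, harmless, and honest:"
)
print(generate(al_tok, al_model, correction_prompt))
```

Because the correction model only sees the query and the draft answer, the same Aligner checkpoint can sit in front of different upstream models without retraining, which is what the one-off-training claim refers to.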
Related papers
- Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models [1.96238419451815]
Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data.
We introduce a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data.
This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%), Gemma2-9B reaches 82%, matching LLaMA-3.1-70B, and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o.
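A hedged sketch of the iterative loop described here, with generation, answer checking, and fine-tuning passed in as hypothetical callables rather than the paper's actual pipeline:

```python
# Sketch of a "generate, prune by ground truth, fine-tune" self-improvement loop.
# `sample_solutions`, `extract_answer`, and `finetune` are hypothetical callables
# standing in for the real generation, answer-checking, and training code.
def think_prune_train(model, problems, sample_solutions, extract_answer, finetune,
                      rounds=3, samples_per_problem=8):
    for _ in range(rounds):
        kept = []
        for prob in problems:  # each prob: {"question": str, "answer": str}
            for trace in sample_solutions(model, prob["question"], n=samples_per_problem):
                # Ground-truth pruning: keep only traces whose final answer
                # matches the reference answer.
                if extract_answer(trace) == prob["answer"]:
                    kept.append({"prompt": prob["question"], "completion": trace})
        # Fine-tune on the model's own verified traces, then iterate.
        model = finetune(model, kept)
    return model
```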
arXiv Detail & Related papers (2025-04-25T06:48:55Z)
- DiffPO: Diffusion-styled Preference Optimization for Efficient Inference-Time Alignment of Large Language Models [50.32663816994459]
Diffusion-styled Preference Optimization (DiffPO) provides an efficient and policy-agnostic solution for aligning LLMs with humans.
DiffPO avoids the latency associated with token-level generation.
Experiments on AlpacaEval 2, MT-bench, and HH-RLHF demonstrate that DiffPO achieves superior alignment performance across various settings.
arXiv Detail & Related papers (2025-03-06T09:21:54Z)
- IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models [11.075423190298686]
Large language models (LLMs) are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity.
In this paper, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair.
We show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance.
arXiv Detail & Related papers (2025-02-10T22:07:02Z)
- ARIES: Stimulating Self-Refinement of Large Language Models by Iterative Preference Optimization [34.77238246296517]
A truly intelligent Large Language Model (LLM) should be capable of correcting errors in its responses through external interactions.
We introduce a novel post-training and inference framework, called ARIES: Adaptive Refinement and Iterative Enhancement Structure.
ARIES iteratively performs preference training and self-refinement-based data collection.
arXiv Detail & Related papers (2025-02-08T15:21:55Z)
- Stream Aligner: Efficient Sentence-Level Alignment via Distribution Induction [6.624814871290537]
Stream Aligner combines efficiency with enhanced performance in various tasks throughout the generation process.
Compared to Aligner, our experiments demonstrate that Stream Aligner reduces reliance on the capabilities of additional models, enhances the reasoning abilities of LLMs, and decreases latency during user interaction.
arXiv Detail & Related papers (2025-01-09T16:02:51Z)
- Evolving Alignment via Asymmetric Self-Play [52.3079697845254]
We introduce a general open-ended RLHF framework that casts alignment as an asymmetric game between two players.
This framework of Evolving Alignment via Asymmetric Self-Play (eva) results in a simple and efficient approach that can utilize any existing RLHF algorithm for scalable alignment.
arXiv Detail & Related papers (2024-10-31T08:15:32Z)
- Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model [28.569089876442682]
This work is inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor.
We propose Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model.
arXiv Detail & Related papers (2024-10-24T11:06:29Z)
- Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization [64.34767799614328]
Current self-rewarding approaches rely heavily on the discriminator's judgment capabilities.
We propose a novel, only-prompting self-rewarding online algorithm that generates preference datasets without relying on judgment capabilities.
arXiv Detail & Related papers (2024-09-26T04:41:08Z)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO.
Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically.
Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
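A hedged sketch of what step-level preference optimization can look like: the standard DPO objective applied to log-probabilities summed over a single reasoning step that continues a shared prefix of earlier correct steps, rather than over a whole answer. Tensor shapes and the beta value here are illustrative assumptions.

```python
import torch.nn.functional as F

def step_dpo_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective over per-step log-probs (each a tensor of shape [batch]),
    where "chosen"/"rejected" are a correct vs. erroneous next reasoning step."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # Push the policy to prefer the correct step over the erroneous one.
    return -F.logsigmoid(logits).mean()
```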
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
- Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios.
We implement UAL in a simple fashion: adaptively setting the label-smoothing value during training according to the uncertainty of individual samples.
Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
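A minimal sketch of the stated mechanism, assuming a per-sample uncertainty estimate normalized to [0, 1]; the linear mapping from uncertainty to smoothing strength is an illustrative choice, not necessarily the paper's exact schedule.

```python
import torch.nn.functional as F

def uncertainty_smoothed_ce(logits, targets, uncertainty, max_eps=0.2):
    """logits: [batch, vocab]; targets: [batch]; uncertainty: [batch] in [0, 1]."""
    eps = max_eps * uncertainty                 # more uncertain -> stronger smoothing
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    smooth = -log_probs.mean(dim=-1)            # uniform-distribution smoothing term
    return ((1.0 - eps) * nll + eps * smooth).mean()
```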
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
- RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [94.03511733306296]
We introduce RLAIF-V, a framework that aligns MLLMs in a fully open-source paradigm for super GPT-4V trustworthiness.
RLAIF-V maximally exploits open-source feedback from two perspectives: high-quality feedback data and an online feedback learning algorithm.
Experiments show that RLAIF-V substantially enhances the trustworthiness of models without sacrificing performance on other tasks.
arXiv Detail & Related papers (2024-05-27T14:37:01Z)
- Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
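A hedged sketch of one iteration of this style of self-play update: responses sampled from the current policy receive a soft win-probability estimate against the policy's own samples, and the policy's log-ratio relative to the previous iterate is regressed toward that centered, scaled win rate. The exact estimator and constants used in the paper may differ.

```python
def sppo_style_loss(policy_logp, prev_policy_logp, win_prob, eta=1.0):
    """policy_logp, prev_policy_logp: summed log-probs of sampled responses [batch];
    win_prob: estimated probability that each response beats the current policy."""
    log_ratio = policy_logp - prev_policy_logp
    target = eta * (win_prob - 0.5)   # centered soft win rate
    return ((log_ratio - target) ** 2).mean()
```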
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
- Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences [21.5605000515622]
This paper studies post-training large language models (LLMs) using preference feedback from an oracle to help a model iteratively improve over itself.
We introduce Direct Nash Optimization (DNO), a provable and efficient algorithm that marries the simplicity and stability of contrastive learning with theoretical generality from optimizing general preferences.
In our experiments, a resulting 7B parameter Orca-2.5 model achieves a state-of-the-art win rate of 33% against GPT-4-Turbo on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (from 7% to 33%) over the initializing model.
arXiv Detail & Related papers (2024-04-04T17:56:41Z)
- AlpaGasus: Training A Better Alpaca with Fewer Data [93.6949102689243]
We propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data.
We introduce AlpaGasus, which is fine-tuned on only 9k high-quality examples filtered from the 52k Alpaca data.
AlpaGasus significantly outperforms the original Alpaca on multiple test sets and the controlled human evaluation.
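A hedged sketch of this kind of score-and-filter data selection: an external judge assigns each (instruction, response) pair a quality score, and only pairs above a threshold are kept for fine-tuning. The judge_score callable and the threshold are stand-ins for the paper's actual grading setup.

```python
def filter_instruction_data(dataset, judge_score, threshold=4.5):
    """dataset: iterable of {"instruction": str, "output": str} dicts;
    judge_score: callable returning a quality rating, e.g. on a 1-5 scale."""
    kept = []
    for example in dataset:
        if judge_score(example["instruction"], example["output"]) >= threshold:
            kept.append(example)
    return kept
```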
arXiv Detail & Related papers (2023-07-17T17:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences arising from its use.