Related papers: Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation

Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation

URL: http://arxiv.org/abs/2509.05605v1
Date: Sat, 06 Sep 2025 05:38:47 GMT
Title: Icon$^{2}$: Aligning Large Language Models Using Self-Synthetic Preference Data via Inherent Regulation
Authors: Qiyuan Chen, Hongsen Huang, Qian Shao, Jiahe Chen, Jintai Chen, Hongxia Xu, Renjie Hua, Ren Chuan, Jian Wu,
Abstract summary: Large Language Models (LLMs) require high quality preference datasets to align with human preferences.<n>In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs' representation space for efficient and tailored preference dataset construction.
Score: 14.249938992666202
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) require high quality preference datasets to align with human preferences. However, conventional methods for constructing such datasets face significant challenges: reliance on pre-collected instructions often leads to distribution mismatches with target models, while the need for sampling multiple stochastic responses introduces substantial computational overhead. In this work, we explore a paradigm shift by leveraging inherent regulation of LLMs' representation space for efficient and tailored preference dataset construction, named Icon$^{2}$. Specifically, it first extracts layer-wise direction vectors to encode sophisticated human preferences and then uses these vectors to filter self-synthesized instructions based on their inherent consistency. During decoding, bidirectional inherent control is applied to steer token representations, enabling the precise generation of response pairs with clear alignment distinctions. Experimental results demonstrate significant improvements in both alignment and efficiency. Llama3-8B and Qwen2-7B achieve an average win rate improvement of 13.89% on AlpacaEval 2.0 and 13.45% on Arena-Hard, while reducing computational costs by up to 48.1%.

Related papers

Auto-Rubric: Learning to Extract Generalizable Criteria for Reward Modeling [37.237020102873]
Reward models are essential for aligning Large Language Models with human values, yet their development is hampered by costly preference datasets and poor interpretability.<n>We build a training-free framework that infers high-quality, query-specific rubrics using a validation-guided textbfPropose-Evaluate-Revise pipeline.<n>Using just 70 preference pairs (1.5% of the source data), our method also empowers smaller models like Qwen3-8B to outperform specialized, fully-trained counterparts.
arXiv Detail & Related papers (2025-10-20T09:01:37Z)
Direct Preference Optimization with Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences [14.686788596611246]
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values.<n>Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences.<n>We propose a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
arXiv Detail & Related papers (2025-10-17T15:00:40Z)
Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning [71.30276778807068]
We propose a unified framework that strategically coordinates sample pruning and token pruning.<n>Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data.
arXiv Detail & Related papers (2025-09-28T13:27:38Z)
Multi-Preference Lambda-weighted Listwise DPO for Small-Scale Model Alignment [5.276657230880984]
Large language models (LLMs) demonstrate strong generalization across a wide range of language tasks, but often generate outputs that misalign with human preferences.<n>Direct Optimization Preference (DPO) simplifies the process by treating alignment as a classification task over binary preference pairs.<n>We propose Multi-Preference Lambda-weighted Listwise DPO, which allows the model to learn from more detailed human feedback.<n>Our method consistently outperforms standard DPO on alignment while enabling efficient, controllable, and fine-grained adaptation suitable for real-world deployment.
arXiv Detail & Related papers (2025-06-24T16:47:17Z)
Anyprefer: An Agentic Framework for Preference Data Synthesis [62.3856754548222]
We propose Anyprefer, a framework designed to synthesize high-quality preference data for aligning the target model.<n> external tools are introduced to assist the judge model in accurately rewarding the target model's responses.<n>The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs.
arXiv Detail & Related papers (2025-04-27T15:21:59Z)
DONOD: Efficient and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning [22.704995231753397]
Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation.<n>We propose DONOD, a lightweight model-intrinsic data pruning method.<n>By filtering out 70% of the whole dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%.
arXiv Detail & Related papers (2025-04-21T02:25:03Z)
Shifting Perspectives: Steering Vectors for Robust Bias Mitigation in LLMs [8.91107152198979]
We present a novel approach to bias mitigation in large language models (LLMs) by applying steering vectors to modify model activations in forward passes.<n>We compute 8 steering vectors, each corresponding to a different social bias axis, on a training subset of the BBQ dataset and compare the effectiveness of these to 3 additional bias mitigation methods across 4 datasets.<n>When optimized on the BBQ dataset, our individually tuned steering vectors achieve average improvements of 12.8% on BBQ, 8.3% on CLEAR-Bias, and 1% on StereoSet.
arXiv Detail & Related papers (2025-03-07T12:25:29Z)
Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization [14.50339880957898]
We aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques. For preference data generation, we propose an iterative pairwise ranking mechanism that derives preference ranking of completions using pairwise comparison signals. For training regularization, we observe that preference optimization tends to achieve better convergence when the LLM predicted likelihood of preferred samples gets slightly reduced.
arXiv Detail & Related papers (2024-11-07T23:03:11Z)
Self-Steering Optimization: Autonomous Preference Optimization for Large Language Models [79.84205827056907]
We present Self-Steering Optimization ($SSO$), an algorithm that autonomously generates high-quality preference data.<n>$SSO$ employs a specialized optimization objective to build a data generator from the policy model itself, which is used to produce accurate and on-policy data.<n>Our evaluation shows that $SSO$ consistently outperforms baselines in human preference alignment and reward optimization.
arXiv Detail & Related papers (2024-10-22T16:04:03Z)
Anchored Preference Optimization and Contrastive Revisions: Addressing Underspecification in Alignment [57.03947082589616]
Large Language Models (LLMs) are often aligned using contrastive alignment objectives and preference pair datasets. We study this and find that preference data gives a better learning signal when the underlying responses are contrastive. We introduce Contrastive Learning from AI Revisions (CLAIR), a data-creation method which leads to more contrastive preference pairs. Our best model, trained on 32K CLAIR preferences with APO, improves Llama-3-8B-Instruct by 7.65%, closing the gap with GPT4-turbo by 45%.
arXiv Detail & Related papers (2024-08-12T16:24:51Z)
Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment [72.99676237703099]
We propose a new framework that boosts the alignment of large language models with human preferences.<n>Our key idea is leveraging the human prior knowledge within the small (seed) data.<n>We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values. We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO) Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z)
Direct Preference Optimization With Unobserved Preference Heterogeneity: The Necessity of Ternary Preferences [14.686788596611246]
Reinforcement Learning from Human Feedback (RLHF) has become central to aligning large language models with human values.<n>Recent alternatives such as Direct Preference Optimization (DPO) simplify this pipeline by directly optimizing on preferences.<n>We propose a theoretical and algorithmic framework for fairness and personalization for diverse users in generative model alignment.
arXiv Detail & Related papers (2024-05-23T21:25:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.