Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
- URL: http://arxiv.org/abs/2508.07750v1
- Date: Mon, 11 Aug 2025 08:28:47 GMT
- Title: Learning to Align, Aligning to Learn: A Unified Approach for Self-Optimized Alignment
- Authors: Haowen Wang, Yun Yue, Zhiling Ye, Shuowen Zhang, Lei Fan, Jiaxin Liang, Jiadi Jiang, Cheng Wei, Jingyuan Deng, Xudong Han, Ji Li, Chunxiao Guo, Peng Wei, Jian Wang, Jinjie Gu
- Abstract summary: We propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the strengths of SFT (supervised fine-tuning) and RL (reinforcement learning). Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
- Score: 24.296667264939515
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Alignment methodologies have emerged as a critical pathway for enhancing language model alignment capabilities. While SFT (supervised fine-tuning) accelerates convergence through direct token-level loss intervention, its efficacy is constrained by its reliance on offline policy trajectories. In contrast, RL (reinforcement learning) facilitates exploratory policy optimization but suffers from low sample efficiency and a stringent dependency on high-quality base models. To address these dual challenges, we propose GRAO (Group Relative Alignment Optimization), a unified framework that synergizes the respective strengths of SFT and RL through three key innovations: 1) a multi-sample generation strategy enabling comparative quality assessment via reward feedback; 2) a novel Group Direct Alignment Loss formulation leveraging intra-group relative advantage weighting; 3) reference-aware parameter updates guided by pairwise preference dynamics. Our theoretical analysis establishes GRAO's convergence guarantees and sample efficiency advantages over conventional approaches. Comprehensive evaluations across complex human alignment tasks demonstrate GRAO's superior performance, achieving 57.70%, 17.65%, 7.95%, and 5.18% relative improvements over SFT, DPO, PPO, and GRPO baselines, respectively. This work provides both a theoretically grounded alignment framework and empirical evidence for efficient capability evolution in language models.
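The abstract describes the method only at this level of detail; the following is a minimal PyTorch-style sketch of what a group-relative, advantage-weighted alignment objective of this kind could look like. The function names, the advantage normalization, and the best-vs-worst pairwise term are illustrative assumptions, not the authors' implementation.

```python
# A minimal, hypothetical sketch in the spirit of GRAO, based only on the abstract:
# multi-sample generation, intra-group relative advantage weighting of a direct
# (token-level) loss, and a reference-aware pairwise preference term.
import torch
import torch.nn.functional as F

def seq_logprob(logits, tokens):
    """Sum of per-token log-probs of `tokens` under `logits`; shapes [G,T,V], [G,T] -> [G]."""
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum(-1)

def grao_style_loss(policy_logits, ref_logits, tokens, rewards, beta=0.1, lam=1.0):
    # 1) Intra-group relative advantages from reward-model feedback.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)           # [G]

    # 2) Group direct alignment term: SFT-like log-likelihood on each sampled
    #    completion, weighted by its relative advantage within the group.
    logp = seq_logprob(policy_logits, tokens)                            # [G]
    direct_term = -(adv.detach() * logp).mean()

    # 3) Reference-aware pairwise term on the best vs. worst sample in the
    #    group (a DPO-style log-ratio against a frozen reference policy).
    with torch.no_grad():
        ref_logp = seq_logprob(ref_logits, tokens)
    c, r = rewards.argmax(), rewards.argmin()
    margin = (logp[c] - ref_logp[c]) - (logp[r] - ref_logp[r])
    pairwise_term = -F.logsigmoid(beta * margin)

    return direct_term + lam * pairwise_term
```

In use, `tokens` would be G completions sampled from the current policy for one prompt and `rewards` their reward-model scores; detaching the advantages keeps the reward weighting out of the gradient path.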
Related papers
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- GDEPO: Group Dual-dynamic and Equal-right-advantage Policy Optimization with Enhanced Training Data Utilization for Sample-Constrained Reinforcement Learning [14.111530312590531]
Automated Theorem Proving (ATP) represents a fundamental challenge in Artificial Intelligence (AI). We propose Group Dual-dynamic and Equal-right-advantage Policy Optimization (GDEPO). GDEPO incorporates three core mechanisms: 1) dynamic additional sampling, which resamples invalid batches until a valid proof is discovered; 2) equal-right advantage, decoupling the sign of the advantage function from its magnitude (modulated by auxiliary rewards) to ensure stable and correct policy updates; and 3) dynamic additional iterations, applying extra gradient steps to initially failed but eventually successful samples to accelerate learning on challenging cases. A toy sketch of the equal-right-advantage idea follows this entry.
arXiv Detail & Related papers (2026-01-11T07:34:41Z)
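As a rough illustration only: the summary above says the sign of the advantage is decoupled from its magnitude, with the magnitude modulated by auxiliary rewards. A minimal sketch under those assumptions (the normalization and names are hypothetical, not taken from the paper):

```python
# Hypothetical illustration of an "equal-right advantage": the update direction
# comes only from outcome correctness, while auxiliary rewards modulate the
# magnitude. Not the authors' implementation; names and scaling are assumptions.
import torch

def equal_right_advantage(success: torch.Tensor, aux_reward: torch.Tensor) -> torch.Tensor:
    """
    success:    [G] 0/1 tensor, whether each sampled proof was valid.
    aux_reward: [G] auxiliary shaping rewards (e.g., proof length or step quality).
    """
    sign = success.float() * 2.0 - 1.0            # +1 for valid samples, -1 otherwise
    # Map auxiliary rewards to a positive magnitude in (0, 1) so they scale the
    # step size without ever flipping the update direction.
    magnitude = torch.sigmoid(aux_reward - aux_reward.mean())
    return sign * magnitude
```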
- DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness (see the sketch after this entry); and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z)
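Again as a hedged illustration of the masking idea only (the scoring function and threshold are assumptions, not details from the paper): pairs whose fine-grained scores are too close contribute no gradient.

```python
# Hypothetical sketch of masking low-distinctiveness sample pairs before a
# group-relative / pairwise preference update. Threshold and scores are assumed.
import torch

def distinctiveness_mask(scores: torch.Tensor, pairs: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """
    scores: [G] fine-grained quality scores for the sampled responses.
    pairs:  [P, 2] long tensor of (chosen, rejected) indices within the group.
    Returns a [P] boolean mask; pairs with an absolute score gap below tau are dropped.
    """
    gap = (scores[pairs[:, 0]] - scores[pairs[:, 1]]).abs()
    return gap >= tau

# Usage sketch: keep only distinctive pairs when forming the pairwise loss.
# kept = pairs[distinctiveness_mask(scores, pairs)]
```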
- CoT-Saliency: Unified Chain-of-Thought Reasoning for Heterogeneous Saliency Tasks [96.64597365827046]
We present the first unified framework that jointly handles three operationally heterogeneous saliency tasks. We introduce a Chain-of-Thought (CoT) reasoning process in a Vision-Language Model (VLM) to bridge task heterogeneity. We show our model matches or outperforms specialized SOTA methods and strong closed-source VLMs across all tasks.
arXiv Detail & Related papers (2025-11-01T04:37:01Z)
- CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z)
- TCPO: Thought-Centric Preference Optimization for Effective Embodied Decision-making [75.29820290660065]
This paper proposes Thought-Centric Preference Optimization (TCPO) for effective embodied decision-making. It emphasizes the alignment of the model's intermediate reasoning process, mitigating the problem of model degradation. Experiments in the ALFWorld environment demonstrate an average success rate of 26.67%, achieving a 6% improvement over RL4VLM.
arXiv Detail & Related papers (2025-09-10T11:16:21Z)
- AMFT: Aligning LLM Reasoners by Meta-Learning the Optimal Imitation-Exploration Balance [5.748208737701793]
Large Language Models (LLMs) are typically fine-tuned for reasoning tasks through a two-stage pipeline: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). Recent single-stage methods attempt to unify SFT and RL in a principled way, but lack a mechanism for dynamically balancing the two paradigms. We introduce AMFT, a novel single-stage algorithm that learns the optimal balance between SFT's implicit, path-level reward and RL's explicit, outcome-based reward.
arXiv Detail & Related papers (2025-08-09T11:40:54Z)
- COPO: Consistency-Aware Policy Optimization [17.328515578426227]
Reinforcement learning has significantly enhanced the reasoning capabilities of Large Language Models (LLMs) in complex problem-solving tasks. Recently, the introduction of DeepSeek R1 has inspired a surge of interest in leveraging rule-based rewards as a low-cost alternative for computing advantage functions and guiding policy optimization. We propose a consistency-aware policy optimization framework that introduces a structured global reward based on outcome consistency; a toy sketch of such a consistency reward follows this entry.
arXiv Detail & Related papers (2025-08-06T07:05:18Z)
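The summary does not specify how the structured global reward is constructed; the following is only a guess at one simple instantiation, rewarding rollouts whose final answer agrees with the group majority.

```python
# Hypothetical majority-agreement reward over a group of sampled rollouts.
# How COPO actually builds its consistency-based reward is not described above;
# this is an illustrative assumption only.
from collections import Counter
from typing import List

def consistency_reward(final_answers: List[str]) -> List[float]:
    """Reward 1.0 for rollouts whose final answer matches the group majority, else 0.0."""
    majority, _ = Counter(final_answers).most_common(1)[0]
    return [1.0 if ans == majority else 0.0 for ans in final_answers]

# Example: consistency_reward(["42", "42", "41"]) -> [1.0, 1.0, 0.0]
```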
- Implicit Reward as the Bridge: A Unified View of SFT and DPO Connections [65.36449542323277]
We present a unified theoretical framework bridging Supervised Fine-Tuning (SFT) and preference learning in Large Language Model (LLM) post-training. We propose a simple yet effective learning rate reduction approach that yields significant performance improvements.
arXiv Detail & Related papers (2025-06-15T05:42:29Z)
- TGRPO: Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization [12.061547251822326]
Trajectory-wise Group Relative Policy Optimization (TGRPO) is an online RL-based training framework for Vision-Language-Action (VLA) models. We show that TGRPO achieves an average success rate of 80.7%, which is 4.2% higher than that of Supervised Fine-Tuning (SFT) and outperforms other representative RL-based post-training methods.
arXiv Detail & Related papers (2025-06-10T04:27:49Z)
- PEO: Improving Bi-Factorial Preference Alignment with Post-Training Policy Extrapolation [5.347428263669927]
Post-Training Extrapolation Optimization (PEO) is a novel and efficient framework for bi-factorial alignment. PEO generates a family of optimal policies in a single training pass by leveraging a three-phase pipeline.
arXiv Detail & Related papers (2025-03-03T06:56:39Z)
- Gradient Correction in Federated Learning with Adaptive Optimization [19.93709245766609]
We propose FAdamGC, the first algorithm to integrate client-drift compensation into adaptive optimization. We show that FAdamGC consistently outperforms existing methods in total communication and computation cost across varying levels of data heterogeneity.
arXiv Detail & Related papers (2025-02-04T21:21:30Z)
- Bridging SFT and DPO for Diffusion Model Alignment with Self-Sampling Preference Optimization [67.8738082040299]
Self-Sampling Preference Optimization (SSPO) is a new alignment method for post-training reinforcement learning. SSPO eliminates the need for paired data and reward models while retaining the training stability of SFT. SSPO surpasses all previous approaches on the text-to-image benchmarks and demonstrates outstanding performance on the text-to-video benchmarks.
arXiv Detail & Related papers (2024-10-07T17:56:53Z)
- Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization [75.1240295759264]
We propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. We increase the consistency and informativeness of the pairwise preference signals through targeted modifications. We identify that DPO alone is insufficient to model these correlations and capture nuanced variations.
arXiv Detail & Related papers (2024-08-14T11:29:47Z)
- Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint [56.74058752955209]
This paper studies the alignment process of generative models with Reinforcement Learning from Human Feedback (RLHF). We first identify the primary challenge of existing popular methods like offline PPO and offline DPO as a lack of strategic exploration of the environment. We propose efficient algorithms with finite-sample theoretical guarantees.
arXiv Detail & Related papers (2023-12-18T18:58:42Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.