Towards a Unified View of Large Language Model Post-Training
- URL: http://arxiv.org/abs/2509.04419v1
- Date: Thu, 04 Sep 2025 17:40:33 GMT
- Title: Towards a Unified View of Large Language Model Post-Training
- Authors: Xingtai Lv, Yuxin Zuo, Youbang Sun, Hongyi Liu, Yuntian Wei, Zhekai Chen, Lixuan He, Xuekai Zhu, Kaiyan Zhang, Bingning Wang, Ning Ding, Bowen Zhou
- Abstract summary: Two major sources of training data exist for post-training modern language models. We show that approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT) are not in contradiction, but are instances of a single optimization process. We propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals.
- Score: 27.906878681963263
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Two major sources of training data exist for post-training modern language models: online (model-generated rollouts) data, and offline (human or other-model demonstrations) data. These two types of data are typically used by approaches like Reinforcement Learning (RL) and Supervised Fine-Tuning (SFT), respectively. In this paper, we show that these approaches are not in contradiction, but are instances of a single optimization process. We derive a Unified Policy Gradient Estimator, and present the calculations of a wide spectrum of post-training approaches as the gradient of a common objective under different data distribution assumptions and various bias-variance tradeoffs. The gradient estimator is constructed with four interchangeable parts: stabilization mask, reference policy denominator, advantage estimate, and likelihood gradient. Motivated by our theoretical findings, we propose Hybrid Post-Training (HPT), an algorithm that dynamically selects different training signals. HPT is designed to yield both effective exploitation of demonstrations and stable exploration without sacrificing learned reasoning patterns. We provide extensive experiments and ablation studies to verify the effectiveness of our unified theoretical framework and HPT. Across six mathematical reasoning benchmarks and two out-of-distribution suites, HPT consistently surpasses strong baselines across models of varying scales and families.
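To make the four-part decomposition concrete, below is a minimal PyTorch sketch of one way the estimator and an HPT-style signal switch could be instantiated. All names, tensor shapes, the centered-reward advantage, and the per-batch reward-threshold gate are illustrative assumptions drawn from the abstract, not the paper's exact recipe.

```python
# Minimal sketch (assumptions: PyTorch, per-batch gating, centered-reward
# advantage). The paper's actual gate, mask, and advantage may differ.
import torch

def unified_pg_loss(logp, ref_logp, advantage, stable_mask):
    """Surrogate loss whose gradient matches the four-part estimator:
    stabilization mask * (pi_theta / pi_ref) * advantage * grad log pi_theta.

    logp, ref_logp : (batch, seq) token log-probs under pi_theta / pi_ref
    advantage      : broadcastable advantage estimate (1.0 recovers SFT)
    stable_mask    : (batch, seq) 0/1 stabilization mask
    """
    ratio = torch.exp(logp - ref_logp).detach()  # reference-policy denominator
    return -(stable_mask * ratio * advantage * logp).sum() / stable_mask.sum()

def hpt_loss(logp, ref_logp, demo_logp, reward, threshold=0.5):
    """Hypothetical HPT gate: keep the RL signal on prompts the model already
    solves (high rollout reward); otherwise fall back to SFT on the demo."""
    if reward.mean() >= threshold:
        advantage = (reward - reward.mean()).unsqueeze(-1)  # centered reward
        return unified_pg_loss(logp, ref_logp, advantage, torch.ones_like(logp))
    # SFT as a special case: ratio = 1, advantage = 1, full mask -> NLL on demos
    return unified_pg_loss(demo_logp, demo_logp, torch.tensor(1.0),
                           torch.ones_like(demo_logp))
```

Note how SFT falls out as a special case: with the reference policy set to the current policy and a unit advantage, the estimator reduces to the negative log-likelihood of the demonstrations, which is the sense in which RL and SFT are instances of a single gradient.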
Related papers
- Nonparametric Data Attribution for Diffusion Models [57.820618036556084]
Data attribution for generative models seeks to quantify the influence of individual training examples on model outputs. We propose a nonparametric attribution method that operates entirely on data, measuring influence via patch-level similarity between generated and training images.
arXiv Detail & Related papers (2025-10-16T03:37:16Z)
- Federated Online Learning for Heterogeneous Multisource Streaming Data [0.0]
Federated learning has emerged as an essential paradigm for distributed multi-source data analysis under privacy concerns. In this paper, we propose a federated online learning (FOL) method for distributed multi-source streaming data analysis.
arXiv Detail & Related papers (2025-08-08T19:08:53Z)
- MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose MITA, a Meet-In-The-Middle approach that introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
- Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts [12.818400676159953]
Action recognition models often lack robustness when faced with natural distribution shifts between training and test data. We propose two novel evaluation methods to assess model resilience to such distribution disparity. We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models.
arXiv Detail & Related papers (2024-01-21T05:50:39Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Stabilizing Subject Transfer in EEG Classification with Divergence Estimation [17.924276728038304]
We propose several graphical models to describe an EEG classification task.
We identify statistical relationships that should hold true in an idealized training scenario.
We design regularization penalties to enforce these relationships in two stages.
arXiv Detail & Related papers (2023-10-12T23:06:52Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method on multiple OpenAI Gym tasks using the D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization [89.54947228958494]
This paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks.
We propose a novel statistics-based approach, the Two-WIng NormliSation (TWINS) fine-tuning framework.
TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness.
arXiv Detail & Related papers (2023-03-20T14:12:55Z)
- Learning Gaussian Graphical Models with Latent Confounders [74.72998362041088]
We compare and contrast two strategies for inference in graphical models with latent confounders.
While these two approaches have similar goals, they are motivated by different assumptions about confounding.
We propose a new method, which combines the strengths of these two approaches.
arXiv Detail & Related papers (2021-05-14T00:53:03Z)
- rTop-k: A Statistical Estimation Approach to Distributed SGD [5.197307534263253]
We propose a simple statistical estimation model for gradients that captures their sparsity and yields a statistically optimal communication scheme combining top-k and random-k sparsification.
We show through extensive experiments on both image and language domains, with the CIFAR-10, ImageNet, and Penn Treebank datasets, that this skewed application of the two sparsification methods consistently and significantly outperforms either method applied alone (see the sketch after this entry).
arXiv Detail & Related papers (2020-05-21T16:27:46Z)
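For concreteness, here is a minimal NumPy sketch of an rTop-k-style sparsifier, under the assumption (suggested by the title and the summary above, not stated in this snippet) that the scheme first keeps the r*k largest-magnitude coordinates and then samples k of them uniformly at random; the parameter names and the default r are illustrative.

```python
# Minimal sketch of an rTop-k-style gradient sparsifier (NumPy).
# Assumption: stage 1 keeps the r*k largest-magnitude entries,
# stage 2 samples k of them uniformly; names and defaults are illustrative.
import numpy as np

def rtopk(grad: np.ndarray, k: int, r: int = 4, rng=None) -> np.ndarray:
    """Return a copy of `grad` with at most k nonzero entries."""
    rng = np.random.default_rng() if rng is None else rng
    flat = grad.ravel()
    pool = min(r * k, flat.size)
    # Stage 1 (top-rk): indices of the `pool` largest-magnitude coordinates.
    top = np.argpartition(np.abs(flat), flat.size - pool)[-pool:]
    # Stage 2 (random-k): keep k of those indices uniformly at random.
    keep = rng.choice(top, size=min(k, pool), replace=False)
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(grad.shape)
```

Setting r = 1 degenerates to pure top-k, while letting r*k cover the whole gradient degenerates to pure random-k; a middle, "skewed" setting is what the summary above claims outperforms either extreme.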
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.