A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs
- URL: http://arxiv.org/abs/2511.21056v1
- Date: Wed, 26 Nov 2025 04:48:33 GMT
- Title: A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs
- Authors: Quan Xiao, Tianyi Chen
- Abstract summary: We tackle offline data selection and online self-refining generation through an optimization perspective. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework.
- Score: 55.931369468485464
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline data selection and online self-refining generation, which enhance data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We tackle both from an optimization perspective. Specifically, bilevel data selection is used for offline data selection with respect to a validation dataset, and online self-refining generation is treated as a model adaptation step that selects the model trained on current responses which best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight, explicitly or implicitly, to each question and response. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and establish its performance gains over unfiltered direct-mixing baselines. By combining offline data with validation-weighted online generations, our method improves fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.
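The bilevel idea in the abstract can be made concrete with a small sketch. Below is a minimal, illustrative PyTorch example (not the authors' implementation) of a one-step-unrolled bilevel data-selection loop: per-example weights on a toy training set are learned so that the model produced by a weighted training step fits a held-out validation set. All names, dimensions, and learning rates are assumptions for illustration only.

```python
# Minimal sketch of bilevel data selection (one-step unrolled approximation).
# Toy regression data stands in for (question, response) pairs; all
# hyperparameters and variable names here are illustrative assumptions.

import torch

torch.manual_seed(0)

n_train, n_val, dim = 64, 16, 8
x_tr, y_tr = torch.randn(n_train, dim), torch.randn(n_train, 1)
x_val, y_val = torch.randn(n_val, dim), torch.randn(n_val, 1)

w = torch.zeros(dim, 1, requires_grad=True)        # model parameters (lower level)
logits = torch.zeros(n_train, requires_grad=True)  # per-example weight logits (upper level)

inner_lr, outer_lr = 0.1, 0.05
opt_outer = torch.optim.Adam([logits], lr=outer_lr)

for step in range(200):
    weights = torch.softmax(logits, dim=0)         # learned data weights

    # Lower level: one gradient step on the weighted training loss,
    # kept differentiable so the upper level can see through it.
    train_loss = (weights * ((x_tr @ w - y_tr) ** 2).squeeze(1)).sum()
    grad_w = torch.autograd.grad(train_loss, w, create_graph=True)[0]
    w_new = w - inner_lr * grad_w

    # Upper level: validation loss of the adapted model drives the data weights.
    val_loss = ((x_val @ w_new - y_val) ** 2).mean()
    opt_outer.zero_grad()
    val_loss.backward()
    opt_outer.step()

    # Commit the inner update before the next iteration.
    with torch.no_grad():
        w.copy_(w_new.detach())
```

In the paper's setting, the weighted examples would be (question, response) pairs and the inner step a fine-tuning update of the LLM; applying the same validation-weighted selection to models trained on self-generated responses would correspond to the online self-refining step described above. This sketch only illustrates the mechanism, not the paper's exact algorithm.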
Related papers
- Towards Understanding Valuable Preference Data for Large Language Model Alignment [85.38864561060088]
Large language model (LLM) alignment is typically achieved through learning from human preference comparisons.
We assess data quality through individual influence on validation data using our newly proposed truncated influence function (TIF).
To this end, we combine them to offset their diverse error sources, resulting in a simple yet effective data selection rule.
arXiv Detail & Related papers (2025-10-15T06:57:55Z) - Offline vs. Online Learning in Model-based RL: Lessons for Data Collection Strategies [41.452036409068235]
Data collection is crucial for learning robust world models in model-based reinforcement learning.
The effects of online vs. offline data on world models, and thus on the resulting task performance, have not been thoroughly studied in the literature.
We identify a key challenge behind the performance degradation of offline agents: encountering out-of-distribution states at test time.
We demonstrate that this issue can be mitigated by allowing additional online interactions on a fixed or adaptive schedule.
arXiv Detail & Related papers (2025-09-06T14:52:33Z) - Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation [22.13678670717358]
Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions.
Existing work uses offline datasets to generate data that conform to the online data distribution for data augmentation.
We propose a new data augmentation approach, Classifier-Free Diffusion Generation (CFDG).
arXiv Detail & Related papers (2025-08-09T03:32:23Z) - ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment [94.36403843133616]
Using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks.
Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions.
We propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions.
arXiv Detail & Related papers (2025-05-25T17:42:52Z) - Offline Clustering of Linear Bandits: The Power of Clusters under Limited Data [60.91600085523719]
We study the offline clustering of bandits (Off-ClusBand) problem, which asks how to use an offline dataset to learn cluster properties and improve decision-making.
We propose two algorithms: Off-C2LUB, which we show analytically and experimentally outperforms existing methods under limited offline user data, and Off-CLUB, which may incur bias when data is sparse but performs well and nearly matches the lower bound when data is sufficient.
arXiv Detail & Related papers (2025-05-25T08:43:40Z) - Goal-Conditioned Data Augmentation for Offline Reinforcement Learning [9.181158786602085]
We introduce Goal-cOnditioned Data Augmentation (GODA), a goal-conditioned diffusion-based method for augmenting samples with higher quality.
GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals.
We conduct experiments on the D4RL benchmark and real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA's effectiveness.
arXiv Detail & Related papers (2024-12-29T16:42:30Z) - Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment [104.18002641195442]
We introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm that does not require existing paired data.
Building on the self-play concept, which autonomously generates negative responses, we further incorporate an off-policy learning pipeline to enhance data exploration and exploitation.
arXiv Detail & Related papers (2024-05-31T14:21:04Z) - Online Self-Preferring Language Models [34.22412851864247]
Online Self-Preferring (OSP) language models learn from self-generated response pairs and self-judged preference strengths.
OSP achieves state-of-the-art alignment performance across various metrics in two widely used human preference datasets.
arXiv Detail & Related papers (2024-05-23T02:13:34Z) - Adaptive Policy Learning for Offline-to-Online Reinforcement Learning [27.80266207283246]
We consider an offline-to-online setting where the agent is first learned from the offline dataset and then trained online.
We propose a framework called Adaptive Policy Learning for effectively taking advantage of offline and online data.
arXiv Detail & Related papers (2023-03-14T08:13:21Z) - AliExpress Learning-To-Rank: Maximizing Online Model Performance without Going Online [60.887637616379926]
This paper proposes an evaluator-generator framework for learning-to-rank.
It consists of an evaluator that generalizes to score recommendations in their context, and a generator that maximizes the evaluator score via reinforcement learning.
Our method achieves a significant improvement in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.
arXiv Detail & Related papers (2020-03-25T10:27:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.