Related papers: SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models

URL: http://arxiv.org/abs/2512.07175v1
Date: Mon, 08 Dec 2025 05:16:18 GMT
Title: SPACE: Noise Contrastive Estimation Stabilizes Self-Play Fine-Tuning for Large Language Models
Authors: Yibo Wang, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang
Abstract summary: We introduce a novel self-play fine-tuning method, namely Self-PlAy via Noise Contrastive Estimation (SPACE). SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. We show that SPACE significantly improves the performance of LLMs over various tasks, and outperforms supervised fine-tuning that employs much more real-world samples.
Score: 35.53535406831892
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-play fine-tuning has demonstrated promising abilities in adapting large language models (LLMs) to downstream tasks with limited real-world data. The basic principle is to iteratively refine the model with real samples and synthetic ones generated from itself. However, the existing methods primarily focus on the relative gaps between the rewards for two types of data, neglecting their absolute values. Through theoretical analysis, we identify that the gap-based methods suffer from unstable evolution, due to the potentially degenerated objectives. To address this limitation, we introduce a novel self-play fine-tuning method, namely Self-PlAy via Noise Contrastive Estimation (SPACE), which leverages noise contrastive estimation to capture the real-world data distribution. Specifically, SPACE treats synthetic samples as auxiliary components, and discriminates them from the real ones in a binary classification manner. As a result, SPACE independently optimizes the absolute reward values for each type of data, ensuring a consistently meaningful objective and thereby avoiding the instability issue. Theoretically, we show that the optimal solution of the objective in SPACE aligns with the underlying distribution of real-world data, and SPACE guarantees a provably stable convergence to the optimal distribution. Empirically, we show that SPACE significantly improves the performance of LLMs over various tasks, and outperforms supervised fine-tuning that employs much more real-world samples. Compared to gap-based self-play fine-tuning methods, SPACE exhibits remarkable superiority and stable evolution.
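To make the abstract's contrast between gap-based and classification-based objectives concrete, here is a minimal sketch. It is not the paper's objective: the function names, the scalar "reward" inputs, and the specific loss forms (a logistic loss on the reward gap versus a generic NCE-style binary logistic loss on each reward separately) are illustrative assumptions. It only shows that a loss depending on the gap alone is unchanged when both rewards shift by the same constant, whereas a per-sample binary classification loss is sensitive to the absolute values.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def gap_loss(r_real: float, r_syn: float) -> float:
    """Hypothetical gap-based loss: depends only on r_real - r_syn."""
    return -log_sigmoid(r_real - r_syn)


def nce_style_loss(r_real: float, r_syn: float) -> float:
    """Hypothetical NCE-style loss: classify real as 1 and synthetic as 0,
    so each reward is penalized on its own absolute value."""
    return -log_sigmoid(r_real) - log_sigmoid(-r_syn)


# Assumed example rewards; shifting both by the same constant keeps the gap fixed.
r_real, r_syn, shift = 1.0, -1.0, 5.0

gap_before = gap_loss(r_real, r_syn)
gap_after = gap_loss(r_real + shift, r_syn + shift)

nce_before = nce_style_loss(r_real, r_syn)
nce_after = nce_style_loss(r_real + shift, r_syn + shift)

print("gap loss unchanged by shift:", math.isclose(gap_before, gap_after))
print("NCE-style loss changed by shift:", not math.isclose(nce_before, nce_after))
```

In this toy setting the gap-based loss cannot tell the two reward configurations apart, while the classification-style loss penalizes the shifted synthetic reward for being positive.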

Related papers

Online Bayesian Experimental Design for Partially Observed Dynamical Systems [10.774974720491565]
We develop a principled framework for optimizing data collection in dynamical systems with partial observability. Our framework successfully handles both partial observability and online inference.
arXiv Detail & Related papers (2025-11-06T14:29:05Z)
Modest-Align: Data-Efficient Alignment for Vision-Language Models [67.48633659305592]
Cross-modal alignment models often suffer from overconfidence and degraded performance when operating in resource-constrained settings. We propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
arXiv Detail & Related papers (2025-10-24T16:11:10Z)
Scalable Asynchronous Federated Modeling for Spatial Data [5.741482548419374]
This work proposes an asynchronous federated modeling framework for spatial data based on low-rank Gaussian process approximations. We establish linear convergence with explicit dependence on staleness, a result of standalone theoretical significance. Numerical experiments demonstrate that the asynchronous algorithm achieves synchronous performance under balanced resource allocation.
arXiv Detail & Related papers (2025-10-02T08:04:46Z)
Bridging the Synthetic-Real Gap: Supervised Domain Adaptation for Robust Spacecraft 6-DoF Pose Estimation [13.83897333268682]
Spacecraft Pose Estimation is a fundamental capability for autonomous space operations such as rendezvous, docking, and in-orbit docking. Existing domain adaptation approaches aim to mitigate this issue but often underperform when a modest number of labeled target samples are available. We propose the first Supervised Domain Adaptation (SDA) framework tailored for SPE keypoint regression.
arXiv Detail & Related papers (2025-09-17T08:03:05Z)
RoHOI: Robustness Benchmark for Human-Object Interaction Detection [84.78366452133514]
Human-Object Interaction (HOI) detection is crucial for robot-human assistance, enabling context-aware support. We introduce the first robustness benchmark for HOI detection, evaluating model resilience under diverse challenges. Our benchmark, RoHOI, includes 20 corruption types based on the HICO-DET and V-COCO datasets and a new robustness-focused metric.
arXiv Detail & Related papers (2025-07-12T01:58:04Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm [14.980926991441345]
We show that datasets containing interventional data can be effectively extracted under realistic assumptions about the data distribution. We introduce a novel variant of interventional faithfulness, which relies on comparisons between the marginal distributions of each variable across observational and interventional settings. We also introduce Intersort, an algorithm designed to infer the causal order from datasets containing large numbers of single-variable interventions.
arXiv Detail & Related papers (2024-05-28T16:07:17Z)
Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC). LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses. LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL). We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking. We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
Decomposed Adversarial Learned Inference [118.27187231452852]
We propose a novel approach, Decomposed Adversarial Learned Inference (DALI). DALI explicitly matches prior and conditional distributions in both data and code spaces. We validate the effectiveness of DALI on the MNIST, CIFAR-10, and CelebA datasets.
arXiv Detail & Related papers (2020-04-21T20:00:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.