Related papers: TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning

TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning

URL: http://arxiv.org/abs/2512.13106v1
Date: Mon, 15 Dec 2025 09:03:45 GMT
Title: TraPO: A Semi-Supervised Reinforcement Learning Framework for Boosting LLM Reasoning
Authors: Shenzhi Yang, Guangcheng Zhu, Xing Zheng, Yingfan MA, Zhongqi Chen, Bowen Song, Weiqiang Wang, Junbo Zhao, Gang Chen, Haobo Wang,
Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs)<n>We propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones.<n>With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%)
Score: 33.47825979936341
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization, which, however, suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, such as through entropy and majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from the reinforcement of incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that utilizes a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, that identifies reliable unlabeled samples by matching their learning trajectory similarity to labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, when using 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using only 10% of the labeled data. The code is available via https://github.com/ShenzhiYang2000/TRAPO.

Related papers

ANML: Attribution-Native Machine Learning with Guaranteed Robustness [0.0]
We introduce ANML, a framework that weights training samples by four quality factors.<n>ANML achieves 33-72% error reduction over gradient-only baselines.<n> contributor-level attribution provides 1.3-5.3x greater improvement than sample-level methods.
arXiv Detail & Related papers (2026-02-12T08:12:30Z)
CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.<n>Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.<n>We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning [39.1720897614261]
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning.<n>We propose SPARK: a three-stage framework where in the first stage a generator model produces diverse solutions and a verifier model evaluates them.<n>We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision.
arXiv Detail & Related papers (2025-12-02T21:30:47Z)
Reasoning with Sampling: Your Base Model is Smarter Than You Think [52.639108524651846]
We propose a simple iterative sampling algorithm leveraging the base models' own likelihoods.<n>We show that our algorithm offers substantial boosts in reasoning that nearly match and even outperform those from RL.<n>Our method does not require training, curated datasets, or a verifier.
arXiv Detail & Related papers (2025-10-16T17:18:11Z)
Self-Improving LLM Agents at Test-Time [49.9396634315896]
One paradigm of language model (LM) fine-tuning relies on creating large training datasets.<n>In practice, gathering large sets of data is inefficient, and training on them is prohibitively expensive.<n>We study two variants of this approach: Test-Time Self-Improvement (TT-SI) and Test-Time Distillation (TT-D)
arXiv Detail & Related papers (2025-10-09T06:37:35Z)
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence.<n>Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks.<n>This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through textbfOutcome textbfREwtextbfArd-based reinforcement textbfLearning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
How Low Can You Go? Surfacing Prototypical In-Distribution Samples for Unsupervised Anomaly Detection [48.30283806131551]
We show that UAD with extremely few training samples can already match -- and in some cases even surpass -- the performance of training with the whole training dataset. We propose an unsupervised method to reliably identify prototypical samples to further boost UAD performance.
arXiv Detail & Related papers (2023-12-06T15:30:47Z)
Active Self-Semi-Supervised Learning for Few Labeled Samples [4.713652957384158]
Training deep models with limited annotations poses a significant challenge when applied to diverse practical domains. We propose a simple yet effective framework, active self-semi-supervised learning (AS3L) AS3L bootstraps semi-supervised models with prior pseudo-labels (PPL) We develop active learning and label propagation strategies to obtain accurate PPL.
arXiv Detail & Related papers (2022-03-09T07:45:05Z)
Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck. We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network. We show our methods leveraging only 20-30 labeled samples per class for each task for training and for validation can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.