Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
- URL: http://arxiv.org/abs/2506.06395v3
- Date: Wed, 11 Jun 2025 06:21:59 GMT
- Title: Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
- Authors: Pengyi Li, Matvey Skripkin, Alexander Zubrey, Andrey Kuznetsov, Ivan Oseledets
- Abstract summary: We propose Reinforcement Learning via Self-Confidence (RLSC) for large language models (LLMs). RLSC uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering.
- Score: 43.98994504606355
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC provides a simple, scalable post-training method for inference models, requiring only a small number of samples and no labeled supervision.
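The confidence-as-reward idea in the abstract can be sketched as follows. This is an illustrative reading, not the authors' implementation: each sampled answer is rewarded by its empirical frequency among the model's own 16 samples, so no labels or external reward model are needed. The function name and sample values are assumptions.

```python
from collections import Counter

def self_confidence_rewards(sampled_answers):
    """Label-free reward: score each sampled answer by how often the
    model itself produced it across the sample set. Sharper agreement
    among samples yields higher reward, with no external labels."""
    counts = Counter(sampled_answers)
    n = len(sampled_answers)
    return [counts[a] / n for a in sampled_answers]

# 16 sampled answers for one question, as in the paper's setup
# (the answer strings themselves are illustrative)
samples = ["42"] * 10 + ["41"] * 4 + ["40"] * 2
rewards = self_confidence_rewards(samples)
print(rewards[0], rewards[10], rewards[14])  # 0.625 0.25 0.125
```

Maximizing this signal with standard policy-gradient updates would push probability mass toward the model's own modal answer, which is one plausible mechanism behind the reported gains.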
Related papers
- RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning [18.62575670251997]
Reinforcement learning (RL) for large language models is an energy-intensive endeavor. We present RLEP, a framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with replayed successes.
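The replay-blending update described in this summary might look like the following sketch. The replay fraction, function name, and data shapes are assumptions for illustration, not details from the paper:

```python
import random

def blended_minibatch(fresh_rollouts, replay_buffer, batch_size, replay_frac=0.5):
    """Build one update mini-batch that mixes newly generated rollouts
    with replayed verified successes, as RLEP's update step is summarized.
    replay_frac is an assumed knob, capped by the buffer size."""
    k = min(int(batch_size * replay_frac), len(replay_buffer))
    batch = random.sample(replay_buffer, k)
    batch += random.sample(fresh_rollouts, batch_size - k)
    random.shuffle(batch)
    return batch

fresh = [f"rollout_{i}" for i in range(32)]
replayed = [f"verified_{i}" for i in range(8)]
batch = blended_minibatch(fresh, replayed, batch_size=16)
```

Anchoring each batch on previously verified successes is what lets the policy keep exploiting known-good trajectories while still learning from fresh rollouts.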
arXiv Detail & Related papers (2025-07-10T05:58:55Z)
- Resa: Transparent Reasoning Models via SAEs [14.617192915344349]
Resa is a family of 1.5B reasoning models trained via SAE-Tuning, a novel and efficient sparse autoencoder tuning procedure. When applied to certain base models before further RL post-training, SAE-Tuning retains >97% of its RL-trained counterpart's reasoning performance. It enables reasoning performance such as 43.33% Pass@1 on AIME24 and 90% Pass@1 on AMC23 for only around $1 in additional cost.
arXiv Detail & Related papers (2025-06-11T17:44:01Z)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning [50.02117478165099]
We show that large-scale reinforcement learning can significantly enhance the reasoning capabilities of strong, small- and mid-sized models. We propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts.
arXiv Detail & Related papers (2025-05-22T08:50:47Z)
- RM-R1: Reward Modeling as Reasoning [81.50471199906738]
Reasoning Reward Models (ReasRMs) formulate reward modeling as a reasoning task. We propose a reasoning-oriented training pipeline and train a family of ReasRMs, RM-R1. Our models achieve state-of-the-art performance across three reward model benchmarks on average.
arXiv Detail & Related papers (2025-05-05T06:11:12Z)
- Think, Prune, Train, Improve: Scaling Reasoning without Scaling Models [1.96238419451815]
Large language models (LLMs) have demonstrated strong capabilities in programming and mathematical reasoning tasks, but are constrained by limited high-quality training data. We introduce a scalable framework that iteratively fine-tunes models on their own reasoning traces, using ground-truth pruning to ensure high-quality training data. This approach yields improved performance: on GSM8K, Gemma2-2B achieves a Pass@1 of 57.6% (from 41.9%), Gemma2-9B reaches 82%, matching LLaMA-3.1-70B, and LLaMA-3.1-70B attains 91%, even surpassing GPT-4o.
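The ground-truth pruning step in this summary can be sketched as below; the `Answer:` marker convention and the helper names are illustrative assumptions, not the paper's code:

```python
def extract_answer(trace):
    # Illustrative convention: the final answer follows the last "Answer:" marker.
    return trace.rsplit("Answer:", 1)[-1].strip()

def prune_traces(traces, gold_answer):
    """Ground-truth pruning: keep only self-generated reasoning traces
    whose final answer matches the reference, yielding high-quality
    data for the next fine-tuning round."""
    return [t for t in traces if extract_answer(t) == gold_answer]

kept = prune_traces(
    ["... so Answer: 12", "... thus Answer: 13", "... hence Answer: 12"],
    "12",
)
print(len(kept))  # 2
```

Iterating generate-prune-fine-tune in this way lets a model improve on its own traces while ground truth filters out wrong reasoning.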
arXiv Detail & Related papers (2025-04-25T06:48:55Z)
- SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild [46.25416990387885]
Long chain-of-thought (CoT) reasoning can naturally emerge through a simple reinforcement learning framework with rule-based rewards. We investigate zero RL training across 10 diverse base models, including Llama3-8B, Mistral-7B/24B, DeepSeek-Math-7B, Qwen2.5-Math-7B, and all Qwen2.5 models from 0.5B to 32B.
arXiv Detail & Related papers (2025-03-24T17:06:10Z)
- LIMR: Less is More for RL Scaling [25.477841726836836]
We introduce Learning Impact Measurement (LIM), an automated method to evaluate and prioritize training samples. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models.
arXiv Detail & Related papers (2025-02-17T15:13:29Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
- Common 7B Language Models Already Possess Strong Math Capabilities [61.61442513067561]
This paper shows that the LLaMA-2 7B model with common pre-training already exhibits strong mathematical abilities.
The potential for extensive scaling is constrained by the scarcity of publicly available math questions.
arXiv Detail & Related papers (2024-03-07T18:00:40Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- LIMA: Less Is More for Alignment [112.93890201395477]
We train LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses.
LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples.
In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases.
arXiv Detail & Related papers (2023-05-18T17:45:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it hosts and is not responsible for any consequences of its use.