Related papers: Training Language Models to Self-Correct via Reinforcement Learning

Training Language Models to Self-Correct via Reinforcement Learning

URL: http://arxiv.org/abs/2409.12917v2
Date: Fri, 4 Oct 2024 17:28:45 GMT
Title: Training Language Models to Self-Correct via Reinforcement Learning
Authors: Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, Aleksandra Faust,
Abstract summary: Self-correction has been found to be largely ineffective in modern large language models (LLMs) We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
Score: 98.35197671595343
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

Related papers

Boosting LLM Reasoning via Spontaneous Self-Correction [43.4980625253775]
One of the approaches for improving math reasoning is self-correction.<n>Existing self-correction approaches treat corrections as standalone post-generation refinements.<n>We propose SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
arXiv Detail & Related papers (2025-06-07T21:23:00Z)
Self-Correcting Code Generation Using Small Language Models [11.4397549365277]
Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement.<n>We introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction.<n>With 1B-scale models, CoCoS achieves improvements of 35.8% on the MBPP and 27.7% on HumanEval compared to the baselines.
arXiv Detail & Related papers (2025-05-29T04:04:44Z)
Can Large Reasoning Models Self-Train? [58.953117118687096]
Scaling the performance of large language models increasingly depends on methods that reduce reliance on human supervision.<n>We propose an online self-training reinforcement learning algorithm that leverages the model's self-consistency to infer correctness signals and train without any ground-truth supervision.
arXiv Detail & Related papers (2025-05-27T17:16:00Z)
Entropy-Based Adaptive Weighting for Self-Training [15.089334734753677]
We propose Entropy-Based Adaptive Weighting for Self-Training (EAST) EAST is an adaptive weighting strategy designed to prioritize uncertain data during self-training. We evaluate our approach on GSM8K and MATH benchmarks.
arXiv Detail & Related papers (2025-03-31T10:04:35Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training [18.896813839389893]
We propose an iterative self-training framework, Agent-R, that enables language Agent to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recover correct trajectories from erroneous ones. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction.
arXiv Detail & Related papers (2025-01-20T11:46:04Z)
Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks [43.96835245022083]
Self-correction that instructs models to refine their outputs presents a promising solution to this issue. This study investigates the self-correction capabilities of Vision-Language Models during both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-10-05T06:28:54Z)
Small Language Models Need Strong Verifiers to Self-Correct Reasoning [69.94251699982388]
Self-correction has emerged as a promising solution to boost the reasoning performance of large language models (LLMs) This work explores whether small (= 13B) language models (LMs) have the ability of self-correction on reasoning tasks with minimal inputs from stronger LMs.
arXiv Detail & Related papers (2024-04-26T03:41:28Z)
Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to there is a discrepancy between the predicted confidence and performance. We introduce Dynamic Regularization (DReg) which aims to learn what should be learned during training thereby circumventing the confidence adjusting trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z)
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop textbfInferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment. Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model [119.65409513119963]
We introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods.
arXiv Detail & Related papers (2023-05-29T17:57:46Z)
Learning Residual Model of Model Predictive Control via Random Forests for Autonomous Driving [13.865293598486492]
One major issue in predictive control (MPC) for autonomous driving is the contradiction between the system model's prediction and computation. This paper reformulates the MPC tracking accuracy as a program (QP) problem optimization as a program (QP) can effectively solve it.
arXiv Detail & Related papers (2023-04-10T03:32:09Z)
Type-Driven Multi-Turn Corrections for Grammatical Error Correction [46.34114495164071]
Grammatical Error Correction (GEC) aims to automatically detect and correct grammatical errors. Previous studies mainly focus on the data augmentation approach to combat the exposure bias. We propose a Type-Driven Multi-Turn Corrections approach for GEC.
arXiv Detail & Related papers (2022-03-17T07:30:05Z)
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition [55.362258027878966]
We present momentum pseudo-labeling (MPL) as a simple yet effective strategy for semi-supervised speech recognition. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios.
arXiv Detail & Related papers (2021-06-16T16:24:55Z)
PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models [9.496524884855559]
This paper presents an efficient provenance-based approach, PrIU, for incrementally updating model parameters without sacrificing prediction accuracy. We prove the correctness and convergence of the incrementally updated model parameters, and validate it experimentally. Experimental results show that up to two orders of magnitude speed-ups can be achieved by PrIU-opt compared to simply retraining the model from scratch, yet obtaining highly similar models.
arXiv Detail & Related papers (2020-02-26T21:04:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.