ScRPO: From Errors to Insights
- URL: http://arxiv.org/abs/2511.06065v2
- Date: Wed, 12 Nov 2025 01:27:03 GMT
- Title: ScRPO: From Errors to Insights
- Authors: Lianrui Li, Dakuan Lu, Jiawei Shao, Chi Zhang, Xuelong Li
- Abstract summary: We propose Self-correction Relative Policy Optimization (ScRPO) to enhance large language models on challenging mathematical problems. Our approach consists of two stages: a trial-and-error learning stage and a self-correction learning stage. Extensive experiments are conducted across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B.
- Score: 47.828888776503675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose Self-correction Relative Policy Optimization (ScRPO), a novel reinforcement learning framework designed to enhance large language models on challenging mathematical problems by leveraging self-reflection and error correction. Our approach consists of two stages: (1) Trial-and-error learning stage: training the model with GRPO and collecting incorrect answers along with their corresponding questions in an error pool; (2) Self-correction learning stage: guiding the model to reflect on why its previous answers were wrong. We conduct extensive experiments across multiple math reasoning benchmarks, including AIME, AMC, Olympiad, MATH-500, and GSM8k, using Deepseek-Distill-Qwen-1.5B and Deepseek-Distill-Qwen-7B. The experimental results demonstrate that ScRPO consistently outperforms several post-training methods. These findings highlight ScRPO as a promising paradigm for enabling language models to self-improve on difficult tasks with limited external feedback, paving the way toward more reliable and capable AI systems.
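The abstract names the two stages but gives no implementation details; the following is a minimal Python sketch of how such a loop might be organized. Every helper here (generate_answers, check_answer, grpo_update, build_correction_prompt) and the error-pool layout are hypothetical stubs for illustration, not the authors' code.

```python
# Minimal sketch of ScRPO's two training stages (hypothetical API; the paper's
# actual implementation details are not given in the abstract).
import random

def generate_answers(model, question, k=8):
    """Sample k candidate answers for a question (stubbed here)."""
    return [f"{model}:answer-{i}" for i in range(k)]

def check_answer(answer, reference):
    """Rule-based verifier, e.g. exact match on the final answer (stubbed)."""
    return answer.endswith(reference)

def grpo_update(model, question, answers, rewards):
    """Placeholder for one GRPO policy-gradient step on the sampled group."""
    return model  # a real implementation would update model parameters

def build_correction_prompt(question, wrong_answer):
    """Ask the model to reflect on why its earlier answer was wrong."""
    return (f"Question: {question}\n"
            f"Your previous answer was: {wrong_answer}\n"
            "Explain the mistake, then solve the problem again.")

def scrpo_train(model, dataset):
    error_pool = []
    # Stage 1: trial-and-error learning with GRPO, collecting failures.
    for question, reference in dataset:
        answers = generate_answers(model, question)
        rewards = [1.0 if check_answer(a, reference) else 0.0 for a in answers]
        model = grpo_update(model, question, answers, rewards)
        for a, r in zip(answers, rewards):
            if r == 0.0:
                error_pool.append((question, reference, a))
    # Stage 2: self-correction learning on the collected errors.
    random.shuffle(error_pool)
    for question, reference, wrong in error_pool:
        prompt = build_correction_prompt(question, wrong)
        corrections = generate_answers(model, prompt)
        rewards = [1.0 if check_answer(c, reference) else 0.0 for c in corrections]
        model = grpo_update(model, prompt, corrections, rewards)
    return model

toy_data = [("1+1?", "2"), ("2+3?", "5")]
scrpo_train("toy-model", toy_data)
```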
Related papers
- Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
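A rough sketch of the parallel-then-refine pattern the GSR abstract describes. The prompting and helper functions below are placeholders assumed for illustration, not the paper's implementation.

```python
# Sketch of Generative Self-Refinement at test time: sample candidates in
# parallel, then ask the model to synthesize an improved answer from them.
def sample(model, prompt, n=4):
    return [f"candidate-{i} for {prompt!r}" for i in range(n)]

def refine(model, problem, candidates):
    refinement_prompt = (
        f"Problem: {problem}\n"
        "Candidate solutions:\n"
        + "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
        + "\nWrite a single improved solution that fixes any errors above."
    )
    return sample(model, refinement_prompt, n=1)[0]

def gsr_answer(model, problem):
    candidates = sample(model, problem)        # parallel drafts
    return refine(model, problem, candidates)  # synthesized final answer

print(gsr_answer("toy-model", "Solve x^2 = 4"))
```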
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding [53.63482987410292]
We present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks.
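One way such a curriculum could be wired up is sketched below. The difficulty score here is a trivial stand-in (text length); the paper uses scores predicted by the pretrained model itself, whose exact form the abstract does not spell out.

```python
# Sketch of a self-adaptive curriculum: score each training example, then
# fine-tune from easiest to hardest. Helpers are placeholders only.
def difficulty_score(model, example):
    # A real system might use the pretrained model's loss/perplexity on the
    # example; text length is used purely as a runnable placeholder.
    return len(example["text"])

def curriculum_order(model, dataset):
    return sorted(dataset, key=lambda ex: difficulty_score(model, ex))

def fine_tune(model, batch):
    return model  # placeholder for one gradient step

def train_with_curriculum(model, dataset, batch_size=8):
    ordered = curriculum_order(model, dataset)
    for i in range(0, len(ordered), batch_size):
        model = fine_tune(model, ordered[i:i + batch_size])
    return model

toy = [{"text": "a much longer training example"}, {"text": "short"}]
train_with_curriculum("toy-model", toy, batch_size=1)
```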
arXiv Detail & Related papers (2025-07-13T19:36:17Z) - GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models [0.3831554157916835]
Group Relative Policy Optimization (GRPO) is widely adopted by R1-like reasoning models. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data.
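The three ingredients can be illustrated with a toy reward and advantage computation; the functional forms and coefficients below are assumptions for illustration, not the paper's definitions.

```python
# Sketch of the GRPO-LEAD ingredients named in the abstract: length-regularized
# reward, explicit penalty for wrong answers, difficulty-aware reweighting.
import statistics

def shaped_reward(correct, length, target_len=512,
                  length_coef=0.001, wrong_penalty=0.5):
    if not correct:
        return -wrong_penalty                                   # explicit penalty
    return 1.0 - length_coef * max(0, length - target_len)      # length regularizer

def difficulty_weight(group_rewards):
    # Harder prompts (low within-group success rate) get a larger weight.
    solve_rate = sum(r > 0 for r in group_rewards) / len(group_rewards)
    return 1.0 + (1.0 - solve_rate)

def advantages(group_rewards):
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0
    w = difficulty_weight(group_rewards)
    return [w * (r - mean) / std for r in group_rewards]

# Example: a group of 4 sampled solutions for one hard prompt.
rewards = [shaped_reward(c, l) for c, l in [(True, 700), (False, 400),
                                            (False, 900), (True, 450)]]
print(advantages(rewards))
```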
arXiv Detail & Related papers (2025-04-13T19:07:45Z) - AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO [0.0]
Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring visual spatial reasoning. We introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation.
arXiv Detail & Related papers (2025-02-20T16:05:18Z) - Iterative Deepening Sampling as Efficient Test-Time Scaling [27.807695570974644]
Recent reasoning models, such as OpenAI's O1 series, have demonstrated exceptional performance on complex reasoning tasks. We propose a novel iterative deepening sampling algorithm framework designed to enhance self-correction and generate higher-quality samples.
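A heavily hedged sketch of what an iterative-deepening sampler could look like: start with a small budget, spend more compute only when no attempt verifies, and let later rounds see earlier failures. The budget schedule, prompting, and verifier below are illustrative assumptions, not the paper's algorithm.

```python
# Sketch of an iterative-deepening style sampler with a growing budget.
def sample(model, prompt, n):
    return [f"attempt-{i}" for i in range(n)]

def verify(answer):
    return answer.endswith("-3")   # stand-in for a real answer checker

def iterative_deepening_sample(model, problem, budgets=(2, 4, 8)):
    history = []
    for budget in budgets:
        prompt = problem if not history else (
            problem + "\nPrevious failed attempts:\n" + "\n".join(history)
            + "\nTry again and correct the earlier mistakes.")
        attempts = sample(model, prompt, budget)
        for a in attempts:
            if verify(a):
                return a
        history.extend(attempts)
    return None  # no attempt verified within the budget schedule

print(iterative_deepening_sample("toy-model", "Prove 1+1=2"))
```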
arXiv Detail & Related papers (2025-02-08T04:39:51Z) - SuperCorrect: Advancing Small LLM Reasoning with Thought Template Distillation and Self-Correction [89.56181323849512]
SuperCorrect is a novel two-stage framework that uses a large teacher model to supervise and correct both the reasoning and reflection processes of a smaller student model. In the first stage, we extract hierarchical high-level and detailed thought templates from the teacher model to guide the student model in eliciting more fine-grained reasoning thoughts. In the second stage, we introduce cross-model collaborative direct preference optimization (DPO) to enhance the self-correction abilities of the student model.
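The second stage can be pictured as cross-model preference-pair construction: the student's faulty attempt becomes the rejected response and the teacher's correction the chosen one. The sketch below uses placeholder helpers and is not the authors' pipeline.

```python
# Sketch of cross-model DPO pair construction (illustrative stubs only).
def student_solve(question):
    return "student reasoning ... final answer: 41"

def teacher_correct(question, student_solution):
    return "corrected reasoning ... final answer: 42"

def build_cross_model_dpo_pairs(questions, references):
    pairs = []
    for q, ref in zip(questions, references):
        attempt = student_solve(q)
        if ref not in attempt:                      # student got it wrong
            pairs.append({"prompt": q,
                          "chosen": teacher_correct(q, attempt),
                          "rejected": attempt})
    return pairs

print(build_cross_model_dpo_pairs(["What is 6*7?"], ["42"]))
```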
arXiv Detail & Related papers (2024-10-11T17:25:52Z) - Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
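A minimal sketch of error-injected pair construction as the abstract describes it: corrupt a pivotal token in a correct solution and pair the original with the corrupted version. The error rules below are invented examples, not the paper's predefined error set.

```python
# Sketch of building hard preference pairs by injecting subtle errors.
import re

ERROR_RULES = [
    (r"\+", "-"),          # flip an operator
    (r"\b2\b", "3"),       # perturb a small constant
]

def inject_subtle_error(solution):
    for pattern, replacement in ERROR_RULES:
        if re.search(pattern, solution):
            # Corrupt only the first occurrence so the error stays subtle.
            return re.sub(pattern, replacement, solution, count=1)
    return None  # no pivotal token found to corrupt

def build_preference_pairs(correct_solutions):
    pairs = []
    for sol in correct_solutions:
        corrupted = inject_subtle_error(sol)
        if corrupted is not None:
            pairs.append({"chosen": sol, "rejected": corrupted})
    return pairs

print(build_preference_pairs(["x = 2 + 5 = 7, so the answer is 7"]))
```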
arXiv Detail & Related papers (2024-10-09T07:43:38Z) - Training Language Models to Self-Correct via Reinforcement Learning [98.35197671595343]
Self-correction has been found to be largely ineffective in modern large language models (LLMs).
We develop a multi-turn online reinforcement learning approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data.
We find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
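A toy sketch of a two-turn self-correction rollout with a reward that favors fixing a wrong first attempt; the reward shaping and prompts are assumptions for illustration, not SCoRe's actual formulation.

```python
# Sketch of a two-turn self-correction rollout and its reward.
def generate(model, prompt):
    return f"answer to {prompt!r}"

def is_correct(answer, reference):
    return reference in answer

def two_turn_rollout(model, question, reference, correction_bonus=0.5):
    first = generate(model, question)
    revise_prompt = (f"{question}\nYour previous answer: {first}\n"
                     "Check it and give a corrected final answer.")
    second = generate(model, revise_prompt)
    r1, r2 = is_correct(first, reference), is_correct(second, reference)
    reward = float(r2) + (correction_bonus if (not r1 and r2) else 0.0)
    return {"turns": [first, second], "reward": reward}

print(two_turn_rollout("toy-model", "What is 6*7?", "42"))
```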
arXiv Detail & Related papers (2024-09-19T17:16:21Z) - Recursive Introspection: Teaching Language Model Agents How to Self-Improve [30.086494067593268]
We develop RISE: Recursive IntroSpEction, an approach for fine-tuning large language models.
Our experiments show that RISE enables Llama2, Llama3, and Mistral models to improve themselves with more turns on math reasoning tasks.
arXiv Detail & Related papers (2024-07-25T17:35:59Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
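Roughly, the objective can be thought of as a DPO-style preference loss plus an optimism bonus for responses the current policy already rates highly, encouraging exploration of promising out-of-distribution responses. The sketch below uses an assumed form and coefficient, not the exact SELM objective.

```python
# Sketch of an optimistically biased preference loss (illustrative form only).
import math

def dpo_term(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

def selm_style_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected,
                    alpha=0.01):
    optimism = -alpha * logp_chosen          # bonus for high policy log-prob
    return dpo_term(logp_chosen, logp_rejected, ref_chosen, ref_rejected) + optimism

print(selm_style_loss(-12.0, -15.0, -13.0, -14.5))
```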
arXiv Detail & Related papers (2024-05-29T17:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.