Related papers: GPO: Learning from Critical Steps to Improve LLM Reasoning

URL: http://arxiv.org/abs/2509.16456v2
Date: Tue, 21 Oct 2025 02:20:10 GMT
Title: GPO: Learning from Critical Steps to Improve LLM Reasoning
Authors: Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing
Abstract summary: We introduce Guided Pivotal Optimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance.
Score: 13.271737599933147
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the reasoning or thinking capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi-step reasoning capabilities of LLMs still remains a significant challenge. While existing optimization methods have advanced the LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce Guided Pivotal Optimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the 'critical step' within a reasoning trajectory - a point that the model must carefully proceed to succeed at the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples the new rollout and prioritizes the learning process on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process to improve the reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
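
The critical-step idea in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the helper names (`estimate_value`, `step_advantages`, `find_critical_step`, `toy_rollout`), the use of Monte Carlo rollouts to estimate the value of a prefix, the advantage defined as the change in estimated value across one step, and the rule of picking the step with the largest absolute advantage. The paper's actual estimator and selection criterion may differ.

```python
from typing import Callable, List, Sequence

Step = str


def estimate_value(prefix: Sequence[Step],
                   rollout_fn: Callable[[Sequence[Step]], float],
                   n_samples: int = 4) -> float:
    """Monte Carlo estimate of the success rate when continuing from `prefix`."""
    return sum(rollout_fn(prefix) for _ in range(n_samples)) / n_samples


def step_advantages(values: Sequence[float]) -> List[float]:
    """Assumed advantage of step t: V(prefix after t) - V(prefix before t).

    This presumes zero intermediate reward and no discounting.
    """
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]


def find_critical_step(values: Sequence[float]) -> int:
    """Index of the step with the largest absolute advantage (assumed rule)."""
    adv = step_advantages(values)
    return max(range(len(adv)), key=lambda t: abs(adv[t]))


def toy_rollout(prefix: Sequence[Step]) -> float:
    """Hypothetical stand-in for sampling a completion from the policy.

    Returns 1.0 (success) unless the prefix already contains the bad step.
    """
    return 0.0 if "sign error" in prefix else 1.0


trajectory = ["parse problem", "set up equation", "sign error",
              "solve", "state answer"]

# values[t] is the estimated success rate after the first t steps.
values = [estimate_value(trajectory[:t], toy_rollout)
          for t in range(len(trajectory) + 1)]

idx = find_critical_step(values)

# "Resetting to the critical step": keep the steps before it, then sample
# fresh rollouts from this prefix and prioritize them during training.
reset_prefix = trajectory[:idx]
print(idx, reset_prefix)
```

In this toy trajectory the estimated value changes only across the third step, so that step is flagged and the prefix before it is kept as the reset point. A real setup would replace `toy_rollout` with sampled completions scored by a verifier, and the reset-point rollouts would then feed whichever optimization method GPO is combined with.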

Related papers

Rectifying LLM Thought from Lens of Optimization [48.98086817378953]
Long chain-of-thought (CoT) prompting enables thorough exploration and deliberation. Despite advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors. We introduce RePro, a novel approach to refine LLM reasoning during post-training.
arXiv Detail & Related papers (2025-12-01T17:41:08Z)
Who Sees What? Structured Thought-Action Sequences for Epistemic Reasoning in LLMs [1.090218572228214]
This study investigates the potential of structured examples to improve the performance of LLM-based agents within a ReAct framework. We propose a structured solution-processing pipeline that generates three categories of examples: optimal goal paths (G-type), informative node paths (E-type) and step-by-step optimal decision sequences contrasting alternative actions (L-type). While L-type examples slightly reduce clarification requests and overall action steps, they do not yield consistent improvements.
arXiv Detail & Related papers (2025-08-20T09:36:53Z)
Revisiting LLM Reasoning via Information Bottleneck [57.519119962528166]
Large language models (LLMs) have recently demonstrated remarkable progress in reasoning capabilities through reinforcement learning with verifiable rewards (RLVR). We present a theoretical characterization of LLM reasoning grounded in the information bottleneck (IB) principle. We propose IB-aware reasoning optimization (IBRO), a framework that encourages reasoning trajectories to be both informative about the final correct answer and generalizable.
arXiv Detail & Related papers (2025-07-24T13:14:25Z)
Feedback-Induced Performance Decline in LLM-Based Decision-Making [6.5990946334144756]
Large Language Models (LLMs) can extract context from natural language problem descriptions. This paper studies the behaviour of these models within Markov Decision Processes (MDPs).
arXiv Detail & Related papers (2025-07-20T10:38:56Z)
A Survey of Scaling in Large Language Model Reasoning [62.92861523305361]
We provide a comprehensive examination of scaling in large language model (LLM) reasoning. We analyze scaling in reasoning steps that improves multi-step inference and logical consistency. We discuss scaling in training-enabled reasoning, focusing on optimization through iterative model improvement.
arXiv Detail & Related papers (2025-04-02T23:51:27Z)
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization [86.32257216965229]
We propose a new online reinforcement learning framework that enables MLLMs to self-improve reasoning ability via simple, effective and dense step-wise rewarding. StepGRPO introduces two novel rule-based reasoning rewards: Step-wise Reasoning Accuracy Reward (StepRAR) and Step-wise Reasoning Validity Reward (StepRVR). With the proposed StepGRPO, we introduce R1-VL, a series of MLLMs with outstanding capabilities in step-by-step reasoning.
arXiv Detail & Related papers (2025-03-17T08:51:44Z)
Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models [35.82665698868508]
Large Language Models (LLMs) struggle with high computational time and error propagation during inference time. We propose Meta-Reasoner, a new framework to enable LLMs to optimize the inference compute by adjusting strategies on how to reason during inference time. Our method improves performance by 9-12% over previous SOTA methods while reducing inference time by 28-35%.
arXiv Detail & Related papers (2025-02-27T09:40:13Z)
Are Language Models Up to Sequential Optimization Problems? From Evaluation to a Hegelian-Inspired Enhancement [0.0]
Large Language Models (LLMs) have demonstrated impressive capabilities across numerous fields. This paper explores the proficiency of LLMs in handling Sequential Optimization Problems (SOPs). We introduce WorldGen, a dynamic framework for generating unseen SOPs with controllable complexities. Inspired by the influential framework of Hegelian Dialectics, we propose ACE, demonstrating how the performance of LLMs in SOP contexts can be significantly improved.
arXiv Detail & Related papers (2025-02-04T18:47:31Z)
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. We introduce Q*, a framework for guiding the LLM decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z)
Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [108.72225067368592]
We propose a novel perspective to investigate the design of large language models (LLMs)-based prompts. We identify two pivotal factors in model parameter learning: update direction and update method. We develop a capable Gradient-inspired Prompt-based GPO.
arXiv Detail & Related papers (2024-02-27T15:05:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.