AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
- URL: http://arxiv.org/abs/2502.13943v1
- Date: Wed, 19 Feb 2025 18:35:55 GMT
- Title: AdaptiveStep: Automatically Dividing Reasoning Step through Model Confidence
- Authors: Yuliang Liu, Junjie Lu, Zhaoling Chen, Chaofeng Qu, Jason Klein Liu, Chonghan Liu, Zefan Cai, Yunhui Xia, Li Zhao, Jiang Bian, Chuheng Zhang, Wei Shen, Zhouhan Lin,
- Abstract summary: We propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word.
We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks.
- Score: 29.551802573731305
- License:
- Abstract: Current approaches for training Process Reward Models (PRMs) often involve breaking down responses into multiple reasoning steps using rule-based techniques, such as using predefined placeholder tokens or setting the reasoning step's length into a fixed size. These approaches overlook the fact that specific words do not typically mark true decision points in a text. To address this, we propose AdaptiveStep, a method that divides reasoning steps based on the model's confidence in predicting the next word. This division method provides more decision-making information at each step, enhancing downstream tasks, such as reward model learning. Moreover, our method does not require manual annotation. We demonstrate its effectiveness through experiments with AdaptiveStep-trained PRMs in mathematical reasoning and code generation tasks. Experimental results indicate that the outcome PRM achieves state-of-the-art Best-of-N performance, surpassing greedy search strategy with token-level value-guided decoding, while also reducing construction costs by over 30% compared to existing open-source PRMs. In addition, we provide a thorough analysis and case study on the PRM's performance, transferability, and generalization capabilities.
Related papers
- BRiTE: Bootstrapping Reinforced Thinking Process to Enhance Language Model Reasoning [78.63421517563056]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks.
We present a unified probabilistic framework that formalizes LLM reasoning through a novel graphical model.
We introduce the Bootstrapping Reinforced Thinking Process (BRiTE) algorithm, which works in two steps.
arXiv Detail & Related papers (2025-01-31T02:39:07Z) - Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback [94.25162866972077]
Step-KTO is a training framework that combines process-level and outcome-level binary feedback.
Our experiments show that Step-KTO significantly improves both final answer accuracy and the quality of intermediate reasoning steps.
arXiv Detail & Related papers (2025-01-18T15:38:03Z) - ReARTeR: Retrieval-Augmented Reasoning with Trustworthy Process Rewarding [25.329712997545794]
We propose Retrieval-Augmented Reasoning through Trustworthy Process Rewarding (ReARTeR)
ReARTeR enhances RAG systems' reasoning capabilities through post-training and test-time scaling.
Experimental results on multi-step reasoning benchmarks demonstrate significant improvements.
arXiv Detail & Related papers (2025-01-14T05:56:26Z) - The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in the reasoning processes.
We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs)
We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z) - PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models [28.74956741932006]
We introduce PRMBench, a process-level benchmark to assess the fine-grained error detection capabilities of PRMs.
PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions.
arXiv Detail & Related papers (2025-01-06T16:31:45Z) - Process Reward Model with Q-Value Rankings [18.907163177605607]
Process Reward Modeling (PRM) is critical for complex reasoning and decision-making tasks.
We introduce the Process Q-value Model (PQM), a novel framework that redefines PRM in the context of a Markov Decision Process.
PQM optimize Q-value rankings based on a novel comparative loss function, enhancing the model's ability to capture the intricate dynamics among sequential decisions.
arXiv Detail & Related papers (2024-10-15T05:10:34Z) - Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Reinforcement Learning with a Terminator [80.34572413850186]
We learn the parameters of the TerMDP and leverage the structure of the estimation problem to provide state-wise confidence bounds.
We use these to construct a provably-efficient algorithm, which accounts for termination, and bound its regret.
arXiv Detail & Related papers (2022-05-30T18:40:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.