Advancing Process Verification for Large Language Models via Tree-Based Preference Learning
- URL: http://arxiv.org/abs/2407.00390v1
- Date: Sat, 29 Jun 2024 10:09:49 GMT
- Title: Advancing Process Verification for Large Language Models via Tree-Based Preference Learning
- Authors: Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, Weiming Lu
- Abstract summary: Tree-based Preference Learning Verifier (Tree-PLV) is a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training.
We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing baselines.
- Score: 23.63889344974957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable potential in handling complex reasoning tasks by generating step-by-step rationales. Some methods have proven effective in boosting accuracy by introducing extra verifiers to assess these paths. However, existing verifiers, typically trained on binary-labeled reasoning paths, fail to fully utilize the relative merits of intermediate steps, thereby limiting the effectiveness of the feedback provided. To overcome this limitation, we propose Tree-based Preference Learning Verifier (Tree-PLV), a novel approach that constructs reasoning trees via a best-first search algorithm and collects step-level paired data for preference training. Compared to traditional binary classification, step-level preferences more finely capture the nuances between reasoning steps, allowing for a more precise evaluation of the complete reasoning path. We empirically evaluate Tree-PLV across a range of arithmetic and commonsense reasoning tasks, where it significantly outperforms existing baselines. For instance, Tree-PLV achieved substantial performance gains over the Mistral-7B self-consistency baseline on GSM8K (67.55% to 82.79%), MATH (17.00% to 26.80%), CSQA (68.14% to 72.97%), and StrategyQA (82.86% to 83.25%). Additionally, our study explores the appropriate granularity for applying preference learning, revealing that step-level guidance provides feedback that better aligns with the evaluation of the reasoning process.
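A rough Python sketch of the mechanism the abstract describes may help: a reasoning tree is grown with best-first search, and sibling steps that share the same prefix are ranked by a step scorer and turned into step-level preference pairs. This is a minimal sketch under stated assumptions, not the authors' implementation; `propose_steps` and `score_step` are hypothetical stand-ins for the paper's candidate generator and step scorer, and the depth and node budgets are arbitrary.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class _Node:
    neg_score: float                    # heapq is a min-heap, so store the negated score
    steps: list = field(compare=False)  # partial reasoning path (list of step strings)

def collect_step_preferences(question, propose_steps, score_step,
                             n_candidates=4, max_depth=8, node_budget=64):
    """Best-first expansion of a reasoning tree; sibling steps that share a
    prefix are ranked by score and turned into step-level preference pairs."""
    frontier = [_Node(0.0, [])]
    pairs = []                          # (question, prefix, preferred_step, rejected_step)
    expanded = 0
    while frontier and expanded < node_budget:
        node = heapq.heappop(frontier)  # most promising partial path so far
        if len(node.steps) >= max_depth:
            continue
        expanded += 1
        candidates = propose_steps(question, node.steps, n=n_candidates)
        scored = sorted(((score_step(question, node.steps, c), c) for c in candidates),
                        reverse=True)
        # Adjacent ranks among sibling steps become preference pairs.
        for (s_hi, c_hi), (s_lo, c_lo) in zip(scored, scored[1:]):
            if s_hi > s_lo:
                pairs.append((question, list(node.steps), c_hi, c_lo))
        # Push every child; the heap decides which branch is expanded next.
        for s, c in scored:
            heapq.heappush(frontier, _Node(-s, node.steps + [c]))
    return pairs
```

A verifier trained on pairs collected this way (e.g., with a ranking loss) can then be used to score and compare complete candidate reasoning paths at inference time.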
Related papers
- Language Model Preference Evaluation with Multiple Weak Evaluators [78.53743237977677]
GED (Preference Graph Ensemble and Denoise) is a novel approach that leverages multiple model-based evaluators to construct preference graphs.
We show that GED outperforms baseline methods in model ranking, response selection, and model alignment tasks.
arXiv Detail & Related papers (2024-10-14T01:57:25Z)
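The summary above does not spell out GED's ensemble or denoising procedure, but the object it operates on can be sketched: pairwise judgments from several weak evaluators are pooled into a weighted, directed preference graph. In the sketch below, conflicting edges are resolved by a simple majority rule, which is only a placeholder for the paper's actual denoising step, and `evaluators` is assumed to be a list of callables that return the preferred response.

```python
from collections import defaultdict
from itertools import combinations

def build_preference_graph(responses, evaluators):
    """Pool pairwise judgments from several weak evaluators into a weighted,
    directed preference graph: an edge (a, b) means 'a is preferred to b'."""
    votes = defaultdict(int)
    for a, b in combinations(responses, 2):
        for judge in evaluators:        # judge(a, b) returns the preferred response
            winner = judge(a, b)
            loser = b if winner == a else a
            votes[(winner, loser)] += 1
    # Crude denoising: keep only the majority direction for each pair.
    graph = {}
    for (w, l), n in votes.items():
        if n > votes.get((l, w), 0):
            graph[(w, l)] = n
    return graph
```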
- ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement [70.09541267910974]
Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities.
Existing self-synthesizing methods suffer from poor generalization to out-of-domain (OOD) reasoning tasks.
We propose Reasoning Generalist via Self-Improvement (ReGenesis), a method to self-synthesize reasoning paths as post-training data.
arXiv Detail & Related papers (2024-10-03T00:09:15Z)
- Learning Deep Tree-based Retriever for Efficient Recommendation: Theory and Method [76.31185707649227]
We propose a Deep Tree-based Retriever (DTR) for efficient recommendation.
DTR frames the training task as a softmax-based multi-class classification over tree nodes at the same level.
To mitigate the suboptimality induced by the labeling of non-leaf nodes, we propose a rectification method for the loss function.
arXiv Detail & Related papers (2024-08-21T05:09:53Z)
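The "softmax-based multi-class classification over tree nodes at the same level" can be illustrated with a small numpy sketch: at each level, the node on the path to the target item is the positive class and its same-level siblings are the negatives. The rectified loss for non-leaf labels that the paper proposes is not reproduced here; this is only the plain level-wise cross-entropy.

```python
import numpy as np

def level_softmax_loss(node_logits, positive_index):
    """Cross-entropy over the nodes of one tree level; `positive_index` marks
    the node on the path to the target item, all other nodes are negatives."""
    logits = np.asarray(node_logits, dtype=float)
    logits -= logits.max()                              # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[positive_index]

# Example: three sibling nodes at one level, the first lies on the target path.
loss = level_softmax_loss([2.0, 0.5, -1.0], positive_index=0)
```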
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring [16.38771834692938]
We propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with black-box scoring systems.
We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree.
We then summarise the intermediate assessment decisions from each thought-tree path to create synthetic rationale data and rationale preference data.
arXiv Detail & Related papers (2024-06-28T14:33:05Z)
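A minimal sketch of that data-construction step, assuming each thought-tree path is reduced to its intermediate assessment decisions plus a predicted score, and assuming (the summary does not say) that rationales from paths reaching the reference score are preferred over those that do not:

```python
def paths_to_rationale_data(paths, reference_score, summarise):
    """paths: list of (decisions, predicted_score) tuples taken from a thought tree.
    summarise: callable that condenses intermediate decisions into a textual rationale."""
    rationales = [(summarise(decisions), score) for decisions, score in paths]
    preferred = [r for r, s in rationales if s == reference_score]
    rejected = [r for r, s in rationales if s != reference_score]
    # Synthetic rationale data: every summarised path.
    rationale_data = [r for r, _ in rationales]
    # Rationale preference data: correct-score rationales beat incorrect-score ones.
    preference_data = [(good, bad) for good in preferred for bad in rejected]
    return rationale_data, preference_data
```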
- PORT: Preference Optimization on Reasoning Traces [1.7292887546437081]
This paper proposes using preference optimization methods on Chain-of-Thought steps in order to improve the reasoning performance of language models.
Our approach leads to increased accuracy on the GSM8K, AQuA-RAT, and ARC benchmarks for Falcon2-11B and Mistral-7B.
arXiv Detail & Related papers (2024-06-23T09:51:06Z)
- Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees [37.297431187924765]
We propose an inference trajectory optimization framework based on the preference data extracted from decision trees.
Our experiments demonstrate that by obtaining insights from errors in inference trees, TP-LLaMA significantly outperforms the baselines.
arXiv Detail & Related papers (2024-06-11T10:00:18Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
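The last sentence refers to the standard DPO objective applied to step-level pairs. For a single pair, and assuming the log-probabilities of the preferred and rejected steps under the current policy and the frozen reference model have already been computed, the loss can be sketched as:

```python
import math

def dpo_loss(logp_pref, logp_rej, ref_logp_pref, ref_logp_rej, beta=0.1):
    """DPO loss for a single (preferred, rejected) pair of reasoning steps,
    given sequence log-probs under the policy and the frozen reference model."""
    margin = beta * ((logp_pref - ref_logp_pref) - (logp_rej - ref_logp_rej))
    return math.log1p(math.exp(-margin))   # equals -log(sigmoid(margin))
```

Here the preferred and rejected completions are sibling steps whose MCTS look-ahead values differ; `beta=0.1` is only a placeholder, not a value taken from the paper.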
- Ladder-of-Thought: Using Knowledge as Steps to Elevate Stance Detection [73.31406286956535]
We introduce the Ladder-of-Thought (LoT) for the stance detection task.
LoT directs small LMs to assimilate high-quality external knowledge, refining the intermediate rationales they produce.
Our empirical evaluations underscore LoT's efficacy, marking a 16% improvement over GPT-3.5 and a 10% improvement over GPT-3.5 with CoT on the stance detection task.
arXiv Detail & Related papers (2023-08-31T14:31:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.