Self-Training Meets Consistency: Improving LLMs' Reasoning with Consistency-Driven Rationale Evaluation
- URL: http://arxiv.org/abs/2411.06387v4
- Date: Thu, 06 Feb 2025 07:07:28 GMT
- Title: Self-Training Meets Consistency: Improving LLMs' Reasoning with Consistency-Driven Rationale Evaluation
- Authors: Jaehyeok Lee, Keisuke Sakaguchi, JinYeong Bak
- Abstract summary: The self-training approach for large language models (LLMs) improves their reasoning abilities by training the models on their self-generated rationales.
Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training.
We propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions.
- Score: 15.124701883286436
- Abstract: The self-training approach for large language models (LLMs) improves their reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. However, a single measure risks misjudging rationale quality, leading the models to learn flawed reasoning patterns. To address this issue, we propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions and leverages this evaluation to guide its training. Specifically, we introduce two methods: (1) filtering out rationales that frequently result in incorrect answers on follow-up questions and (2) preference learning based on mixed preferences from rationale evaluation results of both original and follow-up questions. Experiments on three question-answering datasets using open LLMs show that CREST not only improves the logical robustness and correctness of rationales but also improves reasoning abilities compared to previous self-training approaches.
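The filtering idea in the abstract can be pictured with a short sketch. The snippet below is only an illustration of consistency-driven rationale filtering, not the authors' released code; the helper callables (`follow_ups`, `answer_with_rationale`) and the 0.5 error-rate threshold are assumptions made for this example.

```python
# Minimal sketch of consistency-driven rationale filtering in the spirit of CREST.
# NOT the authors' implementation: the callables and the threshold are assumptions.

from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rationale:
    question: str
    text: str
    answer: str


def filter_by_consistency(
    rationales: List[Rationale],
    follow_ups: Callable[[str], List[Tuple[str, str]]],  # question -> [(follow-up question, gold answer), ...]
    answer_with_rationale: Callable[[str, str], str],     # (rationale text, question) -> predicted answer
    max_error_rate: float = 0.5,                          # assumed threshold, not taken from the paper
) -> List[Rationale]:
    """Keep rationales that rarely lead to wrong answers on follow-up questions."""
    kept = []
    for r in rationales:
        probes = follow_ups(r.question)
        if not probes:
            kept.append(r)  # no follow-ups available, so nothing to filter on
            continue
        wrong = sum(answer_with_rationale(r.text, fq) != gold for fq, gold in probes)
        if wrong / len(probes) <= max_error_rate:
            kept.append(r)
    return kept
```

In this reading, the surviving rationales would then feed the supervised fine-tuning stage, while the per-question evaluation results (original plus follow-up) would form the mixed preference pairs used for preference learning.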
Related papers
- ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification.
Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta [2.1249213103048414]
We introduce the EQUATOR Evaluator, which combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment.
Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations.
Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards.
arXiv Detail & Related papers (2024-12-31T03:56:17Z) - Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.40788744292739]
We propose a two-player paradigm that separates the roles of reasoning and critique models.
We first propose AutoMathCritique, an automated and scalable framework for collecting critique data.
We demonstrate that the critique models consistently improve the actor's performance on difficult queries at test-time.
arXiv Detail & Related papers (2024-11-25T17:11:54Z) - Self-Consistency Preference Optimization [79.37880123635405]
We introduce self-consistency preference optimization (ScPO).
ScPO iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems.
On ZebraLogic, ScPO fine-tunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
arXiv Detail & Related papers (2024-11-06T18:36:22Z) - RL-STaR: Theoretical Analysis of Reinforcement Learning Frameworks for Self-Taught Reasoner [2.779063752888881]
The self-taught reasoner (STaR) framework uses reinforcement learning to automatically generate reasoning steps.
STaR and its variants have demonstrated empirical success, but a theoretical foundation explaining these improvements is lacking.
This work provides a theoretical framework for understanding the effectiveness of reinforcement learning on CoT reasoning and STaR.
arXiv Detail & Related papers (2024-10-31T13:17:53Z) - Rationale-Aware Answer Verification by Pairwise Self-Evaluation [11.763229353978321]
Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers.
arXiv Detail & Related papers (2024-10-07T08:53:00Z) - Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z) - Does Self-Rationalization Improve Robustness to Spurious Correlations? [19.553357015260687]
We ask whether training models to self-rationalize can aid in their learning to solve tasks for the right reasons.
We evaluate robustness to spurious correlations in fine-tuned encoder-decoder and decoder-only models of six different sizes.
We find that while self-rationalization can improve robustness to spurious correlations in low-resource settings, it tends to hurt robustness in higher-resource settings.
arXiv Detail & Related papers (2022-10-24T19:54:57Z) - Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision [60.62434362997016]
We propose a differentiable training framework to create models which output faithful rationales on a sentence level.
Our model solves the task based on each rationale individually and learns to assign high scores to those which solved the task best.
arXiv Detail & Related papers (2020-10-07T12:54:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.