Rationale-Aware Answer Verification by Pairwise Self-Evaluation
- URL: http://arxiv.org/abs/2410.04838v2
- Date: Fri, 25 Oct 2024 09:11:41 GMT
- Title: Rationale-Aware Answer Verification by Pairwise Self-Evaluation
- Authors: Akira Kawabata, Saku Sugawara
- Abstract summary: We show that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers.
- Score: 11.763229353978321
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Answer verification identifies correct solutions among candidates generated by large language models (LLMs). Current approaches typically train verifier models by labeling solutions as correct or incorrect based solely on whether the final answer matches the gold answer. However, this approach neglects any flawed rationale in a solution that yields the correct answer, undermining the verifier's ability to distinguish between sound and flawed rationales. We empirically show that in StrategyQA, only 19% of LLM-generated solutions with correct answers have valid rationales, thus leading to an unreliable verifier. Furthermore, we demonstrate that training a verifier on valid rationales significantly improves its ability to distinguish between valid and flawed rationales. To make a better verifier without extra human supervision, we introduce REPS (Rationale Enhancement through Pairwise Selection), a method for selecting valid rationales from candidates by iteratively applying pairwise self-evaluation using the same LLM that generates the solutions. Verifiers trained on solutions selected by REPS outperform those trained using conventional training methods on three reasoning benchmarks (ARC-Challenge, DROP, and StrategyQA). Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers, which would be critical for models assisting humans in solving complex reasoning tasks.
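The abstract describes REPS only as iteratively applying pairwise self-evaluation to pick a valid rationale from a set of candidates; it does not spell out the procedure. The following is a minimal sketch of one way such a selection loop could look, assuming a single-elimination tournament. The function name `select_rationale` and the `prefer` callback are hypothetical: `prefer` stands in for a call to the generating LLM that judges which of two rationales is more valid, and here it is replaced by a toy length-based comparator so the example runs on its own.

```python
from typing import Callable, List


def select_rationale(
    candidates: List[str],
    prefer: Callable[[str, str], bool],
) -> str:
    """Pick one rationale by repeated pairwise comparison.

    Assumption: a single-elimination tournament. `prefer(a, b)` returns
    True when `a` is judged at least as valid as `b`. In practice this
    would be an LLM self-evaluation call; that part is not shown here.
    """
    pool = list(candidates)
    if not pool:
        raise ValueError("at least one candidate rationale is required")

    # Halve the pool each round until a single rationale remains.
    while len(pool) > 1:
        next_pool = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            next_pool.append(a if prefer(a, b) else b)
        # With an odd pool size, the unpaired last candidate advances.
        if len(pool) % 2 == 1:
            next_pool.append(pool[-1])
        pool = next_pool

    return pool[0]


def toy_prefer(a: str, b: str) -> bool:
    """Toy stand-in judge: prefers the longer rationale (illustration only)."""
    return len(a) >= len(b)


if __name__ == "__main__":
    rationales = [
        "Yes.",
        "Yes, because both events fall in the same century.",
        "Yes, it seems likely.",
    ]
    print(select_rationale(rationales, toy_prefer))
```

Under this assumption, selecting among n candidates takes n - 1 pairwise judgments, and the selected solution would then serve as training data for the verifier in place of one chosen by final-answer match alone.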
Related papers
- Self-Training Meets Consistency: Improving LLMs' Reasoning With Consistency-Driven Rationale Evaluation [15.124701883286436]
Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales.
Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training.
We propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions.
arXiv Detail & Related papers (2024-11-10T08:11:05Z)
- Self-Consistency Preference Optimization [79.37880123635405]
We introduce self-consistency preference optimization (ScPO).
ScPO iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems.
On ZebraLogic, ScPO fine-tunes Llama-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
arXiv Detail & Related papers (2024-11-06T18:36:22Z)
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
- Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs).
We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation.
We demonstrate that GenRM outperforms discriminative verifiers, DPO verifiers, and LLM-as-a-Judge.
arXiv Detail & Related papers (2024-08-27T17:57:45Z)
- CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction [9.44858963874474]
Chain-of-Thought (CoT) prompting enhances the complex reasoning abilities of Large Language Models (LLMs).
We propose the CoT Rerailer to address these challenges, employing self-consistency and multi-agent debate systems.
We demonstrate the effectiveness of our approach across diverse question-answering datasets in various knowledge domains.
arXiv Detail & Related papers (2024-08-25T21:20:17Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- V-STaR: Training Verifiers for Self-Taught Reasoners [71.53113558733227]
V-STaR trains a verifier using DPO that judges correctness of model-generated solutions.
Running V-STaR for multiple iterations results in progressively better reasoners and verifiers.
arXiv Detail & Related papers (2024-02-09T15:02:56Z)
- A Mutual Information Maximization Approach for the Spurious Solution Problem in Weakly Supervised Question Answering [60.768146126094955]
Weakly supervised question answering usually has only the final answers as supervision signals.
There may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance.
We propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions.
arXiv Detail & Related papers (2021-06-14T05:47:41Z)
- Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision [60.62434362997016]
We propose a differentiable training framework to create models which output faithful rationales on a sentence level.
Our model solves the task based on each rationale individually and learns to assign high scores to those which solved the task best.
arXiv Detail & Related papers (2020-10-07T12:54:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.