Related papers: Rationale-Aware Answer Verification by Pairwise Self-Evaluation

Rationale-Aware Answer Verification by Pairwise Self-Evaluation

URL: http://arxiv.org/abs/2410.04838v2
Date: Fri, 25 Oct 2024 09:11:41 GMT
Title: Rationale-Aware Answer Verification by Pairwise Self-Evaluation
Authors: Akira Kawabata, Saku Sugawara,
Abstract summary: We show that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers. Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers.
Score: 11.763229353978321
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Answer verification identifies correct solutions among candidates generated by large language models (LLMs). Current approaches typically train verifier models by labeling solutions as correct or incorrect based solely on whether the final answer matches the gold answer. However, this approach neglects any flawed rationale in the solution yielding the correct answer, undermining the verifier's ability to distinguish between sound and flawed rationales. We empirically show that in StrategyQA, only 19% of LLM-generated solutions with correct answers have valid rationales, thus leading to an unreliable verifier. Furthermore, we demonstrate that training a verifier on valid rationales significantly improves its ability to distinguish valid and flawed rationale. To make a better verifier without extra human supervision, we introduce REPS (Rationale Enhancement through Pairwise Selection), a method for selecting valid rationales from candidates by iteratively applying pairwise self-evaluation using the same LLM that generates the solutions. Verifiers trained on solutions selected by REPS outperform those trained using conventional training methods on three reasoning benchmarks (ARC-Challenge, DROP, and StrategyQA). Our results suggest that training reliable verifiers requires ensuring the validity of rationales in addition to the correctness of the final answers, which would be critical for models assisting humans in solving complex reasoning tasks.

Related papers

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA) In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
Scalable Best-of-N Selection for Large Language Models via Self-Certainty [65.31658824274894]
Best-of-N selection is a key technique for improving the reasoning performance of Large Language Models. We propose self-certainty, a novel and efficient metric to estimate response quality without requiring external reward models. Our findings establish self-certainty as a practical and efficient way for improving LLM reasoning capabilities.
arXiv Detail & Related papers (2025-02-25T19:08:07Z)
ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification [53.80183105328448]
Refine via Intrinsic Self-Verification (ReVISE) is an efficient framework that enables LLMs to self-correct their outputs through self-verification. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
arXiv Detail & Related papers (2025-02-20T13:50:02Z)
Self-Training Meets Consistency: Improving LLMs' Reasoning With Consistency-Driven Rationale Evaluation [15.124701883286436]
Self-training approach for large language models (LLMs) improves reasoning abilities by training the models on their self-generated rationales. Previous approaches have labeled rationales that produce correct answers for a given question as appropriate for training. We propose CREST (Consistency-driven Rationale Evaluation for Self-Training), a self-training framework that further evaluates each rationale through follow-up questions.
arXiv Detail & Related papers (2024-11-10T08:11:05Z)
Self-Consistency Preference Optimization [79.37880123635405]
We introduce self-consistency preference optimization (ScPO) ScPO iteratively trains consistent answers to be preferred over inconsistent ones on unsupervised new problems. On ZebraLogic, ScPO fine Llamatunes-3 8B to be superior to Llama-3 70B, Gemma-2 27B, and Claude-3 Haiku.
arXiv Detail & Related papers (2024-11-06T18:36:22Z)
Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning. LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors. We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
Generative Verifiers: Reward Modeling as Next-Token Prediction [29.543787728397643]
Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs) We propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. We demonstrate that GenRM outperforms discriminative, DPO verifiers, and LLM-as-a-Judge.
arXiv Detail & Related papers (2024-08-27T17:57:45Z)
CoT Rerailer: Enhancing the Reliability of Large Language Models in Complex Reasoning Tasks through Error Detection and Correction [9.44858963874474]
Chain-of-Thought (CoT) prompting enhances Large Language Models (LLMs) complex reasoning abilities. We propose the CoT Rerailer to address these challenges, employing self-consistency and multi-agent debate systems. We demonstrate the effectiveness of our approach across diverse question-answering datasets in various knowledge domains.
arXiv Detail & Related papers (2024-08-25T21:20:17Z)
Prover-Verifier Games improve legibility of LLM outputs [12.532113917099885]
We study legibility in the context of solving grade-school math problems. We propose a training algorithm inspired by Prover-Verifier Game from Anil et al. We show that legibility training transfers to time-constrained humans tasked with verifying solution correctness.
arXiv Detail & Related papers (2024-07-18T16:58:18Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
V-STaR: Training Verifiers for Self-Taught Reasoners [71.53113558733227]
V-STaR trains a verifier using DPO that judges correctness of model-generated solutions. Running V-STaR for multiple iterations results in progressively better reasoners and verifiers.
arXiv Detail & Related papers (2024-02-09T15:02:56Z)
A Mutual Information Maximization Approach for the Spurious Solution Problem in Weakly Supervised Question Answering [60.768146126094955]
Weakly supervised question answering usually has only the final answers as supervision signals. There may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance. We propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions.
arXiv Detail & Related papers (2021-06-14T05:47:41Z)
Why do you think that? Exploring Faithful Sentence-Level Rationales Without Supervision [60.62434362997016]
We propose a differentiable training-framework to create models which output faithful rationales on a sentence level. Our model solves the task based on each rationale individually and learns to assign high scores to those which solved the task best.
arXiv Detail & Related papers (2020-10-07T12:54:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.