Benchmarking and Improving Generator-Validator Consistency of Language
Models
- URL: http://arxiv.org/abs/2310.01846v1
- Date: Tue, 3 Oct 2023 07:23:22 GMT
- Title: Benchmarking and Improving Generator-Validator Consistency of Language
Models
- Authors: Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto,
Percy Liang
- Abstract summary: Inconsistency between generating and validating an answer is prevalent in language models (LMs).
Even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time.
We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%.
- Score: 82.73914625520686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As of September 2023, ChatGPT correctly answers "what is 7+8" with 15, but
when asked "7+8=15, True or False" it responds with "False". This inconsistency
between generating and validating an answer is prevalent in language models
(LMs) and erodes trust. In this paper, we propose a framework for measuring the
consistency between generation and validation (which we call
generator-validator consistency, or GV-consistency), finding that even GPT-4, a
state-of-the-art LM, is GV-consistent only 76% of the time. To improve the
consistency of LMs, we propose to finetune on the filtered generator and
validator responses that are GV-consistent, and call this approach consistency
fine-tuning. We find that this approach improves GV-consistency of Alpaca-30B
from 60% to 93%, and the improvement extrapolates to unseen tasks and domains
(e.g., GV-consistency for positive style transfers extrapolates to unseen
styles like humor). In addition to improving consistency, consistency
fine-tuning improves both generator quality and validator accuracy without
using any labeled data. Evaluated across 6 tasks, including math questions,
knowledge-intensive QA, and instruction following, our method improves the
generator quality by 16% and the validator accuracy by 6.3% across all tasks.
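To make the GV-consistency check concrete, here is a minimal Python sketch of the generator-then-validator loop the abstract describes. The prompt templates and the `lm` callable are illustrative assumptions, not the paper's exact formats:

```python
from typing import Callable

def gv_consistency_rate(lm: Callable[[str], str], questions: list[str]) -> float:
    """Fraction of queries where the validator affirms the generator's own answer.

    `lm` is a hypothetical stand-in for a language-model call (prompt -> text).
    """
    consistent = 0
    for q in questions:
        answer = lm(q).strip()                                 # generator pass
        verdict = lm(f"{q} Answer: {answer}. True or False?")  # validator pass
        consistent += verdict.strip().lower().startswith("true")
    return consistent / len(questions)

# Toy usage with a stand-in "model" that is consistent by construction:
toy_lm = lambda p: "True" if "True or False" in p else "15"
print(gv_consistency_rate(toy_lm, ["What is 7+8?"]))  # 1.0
```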
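Consistency fine-tuning then keeps only the rounds that pass this check and fine-tunes on them. A minimal sketch of the data-filtering step, again assuming the simplified prompt formats above rather than the paper's own:

```python
def build_consistency_finetuning_data(lm, questions):
    """Collect (prompt, response) pairs only from GV-consistent rounds.

    The resulting filtered set is what consistency fine-tuning trains on;
    the templates here are illustrative assumptions.
    """
    data = []
    for q in questions:
        answer = lm(q).strip()
        validator_prompt = f"{q} Answer: {answer}. True or False?"
        verdict = lm(validator_prompt).strip()
        if verdict.lower().startswith("true"):
            # Both the generator and validator sides of a consistent
            # round become fine-tuning examples.
            data.append({"prompt": q, "response": answer})
            data.append({"prompt": validator_prompt, "response": verdict})
    return data
```

Note that the filter relies only on agreement between the model's own generator and validator roles, which is why no labeled data is needed.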
Related papers
- Self-Supervised Learning Based Handwriting Verification [23.983430206133793]
We show that a ResNet-based Variational Auto-Encoder (VAE) outperforms other generative approaches, achieving 76.3% accuracy.
Using a pre-trained VAE and VICReg for the downstream task of writer verification, we observed relative accuracy improvements of 6.7% and 9% over a ResNet-18 supervised baseline with 10% writer labels.
arXiv Detail & Related papers (2024-05-28T16:11:11Z)
- Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability [58.582216812183496]
Language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims.
However, current LMs also generate incorrect or nonsensical content, and they are difficult to edit and bring up to date.
We present a method called Deductive Closure Training (DCT) that uses LMs themselves to identify implications of (and contradictions within) the text that they generate.
arXiv Detail & Related papers (2024-01-16T18:58:37Z)
- Progressive-Hint Prompting Improves Reasoning in Large Language Models [63.98629132836499]
This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP).
It enables automatic multiple interactions between users and Large Language Models (LLMs) by using previously generated answers as hints to progressively guide the model toward the correct answers (a minimal sketch follows this list).
We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient.
arXiv Detail & Related papers (2023-04-19T16:29:48Z)
- Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts the performance of Language Models (LMs) on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z)
- WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning [40.5830891229718]
We propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck.
Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement over previous state-of-the-art methods on the TRUE benchmark on average.
arXiv Detail & Related papers (2022-12-20T08:04:36Z)
- Self-Consistency Improves Chain of Thought Reasoning in Language Models [53.45015291520658]
We explore a simple ensemble strategy, self-consistency, that significantly improves the reasoning accuracy of large language models (see the sketch after this list).
On arithmetic and commonsense reasoning benchmarks, we find that self-consistency yields significant accuracy improvements.
arXiv Detail & Related papers (2022-03-21T17:48:52Z)
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
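For reference, a minimal sketch of the Progressive-Hint Prompting loop summarized above; the hint wording and the stopping rule (two consecutive identical answers) are assumptions based on the summary, not the paper's exact design:

```python
def progressive_hint_answer(lm, question, max_rounds=4):
    """Re-ask the question with all previous answers appended as hints,
    stopping once two consecutive answers agree."""
    answer = lm(question).strip()
    hints = []
    for _ in range(max_rounds):
        hints.append(answer)
        hinted = f"{question} (Hint: the answer is near {', '.join(hints)}.)"
        new_answer = lm(hinted).strip()
        if new_answer == answer:  # converged
            break
        answer = new_answer
    return answer
```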
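And a minimal sketch of self-consistency as summarized above: sample several responses to the same question and return the majority answer. It assumes `lm` samples stochastically and returns only the final answer, which simplifies away the reasoning-path marginalization in the original paper:

```python
from collections import Counter

def self_consistency_answer(lm, question, n_samples=5):
    """Majority vote over several independently sampled answers."""
    votes = Counter(lm(question).strip() for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```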
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.