Benchmarking and Improving Generator-Validator Consistency of Language
Models
- URL: http://arxiv.org/abs/2310.01846v1
- Date: Tue, 3 Oct 2023 07:23:22 GMT
- Title: Benchmarking and Improving Generator-Validator Consistency of Language
Models
- Authors: Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto,
Percy Liang
- Abstract summary: Inconsistency between generating and validating an answer is prevalent in language models (LMs).
Even GPT-4, a state-of-the-art LM, is GV-consistent only 76% of the time.
We find that this approach improves GV-consistency of Alpaca-30B from 60% to 93%.
- Score: 82.73914625520686
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As of September 2023, ChatGPT correctly answers "what is 7+8" with 15, but
when asked "7+8=15, True or False" it responds with "False". This inconsistency
between generating and validating an answer is prevalent in language models
(LMs) and erodes trust. In this paper, we propose a framework for measuring the
consistency between generation and validation (which we call
generator-validator consistency, or GV-consistency), finding that even GPT-4, a
state-of-the-art LM, is GV-consistent only 76% of the time. To improve the
consistency of LMs, we propose to finetune on the filtered generator and
validator responses that are GV-consistent, and call this approach consistency
fine-tuning. We find that this approach improves GV-consistency of Alpaca-30B
from 60% to 93%, and the improvement extrapolates to unseen tasks and domains
(e.g., GV-consistency for positive style transfers extrapolates to unseen
styles like humor). In addition to improving consistency, consistency
fine-tuning improves both generator quality and validator accuracy without
using any labeled data. Evaluated across 6 tasks, including math questions,
knowledge-intensive QA, and instruction following, our method improves the
generator quality by 16% and the validator accuracy by 6.3% across all tasks.
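To make the GV-consistency check concrete, here is a minimal Python sketch of the generator-then-validator loop the abstract describes. The prompt templates and the `lm` callable are illustrative assumptions, not the paper's exact formats:

```python
from typing import Callable

def gv_consistency_rate(lm: Callable[[str], str], questions: list[str]) -> float:
    """Fraction of queries where the validator affirms the generator's own answer.

    `lm` is a hypothetical stand-in for a language-model call (prompt -> text).
    """
    consistent = 0
    for q in questions:
        answer = lm(q).strip()                                 # generator pass
        verdict = lm(f"{q} Answer: {answer}. True or False?")  # validator pass
        consistent += verdict.strip().lower().startswith("true")
    return consistent / len(questions)

# Toy usage with a stand-in "model" that is consistent by construction:
toy_lm = lambda p: "True" if "True or False" in p else "15"
print(gv_consistency_rate(toy_lm, ["What is 7+8?"]))  # 1.0
```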
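Consistency fine-tuning then keeps only the rounds that pass this check and fine-tunes on them. A minimal sketch of the data-filtering step, again assuming the simplified prompt formats above rather than the paper's own:

```python
def build_consistency_finetuning_data(lm, questions):
    """Collect (prompt, response) pairs only from GV-consistent rounds.

    The resulting filtered set is what consistency fine-tuning trains on;
    the templates here are illustrative assumptions.
    """
    data = []
    for q in questions:
        answer = lm(q).strip()
        validator_prompt = f"{q} Answer: {answer}. True or False?"
        verdict = lm(validator_prompt).strip()
        if verdict.lower().startswith("true"):
            # Both the generator and validator sides of a consistent
            # round become fine-tuning examples.
            data.append({"prompt": q, "response": answer})
            data.append({"prompt": validator_prompt, "response": verdict})
    return data
```

Note that the filter relies only on agreement between the model's own generator and validator roles, which is why no labeled data is needed.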
Related papers
- Self-Supervised Learning Based Handwriting Verification [23.983430206133793]
We show that a ResNet-based Variational Auto-Encoder (VAE) outperforms other generative approaches, achieving 76.3% accuracy.
Using a pre-trained VAE and VICReg for the downstream task of writer verification, we observed relative accuracy improvements of 6.7% and 9% over a ResNet-18 supervised baseline with 10% writer labels.
arXiv Detail & Related papers (2024-05-28T16:11:11Z)
- Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability [58.582216812183496]
Language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims.
However, current LMs also generate incorrect or nonsensical content, and they are difficult to edit and bring up to date.
We present a method called Deductive Closure Training (DCT) that uses LMs themselves to identify implications of (and contradictions within) the text that they generate.
arXiv Detail & Related papers (2024-01-16T18:58:37Z)
- Progressive-Hint Prompting Improves Reasoning in Large Language Models [63.98629132836499]
This paper proposes a new prompting method, named Progressive-Hint Prompting (PHP).
It enables automatic multiple interactions between users and Large Language Models (LLMs) by using previously generated answers as hints to progressively guide the model toward the correct answers (a minimal sketch follows this list).
We conducted extensive and comprehensive experiments on seven benchmarks. The results show that PHP significantly improves accuracy while remaining highly efficient.
arXiv Detail & Related papers (2023-04-19T16:29:48Z)
- Faithful Chain-of-Thought Reasoning [51.21714389639417]
Chain-of-Thought (CoT) prompting boosts the performance of Language Models (LMs) on a gamut of reasoning tasks.
We propose Faithful CoT, a reasoning framework involving two stages: Translation and Problem Solving.
This guarantees that the reasoning chain provides a faithful explanation of the final answer.
arXiv Detail & Related papers (2023-01-31T03:04:26Z)
- WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning [40.5830891229718]
We propose a weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck.
Comprehensive experiments on a variety of tasks demonstrate the strong performance of WeCheck, which achieves a 3.4% absolute improvement over previous state-of-the-art methods on the TRUE benchmark on average.
arXiv Detail & Related papers (2022-12-20T08:04:36Z)
- Self-Consistency Improves Chain of Thought Reasoning in Language Models [53.45015291520658]
We explore a simple ensemble strategy, self-consistency, that significantly improves the reasoning accuracy of large language models (see the sketch after this list).
On arithmetic and commonsense reasoning benchmarks, we find that self-consistency yields significant accuracy improvements.
arXiv Detail & Related papers (2022-03-21T17:48:52Z)
- PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
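For reference, a minimal sketch of the Progressive-Hint Prompting loop summarized above; the hint wording and the stopping rule (two consecutive identical answers) are assumptions based on the summary, not the paper's exact design:

```python
def progressive_hint_answer(lm, question, max_rounds=4):
    """Re-ask the question with all previous answers appended as hints,
    stopping once two consecutive answers agree."""
    answer = lm(question).strip()
    hints = []
    for _ in range(max_rounds):
        hints.append(answer)
        hinted = f"{question} (Hint: the answer is near {', '.join(hints)}.)"
        new_answer = lm(hinted).strip()
        if new_answer == answer:  # converged
            break
        answer = new_answer
    return answer
```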
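And a minimal sketch of self-consistency as summarized above: sample several responses to the same question and return the majority answer. It assumes `lm` samples stochastically and returns only the final answer, which simplifies away the reasoning-path marginalization in the original paper:

```python
from collections import Counter

def self_consistency_answer(lm, question, n_samples=5):
    """Majority vote over several independently sampled answers."""
    votes = Counter(lm(question).strip() for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```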
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.