Self-Evaluation Improves Selective Generation in Large Language Models
- URL: http://arxiv.org/abs/2312.09300v1
- Date: Thu, 14 Dec 2023 19:09:22 GMT
- Title: Self-Evaluation Improves Selective Generation in Large Language Models
- Authors: Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, Balaji Lakshminarayanan
- Abstract summary: We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
- Score: 54.003992911447696
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe deployment of large language models (LLMs) may benefit from a reliable
method for assessing their generated content to determine when to abstain or to
selectively generate. While likelihood-based metrics such as perplexity are
widely employed, recent research has demonstrated the limitations of using
sequence-level probability estimates given by LLMs as reliable indicators of
generation quality. Conversely, LLMs have demonstrated strong calibration at
the token level, particularly when it comes to choosing correct answers in
multiple-choice questions or evaluating true/false statements. In this work, we
reformulate open-ended generation tasks into token-level prediction tasks, and
leverage LLMs' superior calibration at the token level. We instruct an LLM to
self-evaluate its answers, employing either a multi-way comparison or a
point-wise evaluation approach, with the option to include a "None of the
above" option to express the model's uncertainty explicitly. We benchmark a
range of scoring methods based on self-evaluation and evaluate their
performance in selective generation using TruthfulQA and TL;DR. Through
experiments with PaLM-2 and GPT-3, we demonstrate that self-evaluation based
scores not only improve accuracy, but also correlate better with the overall
quality of generated content.
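The scoring idea in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the normalization over only the candidate tokens, and the abstention threshold are all assumptions, and the log-probabilities are placeholder values standing in for what an LLM would return for the "Yes"/"No" or option-letter tokens.

```python
import math
from typing import Dict, List, Optional, Tuple


def pointwise_score(logp_yes: float, logp_no: float) -> float:
    """Point-wise self-evaluation: probability mass on "Yes" after
    renormalizing over the two candidate tokens (an assumption here)."""
    m = max(logp_yes, logp_no)  # subtract the max for numerical stability
    e_yes = math.exp(logp_yes - m)
    e_no = math.exp(logp_no - m)
    return e_yes / (e_yes + e_no)


def multiway_scores(option_logps: Dict[str, float]) -> Dict[str, float]:
    """Multi-way comparison: softmax over the option tokens, which may
    include a "None of the above" option."""
    m = max(option_logps.values())
    exps = {k: math.exp(v - m) for k, v in option_logps.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def select_or_abstain(
    scored: List[Tuple[str, float]], threshold: float
) -> Optional[str]:
    """Selective generation: return the highest-scoring answer, or None
    (abstain) when its score falls below the threshold."""
    if not scored:
        return None
    best_answer, best_score = max(scored, key=lambda pair: pair[1])
    return best_answer if best_score >= threshold else None


if __name__ == "__main__":
    # Hypothetical token log-probabilities for two sampled answers.
    candidates = [
        ("answer one", pointwise_score(math.log(0.6), math.log(0.2))),
        ("answer two", pointwise_score(math.log(0.3), math.log(0.5))),
    ]
    print(select_or_abstain(candidates, threshold=0.7))
    print(multiway_scores({"A": -0.5, "B": -1.5, "None of the above": -3.0}))
```

Raising the threshold makes the system abstain more often in exchange for higher expected quality on the answers it does return, which is the trade-off selective generation is evaluated on.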
Related papers
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs [1.515687944002438]
We propose Contrastive Semantic Similarity, a module to obtain similarity features for measuring uncertainty for text pairs.
We conduct extensive experiments with three large language models (LLMs) on several benchmark question-answering datasets.
Results show that our proposed method performs better in estimating reliable responses of LLMs than comparable baselines.
arXiv Detail & Related papers (2024-06-05T11:35:44Z)
- Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation [37.63939774027709]
We propose enhancing the predicted sequence probability by assigning different weights to various tokens.
We refer to this new score as the Contextualized Sequence Likelihood (CSL).
arXiv Detail & Related papers (2024-06-03T21:55:07Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs).
We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z)
- Calibrating LLM-Based Evaluator [92.17397504834825]
We propose AutoCalibrate, a multi-stage, gradient-free approach to calibrate and align an LLM-based evaluator toward human preference.
Instead of explicitly modeling human preferences, we first implicitly encompass them within a set of human labels.
Our experiments on multiple text quality evaluation datasets illustrate a significant improvement in correlation with expert evaluation through calibration.
arXiv Detail & Related papers (2023-09-23T08:46:11Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Out-of-Distribution Detection and Selective Generation for Conditional Language Models [40.15896981028647]
Conditional language models (CLMs) are predominantly trained to classify the next token in an output sequence.
We present a highly accurate and lightweight OOD detection method for CLMs.
We show how our method can be used under the common and realistic setting of distribution shift for selective generation of high-quality outputs.
arXiv Detail & Related papers (2022-09-30T16:17:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.