Semantic Consistency for Assuring Reliability of Large Language Models
- URL: http://arxiv.org/abs/2308.09138v1
- Date: Thu, 17 Aug 2023 18:11:33 GMT
- Title: Semantic Consistency for Assuring Reliability of Large Language Models
- Authors: Harsh Raj, Vipul Gupta, Domenic Rosati, Subhabrata Majumdar
- Abstract summary: Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks.
We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs.
We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
- Score: 9.876355290198639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) exhibit remarkable fluency and competence across
various natural language tasks. However, recent research has highlighted their
sensitivity to variations in input prompts. To deploy LLMs in a safe and
reliable manner, it is crucial for their outputs to be consistent when prompted
with expressions that carry the same meaning or intent. While some existing
work has explored how state-of-the-art LLMs address this issue, their
evaluations have been confined to assessing lexical equality of single- or
multi-word answers, overlooking the consistency of generative text sequences.
For a more comprehensive understanding of the consistency of LLMs in open-ended
text generation scenarios, we introduce a general measure of semantic
consistency, and formulate multiple versions of this metric to evaluate the
performance of various LLMs. Our proposal demonstrates significantly higher
consistency and stronger correlation with human evaluations of output
consistency than traditional metrics based on lexical consistency. Finally, we
propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance
semantic consistency. When evaluated for closed-book question answering based
on answer variations from the TruthfulQA benchmark, A2C increases accuracy
metrics for pretrained and finetuned LLMs by up to 47%, and semantic
consistency metrics for instruction-tuned models by up to 7-fold.
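The abstract does not spell out the metric's exact form or the A2C prompt. As a minimal sketch only (not the authors' implementation), semantic consistency over paraphrased prompts can be scored as the mean pairwise embedding similarity of the generated answers, and an A2C-style prompt can present candidate answers for the model to choose from. The embedding model, the `semantic_consistency` helper, and the prompt wording below are all illustrative assumptions.
```python
# Hedged sketch: mean pairwise semantic similarity across outputs generated
# from paraphrased prompts. NOT the paper's exact metric; the encoder and
# helper names are illustrative assumptions.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def semantic_consistency(outputs: list[str]) -> float:
    """Mean pairwise cosine similarity over outputs to paraphrased prompts."""
    embs = embedder.encode(outputs, convert_to_tensor=True)
    pairs = list(combinations(range(len(outputs)), 2))
    if not pairs:
        return 1.0
    sims = [util.cos_sim(embs[i], embs[j]).item() for i, j in pairs]
    return sum(sims) / len(sims)

# One plausible reading of Ask-to-Choose (A2C): rather than free-form
# generation, present candidate answers and ask the model to pick one.
def a2c_prompt(question: str, candidates: list[str]) -> str:
    opts = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (f"Question: {question}\n"
            f"Choose the best answer from the options below.\n{opts}\nAnswer:")
```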
Related papers
- Evaluating Consistencies in LLM responses through a Semantic Clustering of Question Answering [1.9214041945441436]
We present a new approach for evaluating the semantic consistency of Large Language Model (LLM) responses.
Our approach evaluates whether LLM responses are semantically congruent for a given question, recognizing that syntactically different sentences may convey the same meaning.
Using the TruthfulQA dataset to assess LLM responses, the study elicits N responses per question and clusters semantically equivalent sentences to measure semantic consistency across 37 categories.
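A minimal sketch of the clustering idea described above; the clustering algorithm, distance threshold, and scoring rule are assumptions for illustration, not the paper's method.
```python
# Hedged sketch: embed N responses per question, group semantically
# equivalent ones, and score consistency by the largest cluster's share.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def cluster_consistency(responses: list[str], threshold: float = 0.3) -> float:
    """Fraction of responses falling in the largest semantic cluster."""
    embs = embedder.encode(responses, normalize_embeddings=True)
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=threshold,
        metric="cosine", linkage="average",
    ).fit_predict(embs)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / len(responses)
```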
arXiv Detail & Related papers (2024-10-20T16:21:25Z)
- MM-R$^3$: On (In-)Consistency of Multi-modal Large Language Models (MLLMs) [26.475993408532304]
We study the ability of an MLLM to produce semantically similar or identical responses to semantically similar queries.
We propose the MM-R$^3$ benchmark, which analyses the consistency and accuracy of SoTA MLLMs.
Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa.
arXiv Detail & Related papers (2024-10-07T06:36:55Z)
- AXCEL: Automated eXplainable Consistency Evaluation using LLMs [6.382787013075262]
Large Language Models (LLMs) are widely used in both industry and academia for various tasks.
This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric that explains its consistency scores by providing detailed reasoning.
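AXCEL is described here only at a high level; below is a hedged sketch of what a prompt-based, explanation-producing consistency judge might look like. The template wording is an assumption, not AXCEL's actual prompt.
```python
# Hedged sketch of a prompt-based consistency judge in the spirit of AXCEL;
# the template text is an assumption, not the paper's actual prompt.
def consistency_judge_prompt(source: str, summary: str) -> str:
    return (
        "You will rate how consistent a summary is with its source.\n"
        f"Source:\n{source}\n\nSummary:\n{summary}\n\n"
        "First explain, step by step, which claims in the summary are or\n"
        "are not supported by the source. Then give a consistency score\n"
        "from 1 (contradicts the source) to 5 (fully supported), as\n"
        "'Score: <n>'."
    )
```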
arXiv Detail & Related papers (2024-09-25T14:45:52Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [8.019873464066308]
We introduce two metrics for classification tasks, namely sensitivity and consistency.
Sensitivity measures changes in predictions across rephrasings of the prompt.
Consistency, in contrast, measures how predictions vary across rephrasings for elements of the same class.
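A hedged sketch of one plausible reading of these two metrics (not the paper's exact definitions): `predictions[i][j]` is the label the model assigns to example `i` under prompt rephrasing `j`.
```python
# Hedged sketch: illustrative definitions only, not the paper's formulas.
from collections import Counter
from itertools import combinations

def sensitivity(predictions: list[list[str]]) -> float:
    """Fraction of rephrasing pairs where the prediction flips, per example."""
    flips, pairs = 0, 0
    for preds in predictions:
        for a, b in combinations(preds, 2):
            flips += a != b
            pairs += 1
    return flips / max(pairs, 1)

def consistency(predictions: list[list[str]], labels: list[str]) -> float:
    """How uniformly examples of the same gold class are predicted."""
    by_class: dict[str, list[str]] = {}
    for preds, y in zip(predictions, labels):
        by_class.setdefault(y, []).extend(preds)
    scores = [Counter(preds).most_common(1)[0][1] / len(preds)
              for preds in by_class.values()]
    return sum(scores) / len(scores)
```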
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
- RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages projections of Large Language Model (LLM) representations for evaluation.
Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous substring of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
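The paper finetunes a model to produce the LSS; as a purely lexical proxy for illustration only, the supported fraction of a claim can be approximated with a longest-common-subsequence over tokens. The `faithfulness` helper below is an assumption, not the paper's metric.
```python
# Hedged sketch: token-level LCS as a simplified stand-in for the Longest
# Supported Subsequence (the paper uses a finetuned model instead).
def lcs_length(a: list[str], b: list[str]) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if x == y
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[len(a)][len(b)]

def faithfulness(claim: str, context: str) -> float:
    """Fraction of the claim's tokens recoverable as a supported subsequence."""
    c, ctx = claim.lower().split(), context.lower().split()
    return lcs_length(c, ctx) / max(len(c), 1)
```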
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
- Measuring Reliability of Large Language Models through Semantic Consistency [3.4990427823966828]
We develop a measure of semantic consistency that allows the comparison of open-ended text outputs.
We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions.
arXiv Detail & Related papers (2022-11-10T20:21:07Z)