Measuring Reliability of Large Language Models through Semantic
Consistency
- URL: http://arxiv.org/abs/2211.05853v2
- Date: Tue, 11 Apr 2023 18:53:23 GMT
- Title: Measuring Reliability of Large Language Models through Semantic
Consistency
- Authors: Harsh Raj, Domenic Rosati, Subhabrata Majumdar
- Abstract summary: We develop a measure of semantic consistency that allows the comparison of open-ended text outputs.
We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions.
- Score: 3.4990427823966828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large pretrained language models (PLMs) demonstrate incredible fluency
and performance on many natural language tasks, recent work has shown that
well-performing PLMs are very sensitive to the prompts that are fed into them.
Even when prompts are semantically identical, language models may give very
different answers. When considering safe and trustworthy deployments of PLMs we
would like their outputs to be consistent under prompts that mean the same
thing or convey the same intent. While some work has looked into how
state-of-the-art PLMs address this need, they have been limited to only
evaluating lexical equality of single- or multi-word answers and do not address
consistency of generative text sequences. In order to understand consistency of
PLMs under text generation settings, we develop a measure of semantic
consistency that allows the comparison of open-ended text outputs. We implement
several versions of this consistency metric to evaluate the performance of a
number of PLMs on paraphrased versions of questions in the TruthfulQA dataset.
We find that our proposed metrics are considerably more consistent than
traditional metrics embodying lexical consistency, and also correlate with
human evaluation of output consistency to a higher degree.
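A minimal sketch of the core idea: generate one answer per paraphrase of a question, then score how semantically stable the answers are. The embedding-and-pairwise-cosine aggregation below is an illustrative choice (the paper implements several versions of the metric, including entailment-based ones), and it assumes the sentence-transformers package.

```python
# Illustrative semantic consistency score over open-ended generations
# produced from paraphrases of the same question. The pairwise-cosine
# aggregation is an assumption, not the paper's exact metric.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_consistency(answers: list[str]) -> float:
    """Mean pairwise cosine similarity of the model's answers."""
    if len(answers) < 2:
        return 1.0
    embeddings = encoder.encode(answers, convert_to_tensor=True)
    pairs = list(combinations(range(len(answers)), 2))
    sims = [util.cos_sim(embeddings[i], embeddings[j]).item() for i, j in pairs]
    return sum(sims) / len(sims)

# Usage: answers sampled from paraphrases of one TruthfulQA question.
answers = [
    "The Great Wall is not visible from space with the naked eye.",
    "No, you cannot see the Great Wall of China from space unaided.",
    "Astronauts report the Great Wall cannot be seen without aid.",
]
print(f"semantic consistency: {semantic_consistency(answers):.3f}")
```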
Related papers
- AXCEL: Automated eXplainable Consistency Evaluation using LLMs [6.382787013075262]
Large Language Models (LLMs) are widely used in both industry and academia for various tasks.
This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL)
AXCEL is a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning.
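A hypothetical sketch of such a prompt-based judge in the spirit of AXCEL; the prompt wording, the 1-5 scale, and the `call_llm` helper are placeholders, not the paper's exact protocol.

```python
# Prompt-based consistency judging: the judge model explains its
# reasoning before emitting a score. All names here are illustrative.
AXCEL_STYLE_PROMPT = """You are checking whether a generated text is consistent with its source.

Source:
{source}

Generated text:
{generation}

First explain, step by step, which claims in the generated text are or are not
supported by the source. Then output a consistency score from 1 (contradicts
the source) to 5 (fully consistent) on a final line as 'Score: <number>'."""

def judge_consistency(source: str, generation: str, call_llm) -> str:
    """Return the judge model's explanation and score as raw text."""
    return call_llm(AXCEL_STYLE_PROMPT.format(source=source, generation=generation))
```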
arXiv Detail & Related papers (2024-09-25T14:45:52Z)
- What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [8.019873464066308]
We introduce two metrics for classification tasks, namely sensitivity and consistency.
Sensitivity measures how predictions change across rephrasings of the prompt.
Consistency measures how predictions vary across rephrasings for elements of the same class.
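One simplified reading of these two metrics, written as plain functions over predicted labels; the exact definitions below are assumptions for illustration, not the paper's formal metrics.

```python
# Simplified, assumed implementations of prompt sensitivity and
# same-class consistency for a classifier.
from collections import Counter

def sensitivity(predictions: list[str]) -> float:
    """Fraction of prompt rephrasings whose prediction deviates from the majority."""
    _, majority_count = Counter(predictions).most_common(1)[0]
    return 1.0 - majority_count / len(predictions)

def consistency(predictions_per_example: list[list[str]]) -> float:
    """Average agreement across rephrasings, over examples of the same class."""
    agreements = []
    for predictions in predictions_per_example:
        _, majority_count = Counter(predictions).most_common(1)[0]
        agreements.append(majority_count / len(predictions))
    return sum(agreements) / len(agreements)
```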
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
- PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
- Contrastive Instruction Tuning [61.97704869248903]
We propose Contrastive Instruction Tuning to maximize the similarity between semantically equivalent instruction-instance pairs.
Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
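A rough sketch of a contrastive objective of this kind, assuming PyTorch and precomputed hidden representations for an anchor instruction, a semantically equivalent paraphrase (positive), and unrelated instructions (negatives); this InfoNCE-style form is an assumption, not necessarily CoIN's exact formulation.

```python
# InfoNCE-style loss pulling semantically equivalent instruction pairs
# together and pushing unrelated instructions apart.
import torch
import torch.nn.functional as F

def contrastive_instruction_loss(anchor, positive, negatives, temperature=0.07):
    anchor = F.normalize(anchor, dim=-1)        # (d,)
    positive = F.normalize(positive, dim=-1)    # (d,)
    negatives = F.normalize(negatives, dim=-1)  # (n, d)
    pos_sim = (anchor @ positive) / temperature            # scalar
    neg_sim = (negatives @ anchor) / temperature            # (n,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])     # (n + 1,)
    # The positive pair sits at index 0 of the logits.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
```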
arXiv Detail & Related papers (2024-02-17T00:09:32Z)
- Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval for evaluating syntactic capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
- Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
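One illustrative distribution-level comparison, not a metric proposed by the paper: the Jensen-Shannon distance between the word-frequency distributions of generated and real user texts, using scipy.

```python
# Compare generated traffic to real user texts at the distribution
# level rather than pairwise (as BLEU would).
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_distance(generated: list[str], real: list[str]) -> float:
    gen_counts = Counter(w for t in generated for w in t.lower().split())
    real_counts = Counter(w for t in real for w in t.lower().split())
    vocab = sorted(set(gen_counts) | set(real_counts))
    p = np.array([gen_counts[w] for w in vocab], dtype=float)
    q = np.array([real_counts[w] for w in vocab], dtype=float)
    return float(jensenshannon(p / p.sum(), q / q.sum()))
```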
arXiv Detail & Related papers (2023-11-21T11:26:26Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate Longest Supported Subsequence (LSS)
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
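The paper finetunes a model to generate the LSS; the token-level longest-common-subsequence computation below is only a crude, non-learned stand-in that illustrates the idea of scoring a claim by how much of it the context can support.

```python
# Fraction of claim tokens covered by the longest (noncontinuous)
# subsequence also present, in order, in the context.
def lcs_support_ratio(claim: str, context: str) -> float:
    a, b = claim.lower().split(), context.lower().split()
    # Classic dynamic-programming longest common subsequence over tokens.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if tok_a == tok_b else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)] / len(a)
```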
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Semantic Consistency for Assuring Reliability of Large Language Models [9.876355290198639]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks.
We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs.
We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z)
- MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating few-shot text classification task into text pair relevance estimation task.
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
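A rough sketch of classification via pairwise relevance, assuming a `relevance(text_a, text_b)` scorer backed by a prompting model; the mean-pooling of scores per label is an illustrative choice, not necessarily MetricPrompt's exact aggregation.

```python
# Pick the label whose few-shot examples are, on average, most relevant
# to the query text.
from collections import defaultdict

def classify(query: str, support_set: list[tuple[str, str]], relevance) -> str:
    """support_set holds (text, label) pairs from the few-shot training data."""
    scores = defaultdict(list)
    for text, label in support_set:
        scores[label].append(relevance(query, text))
    return max(scores, key=lambda label: sum(scores[label]) / len(scores[label]))
```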
arXiv Detail & Related papers (2023-06-15T06:51:35Z)
- MURMUR: Modular Multi-Step Reasoning for Semi-Structured Data-to-Text Generation [102.20036684996248]
We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning.
We conduct experiments on two data-to-text generation tasks, WebNLG and LogicNLG.
arXiv Detail & Related papers (2022-12-16T17:36:23Z)
- Towards Computationally Verifiable Semantic Grounding for Language Models [18.887697890538455]
The paper conceptualizes the LM as a conditional model generating text given a desired semantic message formalized as a set of entity-relationship triples.
It embeds the LM in an auto-encoder by feeding its output to a semantic parser whose output is in the same representation domain as the input message.
We show that our proposed approaches significantly improve on the greedy search baseline.
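A schematic view of that auto-encoding consistency check, with `generate_text` and `semantic_parser` as placeholder components; the triple-overlap F1 used here is an illustrative grounding score, not the paper's exact objective.

```python
# Round-trip check: triples -> text (LM) -> triples (parser), then
# compare the recovered triples to the input message.
def grounding_score(input_triples: set[tuple[str, str, str]],
                    generate_text, semantic_parser) -> float:
    text = generate_text(input_triples)    # LM: triples -> text
    recovered = semantic_parser(text)      # parser: text -> triples
    if not recovered:
        return 0.0
    precision = len(recovered & input_triples) / len(recovered)
    recall = len(recovered & input_triples) / len(input_triples)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```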
arXiv Detail & Related papers (2022-11-16T17:35:52Z)