Measuring Reliability of Large Language Models through Semantic
Consistency
- URL: http://arxiv.org/abs/2211.05853v2
- Date: Tue, 11 Apr 2023 18:53:23 GMT
- Title: Measuring Reliability of Large Language Models through Semantic
Consistency
- Authors: Harsh Raj, Domenic Rosati, Subhabrata Majumdar
- Abstract summary: We develop a measure of semantic consistency that allows the comparison of open-ended text outputs.
We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions.
- Score: 3.4990427823966828
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large pretrained language models (PLMs) demonstrate incredible fluency
and performance on many natural language tasks, recent work has shown that
well-performing PLMs are very sensitive to the prompts that are fed into them.
Even when prompts are semantically identical, language models may give very
different answers. When considering safe and trustworthy deployments of PLMs we
would like their outputs to be consistent under prompts that mean the same
thing or convey the same intent. While some work has looked into how
state-of-the-art PLMs address this need, they have been limited to only
evaluating lexical equality of single- or multi-word answers and do not address
consistency of generative text sequences. In order to understand consistency of
PLMs under text generation settings, we develop a measure of semantic
consistency that allows the comparison of open-ended text outputs. We implement
several versions of this consistency metric to evaluate the performance of a
number of PLMs on paraphrased versions of questions in the TruthfulQA dataset.
We find that our proposed metrics are considerably more consistent than
traditional metrics embodying lexical consistency, and also correlate with
human evaluation of output consistency to a higher degree.
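A minimal sketch of the core idea: generate one answer per paraphrase of a question, then score how semantically stable the answers are. The embedding-and-pairwise-cosine aggregation below is an illustrative choice (the paper implements several versions of the metric, including entailment-based ones), and it assumes the sentence-transformers package.

```python
# Illustrative semantic consistency score over open-ended generations
# produced from paraphrases of the same question. The pairwise-cosine
# aggregation is an assumption, not the paper's exact metric.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_consistency(answers: list[str]) -> float:
    """Mean pairwise cosine similarity of the model's answers."""
    if len(answers) < 2:
        return 1.0
    embeddings = encoder.encode(answers, convert_to_tensor=True)
    pairs = list(combinations(range(len(answers)), 2))
    sims = [util.cos_sim(embeddings[i], embeddings[j]).item() for i, j in pairs]
    return sum(sims) / len(sims)

# Usage: answers sampled from paraphrases of one TruthfulQA question.
answers = [
    "The Great Wall is not visible from space with the naked eye.",
    "No, you cannot see the Great Wall of China from space unaided.",
    "Astronauts report the Great Wall cannot be seen without aid.",
]
print(f"semantic consistency: {semantic_consistency(answers):.3f}")
```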
Related papers
- AXCEL: Automated eXplainable Consistency Evaluation using LLMs [6.382787013075262]
Large Language Models (LLMs) are widely used in both industry and academia for various tasks.
This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL)
AXCEL is a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning.
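A hypothetical sketch of such a prompt-based judge in the spirit of AXCEL; the prompt wording, the 1-5 scale, and the `call_llm` helper are placeholders, not the paper's exact protocol.

```python
# Prompt-based consistency judging: the judge model explains its
# reasoning before emitting a score. All names here are illustrative.
AXCEL_STYLE_PROMPT = """You are checking whether a generated text is consistent with its source.

Source:
{source}

Generated text:
{generation}

First explain, step by step, which claims in the generated text are or are not
supported by the source. Then output a consistency score from 1 (contradicts
the source) to 5 (fully consistent) on a final line as 'Score: <number>'."""

def judge_consistency(source: str, generation: str, call_llm) -> str:
    """Return the judge model's explanation and score as raw text."""
    return call_llm(AXCEL_STYLE_PROMPT.format(source=source, generation=generation))
```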
arXiv Detail & Related papers (2024-09-25T14:45:52Z)
- What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [8.019873464066308]
We introduce two metrics for classification tasks, namely sensitivity and consistency.
Sensitivity measures how predictions change across rephrasings of the prompt.
Consistency measures how predictions vary across rephrasings for elements of the same class.
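One simplified reading of these two metrics, written as plain functions over predicted labels; the exact definitions below are assumptions for illustration, not the paper's formal metrics.

```python
# Simplified, assumed implementations of prompt sensitivity and
# same-class consistency for a classifier.
from collections import Counter

def sensitivity(predictions: list[str]) -> float:
    """Fraction of prompt rephrasings whose prediction deviates from the majority."""
    _, majority_count = Counter(predictions).most_common(1)[0]
    return 1.0 - majority_count / len(predictions)

def consistency(predictions_per_example: list[list[str]]) -> float:
    """Average agreement across rephrasings, over examples of the same class."""
    agreements = []
    for predictions in predictions_per_example:
        _, majority_count = Counter(predictions).most_common(1)[0]
        agreements.append(majority_count / len(predictions))
    return sum(agreements) / len(agreements)
```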
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
- PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
- Contrastive Instruction Tuning [61.97704869248903]
We propose Contrastive Instruction Tuning to maximize the similarity between semantically equivalent instruction-instance pairs.
Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs' robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy.
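A rough sketch of a contrastive objective of this kind, assuming PyTorch and precomputed hidden representations for an anchor instruction, a semantically equivalent paraphrase (positive), and unrelated instructions (negatives); this InfoNCE-style form is an assumption, not necessarily CoIN's exact formulation.

```python
# InfoNCE-style loss pulling semantically equivalent instruction pairs
# together and pushing unrelated instructions apart.
import torch
import torch.nn.functional as F

def contrastive_instruction_loss(anchor, positive, negatives, temperature=0.07):
    anchor = F.normalize(anchor, dim=-1)        # (d,)
    positive = F.normalize(positive, dim=-1)    # (d,)
    negatives = F.normalize(negatives, dim=-1)  # (n, d)
    pos_sim = (anchor @ positive) / temperature            # scalar
    neg_sim = (negatives @ anchor) / temperature            # (n,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])     # (n + 1,)
    # The positive pair sits at index 0 of the logits.
    return F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
```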
arXiv Detail & Related papers (2024-02-17T00:09:32Z)
- Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval for evaluating syntactic capabilities.
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
- Evaluation Metrics of Language Generation Models for Synthetic Traffic Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG).
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
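One illustrative distribution-level comparison, not a metric proposed by the paper: the Jensen-Shannon distance between the word-frequency distributions of generated and real user texts, using scipy.

```python
# Compare generated traffic to real user texts at the distribution
# level rather than pairwise (as BLEU would).
from collections import Counter
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_distance(generated: list[str], real: list[str]) -> float:
    gen_counts = Counter(w for t in generated for w in t.lower().split())
    real_counts = Counter(w for t in real for w in t.lower().split())
    vocab = sorted(set(gen_counts) | set(real_counts))
    p = np.array([gen_counts[w] for w in vocab], dtype=float)
    q = np.array([real_counts[w] for w in vocab], dtype=float)
    return float(jensenshannon(p / p.sum(), q / q.sum()))
```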
arXiv Detail & Related papers (2023-11-21T11:26:26Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate Longest Supported Subsequence (LSS)
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
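The paper finetunes a model to generate the LSS; the token-level longest-common-subsequence computation below is only a crude, non-learned stand-in that illustrates the idea of scoring a claim by how much of it the context can support.

```python
# Fraction of claim tokens covered by the longest (noncontinuous)
# subsequence also present, in order, in the context.
def lcs_support_ratio(claim: str, context: str) -> float:
    a, b = claim.lower().split(), context.lower().split()
    # Classic dynamic-programming longest common subsequence over tokens.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a):
        for j, tok_b in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if tok_a == tok_b else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)] / len(a)
```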
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Semantic Consistency for Assuring Reliability of Large Language Models [9.876355290198639]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks.
We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs.
We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z)
- MetricPrompt: Prompting Model as a Relevance Metric for Few-shot Text Classification [65.51149771074944]
MetricPrompt eases verbalizer design difficulty by reformulating few-shot text classification task into text pair relevance estimation task.
We conduct experiments on three widely used text classification datasets across four few-shot settings.
Results show that MetricPrompt outperforms manual verbalizer and other automatic verbalizer design methods across all few-shot settings.
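A rough sketch of classification via pairwise relevance, assuming a `relevance(text_a, text_b)` scorer backed by a prompting model; the mean-pooling of scores per label is an illustrative choice, not necessarily MetricPrompt's exact aggregation.

```python
# Pick the label whose few-shot examples are, on average, most relevant
# to the query text.
from collections import defaultdict

def classify(query: str, support_set: list[tuple[str, str]], relevance) -> str:
    """support_set holds (text, label) pairs from the few-shot training data."""
    scores = defaultdict(list)
    for text, label in support_set:
        scores[label].append(relevance(query, text))
    return max(scores, key=lambda label: sum(scores[label]) / len(scores[label]))
```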
arXiv Detail & Related papers (2023-06-15T06:51:35Z)
- MURMUR: Modular Multi-Step Reasoning for Semi-Structured Data-to-Text Generation [102.20036684996248]
We propose MURMUR, a neuro-symbolic modular approach to text generation from semi-structured data with multi-step reasoning.
We conduct experiments on two data-to-text generation tasks, WebNLG and LogicNLG.
arXiv Detail & Related papers (2022-12-16T17:36:23Z)
- Towards Computationally Verifiable Semantic Grounding for Language Models [18.887697890538455]
The paper conceptualizes the LM as a conditional model generating text given a desired semantic message formalized as a set of entity-relationship triples.
It embeds the LM in an auto-encoder by feeding its output to a semantic parser whose output is in the same representation domain as the input message.
We show that our proposed approaches significantly improve on the greedy search baseline.
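A schematic view of that auto-encoding consistency check, with `generate_text` and `semantic_parser` as placeholder components; the triple-overlap F1 used here is an illustrative grounding score, not the paper's exact objective.

```python
# Round-trip check: triples -> text (LM) -> triples (parser), then
# compare the recovered triples to the input message.
def grounding_score(input_triples: set[tuple[str, str, str]],
                    generate_text, semantic_parser) -> float:
    text = generate_text(input_triples)    # LM: triples -> text
    recovered = semantic_parser(text)      # parser: text -> triples
    if not recovered:
        return 0.0
    precision = len(recovered & input_triples) / len(recovered)
    recall = len(recovered & input_triples) / len(input_triples)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```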
arXiv Detail & Related papers (2022-11-16T17:35:52Z)