Related papers: AXCEL: Automated eXplainable Consistency Evaluation using LLMs

URL: http://arxiv.org/abs/2409.16984v1
Date: Wed, 25 Sep 2024 14:45:52 GMT
Title: AXCEL: Automated eXplainable Consistency Evaluation using LLMs
Authors: P Aditya Sreekar, Sahil Verma, Suransh Chopra, Sarik Ghazarian, Abhishek Persad, Narayanan Sadagopan
Abstract summary: Large Language Models (LLMs) are widely used in both industry and academia for various tasks. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL). AXCEL is a prompt-based consistency metric that offers explanations for its consistency scores by providing detailed reasoning.
Score: 6.382787013075262
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large Language Models (LLMs) are widely used in both industry and academia for various tasks, yet evaluating the consistency of generated text responses continues to be a challenge. Traditional metrics like ROUGE and BLEU show a weak correlation with human judgment. More sophisticated metrics using Natural Language Inference (NLI) have shown improved correlations but are complex to implement, require domain-specific training due to poor cross-domain generalization, and lack explainability. More recently, prompt-based metrics using LLMs as evaluators have emerged; while they are easier to implement, they still lack explainability and depend on task-specific prompts, which limits their generalizability. This work introduces Automated eXplainable Consistency Evaluation using LLMs (AXCEL), a prompt-based consistency metric which offers explanations for the consistency scores by providing detailed reasoning and pinpointing inconsistent text spans. AXCEL is also a generalizable metric which can be adopted to multiple tasks without changing the prompt. AXCEL outperforms both non-prompt and prompt-based state-of-the-art (SOTA) metrics in detecting inconsistencies across summarization by 8.7%, free text generation by 6.2%, and data-to-text conversion tasks by 29.4%. We also evaluate the influence of underlying LLMs on prompt based metric performance and recalibrate the SOTA prompt-based metrics with the latest LLMs for fair comparison. Further, we show that AXCEL demonstrates strong performance using open source LLMs.

Related papers

Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation [46.697788643450785]
Large language models (LLMs) have been found to produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies.
arXiv Detail & Related papers (2025-10-09T08:22:24Z)
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [80.88654868264645]
The Arranged and Organized Extraction Benchmark (AOE) is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
What Has Been Lost with Synthetic Evaluation? [43.773053236733425]
Large language models (LLMs) are increasingly used for data generation. We investigate whether LLMs can meet demands by generating reasoning-over-text benchmarks. We show that they are less challenging for LLMs than their human-authored counterparts.
arXiv Detail & Related papers (2025-05-28T20:12:32Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [8.019873464066308]
We introduce two metrics for classification tasks, namely sensitivity and consistency. Sensitivity measures changes in predictions across rephrasings of the prompt, whereas consistency measures how predictions vary across rephrasings for elements of the same class.
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models [46.07900122810749]
Large language models (LLMs) have achieved unprecedented performances in various applications, yet evaluating them is still challenging. We contend that utilizing existing relational databases is a promising approach for constructing benchmarks. We propose ERBench, which uses these integrity constraints to convert any database into an LLM benchmark.
arXiv Detail & Related papers (2024-03-08T12:42:36Z)
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
Benchmarking LLMs on the Semantic Overlap Summarization Task [9.656095701778975]
This paper comprehensively evaluates Large Language Models (LLMs) on the Semantic Overlap Summarization (SOS) task. We report well-established metrics like ROUGE, BERTscore, and SEM-F1 on two different datasets of alternative narratives.
arXiv Detail & Related papers (2024-02-26T20:33:50Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
Semantic Consistency for Assuring Reliability of Large Language Models [9.876355290198639]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks. We introduce a general measure of semantic consistency, and formulate multiple versions of this metric to evaluate the performance of various LLMs. We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z)
Large Language Models are Not Yet Human-Level Evaluators for Abstractive Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization. We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z)
Measuring Reliability of Large Language Models through Semantic Consistency [3.4990427823966828]
We develop a measure of semantic consistency that allows the comparison of open-ended text outputs. We implement several versions of this consistency metric to evaluate the performance of a number of PLMs on paraphrased versions of questions.
arXiv Detail & Related papers (2022-11-10T20:21:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.