Related papers: PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations

URL: http://arxiv.org/abs/2405.19740v1
Date: Thu, 30 May 2024 06:38:32 GMT
Title: PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
Authors: Jiatong Li, Renjun Hu, Kunzhe Huang, Yan Zhuang, Qi Liu, Mengxiao Zhu, Xing Shi, Wei Lin,
Abstract summary: We present PertEval, a toolkit for in-depth probing of large language models' knowledge capacity. PertEval employs human-like restatement techniques to generate on-the-fly test samples from static benchmarks. We show that PertEval can act as an essential tool that, when applied alongside any close-ended benchmark, unveils the true knowledge capacity of LLMs.
Score: 22.011216436252845
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Expert-designed close-ended benchmarks serve as vital tools in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through knowledge-invariant perturbations. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of transition analyses that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six state-of-the-art LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 21% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs' uncertainty to specious knowledge, potentially being resolved through rote memorization and leading to inflated performance. We also find that the detailed transition analyses by PertEval could illuminate weaknesses in existing LLMs' knowledge mastery and guide the development of refinement. Given these insights, we posit that PertEval can act as an essential tool that, when applied alongside any close-ended benchmark, unveils the true knowledge capacity of LLMs, marking a significant step toward more trustworthy LLM evaluation.

Related papers

Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness [44.37155305736321]
Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs)<n>We propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge.<n>Our framework provides a more realistic and rigorous assessment of unlearning performance.
arXiv Detail & Related papers (2025-06-06T04:35:19Z)
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception [58.62352010928591]
Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries. This paper explores leveraging LLMs' internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives.
arXiv Detail & Related papers (2025-02-17T11:11:09Z)
UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models [41.67393607081513]
Large Language Models (LLMs) often struggle to accurately express the factual knowledge they possess. We propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries. We show that the proposed UAlign can significantly enhance the LLMs' capacities to confidently answer known questions.
arXiv Detail & Related papers (2024-12-16T14:14:27Z)
The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum. We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z)
Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse. Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration. To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
How Reliable are LLMs as Knowledge Bases? Re-thinking Facutality and Consistency [60.25969380388974]
Large Language Models (LLMs) are increasingly explored as knowledge bases (KBs) Current evaluation methods focus too narrowly on knowledge retention, overlooking other crucial criteria for reliable performance. We propose new criteria and metrics to quantify factuality and consistency, leading to a final reliability score.
arXiv Detail & Related papers (2024-07-18T15:20:18Z)
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z)
UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating large language models. It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities. We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding.
arXiv Detail & Related papers (2024-06-18T16:50:38Z)
CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs) Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs. Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z)
Is Factuality Enhancement a Free Lunch For LLMs? Better Factuality Can Lead to Worse Context-Faithfulness [39.74642729786543]
We argue that current factuality enhancement methods can significantly undermine context-faithfulness of large language models (LLMs) Experiments reveal that while these methods may yield inconsistent improvements in factual accuracy, they also cause a more severe decline in context-faithfulness.
arXiv Detail & Related papers (2024-03-30T02:08:28Z)
Certifying Knowledge Comprehension in LLMs [3.6293956720749425]
We introduce the first specification and certification framework for knowledge comprehension in Large Language Models (LLMs) Instead of a fixed dataset, we design novel specifications that mathematically represent prohibitively large probability distributions of knowledge comprehension prompts with natural noise. We apply our framework to certify SOTA LLMs in two domains: precision medicine and general question-answering.
arXiv Detail & Related papers (2024-02-24T23:16:57Z)
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases. We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets. Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks. How do we evaluate the capabilities of LLMs to consistently produce factually correct answers? We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.