PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
- URL: http://arxiv.org/abs/2405.19740v2
- Date: Fri, 18 Oct 2024 06:57:08 GMT
- Title: PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge-Invariant Perturbations
- Authors: Jiatong Li, Renjun Hu, Kunzhe Huang, Yan Zhuang, Qi Liu, Mengxiao Zhu, Xing Shi, Wei Lin,
- Abstract summary: We present PertEval, a toolkit for probing large language models' knowledge capacity.
PertEval employs human-like restatement techniques to generate on-the-fly test samples from static benchmarks.
Our findings provide insights for advancing more robust and genuinely knowledgeable LLMs.
- Score: 22.011216436252845
- License:
- Abstract: Expert-designed close-ended benchmarks are indispensable in assessing the knowledge capacity of large language models (LLMs). Despite their widespread use, concerns have mounted regarding their reliability due to limited test scenarios and an unavoidable risk of data contamination. To rectify this, we present PertEval, a toolkit devised for in-depth probing of LLMs' knowledge capacity through \textbf{knowledge-invariant perturbations}. These perturbations employ human-like restatement techniques to generate on-the-fly test samples from static benchmarks, meticulously retaining knowledge-critical content while altering irrelevant details. Our toolkit further includes a suite of \textbf{response consistency analyses} that compare performance on raw vs. perturbed test sets to precisely assess LLMs' genuine knowledge capacity. Six representative LLMs are re-evaluated using PertEval. Results reveal significantly inflated performance of the LLMs on raw benchmarks, including an absolute 25.8% overestimation for GPT-4. Additionally, through a nuanced response pattern analysis, we discover that PertEval retains LLMs' uncertainty to specious knowledge, and reveals their potential rote memorization to correct options which leads to overestimated performance. We also find that the detailed response consistency analyses by PertEval could illuminate various weaknesses in existing LLMs' knowledge mastery and guide the development of refinement. Our findings provide insights for advancing more robust and genuinely knowledgeable LLMs. Our code is available at \url{https://github.com/aigc-apps/PertEval}.
Related papers
- Beyond Binary: Towards Fine-Grained LLM-Generated Text Detection via Role Recognition and Involvement Measurement [51.601916604301685]
Large language models (LLMs) generate content that can undermine trust in online discourse.
Current methods often focus on binary classification, failing to address the complexities of real-world scenarios like human-AI collaboration.
To move beyond binary classification and address these challenges, we propose a new paradigm for detecting LLM-generated content.
arXiv Detail & Related papers (2024-10-18T08:14:10Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Finding Blind Spots in Evaluator LLMs with Interpretable Checklists [23.381287828102995]
We investigate the effectiveness of Large Language Models (LLMs) as evaluators for text generation tasks.
We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities.
arXiv Detail & Related papers (2024-06-19T10:59:48Z) - UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions [10.28688988951815]
UBENCH is a benchmark for evaluating large language models.
It includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities.
We also evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding.
arXiv Detail & Related papers (2024-06-18T16:50:38Z) - CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs)
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z) - Is Factuality Enhancement a Free Lunch For LLMs? Better Factuality Can Lead to Worse Context-Faithfulness [39.74642729786543]
We argue that current factuality enhancement methods can significantly undermine context-faithfulness of large language models (LLMs)
Experiments reveal that while these methods may yield inconsistent improvements in factual accuracy, they also cause a more severe decline in context-faithfulness.
arXiv Detail & Related papers (2024-03-30T02:08:28Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.