TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
- URL: http://arxiv.org/abs/2406.01855v1
- Date: Tue, 4 Jun 2024 00:01:35 GMT
- Title: TruthEval: A Dataset to Evaluate LLM Truthfulness and Reliability
- Authors: Aisha Khatun, Daniel G. Brown,
- Abstract summary: We present a curated collection of challenging statements on sensitive topics for benchmarking called TruthEval.
These statements were curated by hand and contain known truth values.
We perform some initial analyses using this dataset and find several instances of LLMs failing in simple tasks showing their inability to understand simple questions.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Model (LLM) evaluation is currently one of the most important areas of research, with existing benchmarks proving to be insufficient and not completely representative of LLMs' various capabilities. We present a curated collection of challenging statements on sensitive topics for LLM benchmarking called TruthEval. These statements were curated by hand and contain known truth values. The categories were chosen to distinguish LLMs' abilities from their stochastic nature. We perform some initial analyses using this dataset and find several instances of LLMs failing in simple tasks showing their inability to understand simple questions.
Related papers
- Decompose and Aggregate: A Step-by-Step Interpretable Evaluation Framework [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
The question of how reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z) - CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs)
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z) - Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators as they exhibit familiarity bias and show skewed distributions of ratings.
We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z) - Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z) - State of What Art? A Call for Multi-Prompt LLM Evaluation [28.307860675006545]
We comprehensively analyze the brittleness of results obtained via single-prompt evaluations across 6.5M instances.
To improve robustness of the analysis, we propose to evaluate LLMs with a set of diverse prompts instead.
arXiv Detail & Related papers (2023-12-31T22:21:36Z) - NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language
Models via Complexity Classes [32.154637177467684]
NPHardEval is designed to evaluate the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of 900 questions.
It is meticulously chosen to represent a wide range of complexity class below the NP-hard complexity class.
It is designed with a dynamic update mechanism, where the datapoints are refreshed on a monthly basis.
arXiv Detail & Related papers (2023-12-22T18:07:44Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z) - Sentiment Analysis in the Era of Large Language Models: A Reality Check [69.97942065617664]
This paper investigates the capabilities of large language models (LLMs) in performing various sentiment analysis tasks.
We evaluate performance across 13 tasks on 26 datasets and compare the results against small language models (SLMs) trained on domain-specific datasets.
arXiv Detail & Related papers (2023-05-24T10:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.