To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity
- URL: http://arxiv.org/abs/2407.17125v3
- Date: Fri, 4 Oct 2024 14:36:36 GMT
- Title: To Know or Not To Know? Analyzing Self-Consistency of Large Language Models under Ambiguity
- Authors: Anastasiia Sedova, Robert Litschko, Diego Frassinelli, Benjamin Roth, Barbara Plank,
- Abstract summary: This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities.
Experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts.
- Score: 27.10502683001428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the major aspects contributing to the striking performance of large language models (LLMs) is the vast amount of factual knowledge accumulated during pre-training. Yet, many LLMs suffer from self-inconsistency, which raises doubts about their trustworthiness and reliability. This paper focuses on entity type ambiguity, analyzing the proficiency and consistency of state-of-the-art LLMs in applying factual knowledge when prompted with ambiguous entities. To do so, we propose an evaluation protocol that disentangles knowing from applying knowledge, and test state-of-the-art LLMs on 49 ambiguous entities. Our experiments reveal that LLMs struggle with choosing the correct entity reading, achieving an average accuracy of only 85%, and as low as 75% with underspecified prompts. The results also reveal systematic discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trustworthy LLMs.
Related papers
- CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models [60.59638232596912]
We introduce CLAMBER, a benchmark for evaluating large language models (LLMs)
Building upon the taxonomy, we construct 12K high-quality data to assess the strengths, weaknesses, and potential risks of various off-the-shelf LLMs.
Our findings indicate the limited practical utility of current LLMs in identifying and clarifying ambiguous user queries.
arXiv Detail & Related papers (2024-05-20T14:34:01Z) - Towards Logically Consistent Language Models via Probabilistic Reasoning [14.317886666902822]
Large language models (LLMs) are a promising venue for natural language understanding and generation tasks.
LLMs are prone to generate non-factual information and to contradict themselves when prompted to reason about beliefs of the world.
We introduce a training objective that teaches a LLM to be consistent with external knowledge in the form of a set of facts and rules.
arXiv Detail & Related papers (2024-04-19T12:23:57Z) - FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - The Calibration Gap between Model and Human Confidence in Large Language
Models [14.539888672603743]
Large language models (LLMs) need to be well-calibrated in the sense that they can accurately assess and communicate how likely it is that their predictions are correct.
Recent work has focused on the quality of internal LLM confidence assessments.
This paper explores the disparity between external human confidence in an LLM's responses and the internal confidence of the model.
arXiv Detail & Related papers (2024-01-24T22:21:04Z) - Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z) - TrustLLM: Trustworthiness in Large Language Models [446.5640421311468]
This paper introduces TrustLLM, a comprehensive study of trustworthiness in large language models (LLMs)
We first propose a set of principles for trustworthy LLMs that span eight different dimensions.
Based on these principles, we establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics.
arXiv Detail & Related papers (2024-01-10T22:07:21Z) - Examining LLMs' Uncertainty Expression Towards Questions Outside
Parametric Knowledge [35.067234242461545]
Large language models (LLMs) express uncertainty in situations where they lack sufficient parametric knowledge to generate reasonable responses.
This work aims to systematically investigate LLMs' behaviors in such situations, emphasizing the trade-off between honesty and helpfulness.
arXiv Detail & Related papers (2023-11-16T10:02:40Z) - Are Large Language Models Reliable Judges? A Study on the Factuality
Evaluation Capabilities of LLMs [8.526956860672698]
Large Language Models (LLMs) have gained immense attention due to their notable emergent capabilities.
This study investigates the potential of LLMs as reliable assessors of factual consistency in summaries generated by text-generation models.
arXiv Detail & Related papers (2023-11-01T17:42:45Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and
Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.