Empirical evaluation of Uncertainty Quantification in
Retrieval-Augmented Language Models for Science
- URL: http://arxiv.org/abs/2311.09358v1
- Date: Wed, 15 Nov 2023 20:42:11 GMT
- Title: Empirical evaluation of Uncertainty Quantification in
Retrieval-Augmented Language Models for Science
- Authors: Sridevi Wagle, Sai Munikoti, Anurag Acharya, Sara Smith, Sameera
Horawalavithana
- Abstract summary: This study investigates how uncertainty scores vary when scientific knowledge is incorporated as pretraining and retrieval data.
We observe that an existing RALM finetuned with scientific knowledge as the retrieval data tends to be more confident in generating predictions.
We also found that RALMs are overconfident in their predictions, making inaccurate predictions more confidently than accurate ones.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown remarkable achievements in natural
language processing tasks, producing high-quality outputs. However, LLMs still
exhibit limitations, including the generation of factually incorrect
information. In safety-critical applications, it is important to assess the
confidence of LLM-generated content to make informed decisions. Retrieval
Augmented Language Models (RALMs) are a relatively new area of research in NLP.
RALMs offer potential benefits for scientific NLP tasks, as retrieved
documents can serve as evidence to support model-generated content. This
inclusion of evidence enhances trustworthiness, as users can verify and explore
the retrieved documents to validate model outputs. Quantifying uncertainty in
RALM generations further improves trustworthiness, with retrieved text and
confidence scores contributing to a comprehensive and reliable model for
scientific applications. However, there is little to no research on uncertainty
quantification (UQ) for RALMs, particularly in scientific contexts. This study
aims to address this gap
by conducting a comprehensive evaluation of UQ in RALMs, focusing on scientific
tasks. This research investigates how uncertainty scores vary when scientific
knowledge is incorporated as pretraining and retrieval data and explores the
relationship between uncertainty scores and the accuracy of model-generated
outputs. We observe that an existing RALM finetuned with scientific knowledge
as the retrieval data tends to be more confident in generating predictions
compared to the model pretrained only with scientific knowledge. We also found
that RALMs are overconfident in their predictions, making inaccurate
predictions more confidently than accurate ones. Scientific knowledge provided
either as pretraining or retrieval corpus does not help alleviate this issue.
We released our code, data and dashboards at https://github.com/pnnl/EXPERT2.
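A rough sketch of the confidence-versus-accuracy comparison discussed in the abstract is given below; the length-normalized scoring heuristic, field names, and toy data are illustrative assumptions, not the released EXPERT2 pipeline.

```python
import math

def sequence_confidence(token_logprobs):
    """Length-normalized confidence: geometric mean of token probabilities.

    A generic heuristic score in (0, 1]; not necessarily the exact UQ
    metric evaluated in the paper.
    """
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

def overconfidence_gap(predictions):
    """Compare mean confidence of inaccurate vs. accurate generations.

    predictions: iterable of dicts with 'token_logprobs' (natural-log
    probabilities of the generated tokens) and 'correct' (bool).
    A positive gap mirrors the overconfidence pattern described above:
    wrong answers generated with higher confidence than right ones.
    """
    right, wrong = [], []
    for p in predictions:
        score = sequence_confidence(p["token_logprobs"])
        (right if p["correct"] else wrong).append(score)
    mean = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return mean(wrong) - mean(right)

# Toy usage with hypothetical per-token log-probabilities:
preds = [
    {"token_logprobs": [-0.1, -0.2], "correct": True},
    {"token_logprobs": [-0.05, -0.05], "correct": False},
]
print(overconfidence_gap(preds))  # > 0 indicates overconfidence
```

The geometric-mean normalization keeps long and short generations comparable; the paper may use different or additional uncertainty scores.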
Related papers
- Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception [58.62352010928591]
Large language models (LLMs) exhibit impressive performance across diverse tasks but often struggle to accurately gauge their knowledge boundaries.
This paper explores leveraging LLMs' internal states to enhance their perception of knowledge boundaries from efficiency and risk perspectives.
arXiv Detail & Related papers (2025-02-17T11:11:09Z) - Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
arXiv Detail & Related papers (2024-11-29T12:21:15Z) - Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators [6.403926452181712]
Large Language Models (LLMs) tend to be unreliable in the factuality of their answers.
We present a survey and empirical comparison of estimators of factual confidence.
Our experiments indicate that trained hidden-state probes provide the most reliable confidence estimates.
arXiv Detail & Related papers (2024-06-19T10:11:37Z) - RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing [0.2302001830524133]
This survey paper addresses the absence of a comprehensive overview of Retrieval-Augmented Language Models (RALMs).
The paper discusses the essential components of RALMs, including Retrievers, Language Models, and Augmentations.
RALMs demonstrate utility in a spectrum of tasks, from translation and dialogue systems to knowledge-intensive applications.
arXiv Detail & Related papers (2024-04-30T13:14:51Z) - TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z) - Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models [54.55088169443828]
Chain-of-Noting (CoN) is a novel approach aimed at improving the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios.
CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
arXiv Detail & Related papers (2023-11-15T18:54:53Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Evaluating the Effectiveness of Retrieval-Augmented Large Language
Models in Scientific Document Reasoning [0.0]
Large Language Models (LLMs) often provide seemingly plausible but non-factual information, commonly referred to as hallucinations.
Retrieval-augmented LLMs provide a non-parametric approach to solve these issues by retrieving relevant information from external data sources.
We critically evaluate these models in their ability to perform in scientific document reasoning tasks.
arXiv Detail & Related papers (2023-11-07T21:09:57Z) - Improving the Reliability of Large Language Models by Leveraging
Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z) - FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs still fall short of faithfully detecting factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z) - Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [37.63939774027709]
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities.
We propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment.
Results reveal that a simple measure of semantic dispersion can be a reliable predictor of the quality of LLM responses (a minimal sketch of such a measure follows this entry).
arXiv Detail & Related papers (2023-05-30T16:31:26Z)