Empirical evaluation of Uncertainty Quantification in
Retrieval-Augmented Language Models for Science
- URL: http://arxiv.org/abs/2311.09358v1
- Date: Wed, 15 Nov 2023 20:42:11 GMT
- Title: Empirical evaluation of Uncertainty Quantification in
Retrieval-Augmented Language Models for Science
- Authors: Sridevi Wagle, Sai Munikoti, Anurag Acharya, Sara Smith, Sameera
Horawalavithana
- Abstract summary: This study investigates how uncertainty scores vary when scientific knowledge is incorporated as pretraining and retrieval data.
We observe that an existing RALM finetuned with scientific knowledge as the retrieval data tends to be more confident in generating predictions.
We also found that RALMs are overconfident in their predictions, making inaccurate predictions more confidently than accurate ones.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have shown remarkable achievements in natural
language processing tasks, producing high-quality outputs. However, LLMs still
exhibit limitations, including the generation of factually incorrect
information. In safety-critical applications, it is important to assess the
confidence of LLM-generated content to make informed decisions. Retrieval-Augmented
Language Models (RALMs) are a relatively new area of research in NLP.
RALMs offer potential benefits for scientific NLP tasks, as retrieved
documents can serve as evidence to support model-generated content. This
inclusion of evidence enhances trustworthiness, as users can verify and explore
the retrieved documents to validate model outputs. Quantifying uncertainty in
RALM generations further improves trustworthiness, with retrieved text and
confidence scores contributing to a comprehensive and reliable model for
scientific applications. However, there is little to no research on uncertainty
quantification (UQ) for RALMs, particularly in scientific contexts. This study aims to address this gap
by conducting a comprehensive evaluation of UQ in RALMs, focusing on scientific
tasks. This research investigates how uncertainty scores vary when scientific
knowledge is incorporated as pretraining and retrieval data and explores the
relationship between uncertainty scores and the accuracy of model-generated
outputs. We observe that an existing RALM finetuned with scientific knowledge
as the retrieval data tends to be more confident in generating predictions
compared to the model pretrained only with scientific knowledge. We also found
that RALMs are overconfident in their predictions, making inaccurate
predictions more confidently than accurate ones. Scientific knowledge provided
either as pretraining or retrieval corpus does not help alleviate this issue.
We released our code, data and dashboards at https://github.com/pnnl/EXPERT2.
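The abstract does not specify which uncertainty score is evaluated, so the following is only a minimal sketch that assumes a standard sequence-level confidence measure: the length-normalized log-probability of the generated tokens (the geometric mean of per-token probabilities). The function name and the example log-probability values are illustrative, not taken from the paper.

```python
import math

def sequence_confidence(token_logprobs):
    """Length-normalized log-probability of a generated sequence.

    token_logprobs: per-token natural-log probabilities reported by the
    decoder for the tokens it actually generated. Returns a value in
    (0, 1]; higher means the model was more confident in its own output.
    """
    if not token_logprobs:
        raise ValueError("empty generation")
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)  # geometric mean of token probabilities

# Illustrative (made-up) values only. The paper's finding is that RALMs
# can assign *higher* confidence to inaccurate answers than to accurate ones.
accurate_answer_logprobs   = [-0.21, -0.35, -0.10, -0.48]
inaccurate_answer_logprobs = [-0.05, -0.12, -0.08, -0.03]
print(sequence_confidence(accurate_answer_logprobs))    # ~0.75
print(sequence_confidence(inaccurate_answer_logprobs))  # ~0.93
```

Under a score of this kind, the reported overconfidence corresponds to inaccurate generations receiving confidence values at least as high as accurate ones.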
Related papers
- Quantitative Insights into Language Model Usage and Trust in Academia: An Empirical Study [29.750000639372203]
There is a notable gap in quantitative evidence regarding the extent of LM usage, user trust in their outputs, and issues to prioritize for real-world development.
This study surveyed 125 individuals at a private school and secured 88 data points after pre-processing.
Through both quantitative analysis and qualitative evidence, we found a significant variation in trust levels, which are strongly related to usage time and frequency.
arXiv Detail & Related papers (2024-09-13T20:45:50Z) - RAG and RAU: A Survey on Retrieval-Augmented Language Model in Natural Language Processing [0.2302001830524133]
This survey paper addresses the absence of a comprehensive overview of Retrieval-Augmented Language Models (RALMs).
The paper discusses the essential components of RALMs, including Retrievers, Language Models, and Augmentations.
RALMs demonstrate utility in a spectrum of tasks, from translation and dialogue systems to knowledge-intensive applications.
arXiv Detail & Related papers (2024-04-30T13:14:51Z) - TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness [58.721012475577716]
Large Language Models (LLMs) have demonstrated impressive capabilities across various domains, prompting a surge in their practical applications.
This paper introduces TrustScore, a framework based on the concept of Behavioral Consistency, which evaluates whether an LLM's response aligns with its intrinsic knowledge.
arXiv Detail & Related papers (2024-02-19T21:12:14Z) - Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models [54.55088169443828]
Chain-of-Noting (CoN) is a novel approach aimed at improving the robustness of RALMs in facing noisy, irrelevant documents and in handling unknown scenarios.
CoN achieves an average improvement of +7.9 in EM score given entirely noisy retrieved documents and +10.5 in rejection rates for real-time questions that fall outside the pre-training knowledge scope.
arXiv Detail & Related papers (2023-11-15T18:54:53Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions; a minimal sketch of this idea appears after this list.
arXiv Detail & Related papers (2023-11-15T05:58:35Z) - Evaluating the Effectiveness of Retrieval-Augmented Large Language Models in Scientific Document Reasoning [0.0]
Large Language Models (LLMs) often provide seemingly plausible but non-factual information, commonly referred to as hallucinations.
Retrieval-augmented LLMs provide a non-parametric approach to solve these issues by retrieving relevant information from external data sources.
We critically evaluate these models on their ability to perform scientific document reasoning tasks.
arXiv Detail & Related papers (2023-11-07T21:09:57Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - Improving the Reliability of Large Language Models by Leveraging Uncertainty-Aware In-Context Learning [76.98542249776257]
Large-scale language models often face the challenge of "hallucination".
We introduce an uncertainty-aware in-context learning framework to empower the model to enhance or reject its output in response to uncertainty.
arXiv Detail & Related papers (2023-10-07T12:06:53Z) - FELM: Benchmarking Factuality Evaluation of Large Language Models [40.78878196872095]
We introduce a benchmark for Factuality Evaluation of large Language Models, referred to as felm.
We collect responses generated from large language models and annotate factuality labels in a fine-grained manner.
Our findings reveal that while retrieval aids factuality evaluation, current LLMs are still far from able to faithfully detect factual errors.
arXiv Detail & Related papers (2023-10-01T17:37:31Z) - Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models [37.63939774027709]
Large language models (LLMs) specializing in natural language generation (NLG) have recently started exhibiting promising capabilities.
We propose and compare several confidence/uncertainty measures, applying them to *selective NLG* where unreliable results could either be ignored or yielded for further assessment.
Results reveal that a simple measure for the semantic dispersion can be a reliable predictor of the quality of LLM responses.
arXiv Detail & Related papers (2023-05-30T16:31:26Z) - Context-faithful Prompting for Large Language Models [51.194410884263135]
Large language models (LLMs) encode parametric knowledge about world facts.
Their reliance on parametric knowledge may cause them to overlook contextual cues, leading to incorrect predictions in context-sensitive NLP tasks.
We assess and enhance LLMs' contextual faithfulness in two aspects: knowledge conflict and prediction with abstention.
arXiv Detail & Related papers (2023-03-20T17:54:58Z)
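Note on the input-clarification-ensembling entry above (Decomposing Uncertainty for Large Language Models): the abstract only outlines the idea, so the Python sketch below is an assumption about how such an ensemble could be wired up, not the authors' implementation. The clarification_ensemble name and the clarify/answer callables are hypothetical placeholders for LLM calls.

```python
from collections import Counter

def clarification_ensemble(question, clarify, answer, k=5):
    """Hypothetical sketch of input-clarification ensembling.

    clarify(question, k) -> list of k disambiguated rewrites of `question`
    answer(clarified_question) -> the model's answer string

    Both callables are placeholders; per the abstract, the clarified
    inputs are fed into an LLM and the corresponding predictions are
    ensembled.
    """
    clarifications = clarify(question, k)
    answers = [answer(c) for c in clarifications]
    votes = Counter(answers)
    top_answer, top_count = votes.most_common(1)[0]
    # Low agreement across clarified inputs suggests the uncertainty stems
    # from ambiguity in the input rather than from the model itself.
    agreement = top_count / len(answers)
    return top_answer, agreement
```

Under this reading, low agreement across clarified inputs points to ambiguity in the input as the source of uncertainty, whereas disagreement under a single fixed clarification would be attributable to the model itself.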