Related papers: CONFLARE: CONFormal LArge language model REtrieval

CONFLARE: CONFormal LArge language model REtrieval

URL: http://arxiv.org/abs/2404.04287v1
Date: Thu, 4 Apr 2024 02:58:21 GMT
Title: CONFLARE: CONFormal LArge language model REtrieval
Authors: Pouria Rouzrokh, Shahriar Faghani, Cooper U. Gamble, Moein Shariatnia, Bradley J. Erickson,
Abstract summary: Retrieval-augmented generation (RAG) frameworks enable large language models (LLMs) to retrieve relevant information from a knowledge base and incorporate it into the context for generating responses. RAG does not guarantee valid responses if retrieval fails to identify the necessary information as the context for response generation. We introduce a four-step framework for applying conformal prediction to quantify retrieval uncertainty in RAG frameworks.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval-augmented generation (RAG) frameworks enable large language models (LLMs) to retrieve relevant information from a knowledge base and incorporate it into the context for generating responses. This mitigates hallucinations and allows for the updating of knowledge without retraining the LLM. However, RAG does not guarantee valid responses if retrieval fails to identify the necessary information as the context for response generation. Also, if there is contradictory content, the RAG response will likely reflect only one of the two possible responses. Therefore, quantifying uncertainty in the retrieval process is crucial for ensuring RAG trustworthiness. In this report, we introduce a four-step framework for applying conformal prediction to quantify retrieval uncertainty in RAG frameworks. First, a calibration set of questions answerable from the knowledge base is constructed. Each question's embedding is compared against document embeddings to identify the most relevant document chunks containing the answer and record their similarity scores. Given a user-specified error rate ({\alpha}), these similarity scores are then analyzed to determine a similarity score cutoff threshold. During inference, all chunks with similarity exceeding this threshold are retrieved to provide context to the LLM, ensuring the true answer is captured in the context with a (1-{\alpha}) confidence level. We provide a Python package that enables users to implement the entire workflow proposed in our work, only using LLMs and without human intervention.

Related papers

CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.<n>It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.<n>We introduce VerifierBench benchmark comprising model outputs collected from multiple data sources, augmented through manual analysis of metaerror patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation [108.13261761812517]
We introduce FRANQ (Faithfulness-based Retrieval Augmented UNcertainty Quantification), a novel method for hallucination detection in RAG outputs.<n>We present a new long-form Question Answering (QA) dataset annotated for both factuality and faithfulness.
arXiv Detail & Related papers (2025-05-27T11:56:59Z)
SAGE: A Framework of Precise Retrieval for RAG [9.889395372896153]
Retrieval-augmented generation (RAG) has demonstrated significant proficiency in conducting question-answering tasks. RAG methods segment the corpus without considering semantics, making it difficult to find relevant context. We introduce a RAG framework (SAGE) to overcome these limitations.
arXiv Detail & Related papers (2025-03-03T16:25:58Z)
Riddle Me This! Stealthy Membership Inference for Retrieval-Augmented Generation [18.098228823748617]
We present Interrogation Attack (IA), a membership inference technique targeting documents in the RAG datastore. We demonstrate successful inference with just 30 queries while remaining stealthy. We observe a 2x improvement in TPR@1%FPR over prior inference attacks across diverse RAG configurations.
arXiv Detail & Related papers (2025-02-01T04:01:18Z)
Aligning Large Language Models for Faithful Integrity Against Opposing Argument [71.33552795870544]
Large Language Models (LLMs) have demonstrated impressive capabilities in complex reasoning tasks. They can be easily misled by unfaithful arguments during conversations, even when their original statements are correct. We propose a novel framework, named Alignment for Faithful Integrity with Confidence Estimation.
arXiv Detail & Related papers (2025-01-02T16:38:21Z)
ChunkRAG: Novel LLM-Chunk Filtering Method for RAG Systems [2.8692611791027893]
Retrieval-Augmented Generation (RAG) systems generate inaccurate responses due to the retrieval of irrelevant or loosely related information. We propose ChunkRAG, a framework that enhances RAG systems by evaluating and filtering retrieved information at the chunk level.
arXiv Detail & Related papers (2024-10-25T14:07:53Z)
RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions [52.33835101586687]
Conversational AI agents use Retrieval Augmented Generation (RAG) to provide verifiable document-grounded responses to user inquiries. This paper presents a novel synthetic data generation method to efficiently create a diverse set of context-grounded confusing questions from a given document corpus.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation [72.70046559930555]
We propose a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA tasks. Specifically, Adaptive-Note introduces an overarching view of knowledge growth, iteratively gathering new information in the form of notes. In addition, we employ an adaptive, note-based stop-exploration strategy to decide "what to retrieve and when to stop" to encourage sufficient knowledge exploration.
arXiv Detail & Related papers (2024-10-11T14:03:29Z)
RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
Optimization of Retrieval-Augmented Generation Context with Outlier Detection [0.0]
We focus on methods to reduce the size and improve the quality of the prompt context required for question-answering systems. Our goal is to select the most semantically relevant documents, treating the discarded ones as outliers. It was found that the greatest improvements were achieved with increasing complexity of the questions and answers.
arXiv Detail & Related papers (2024-07-01T15:53:29Z)
Model Internals-based Answer Attribution for Trustworthy Retrieval-Augmented Generation [8.975024781390077]
We present MIRAGE --Model Internals-based RAG Explanations -- a plug-and-play approach using model internals for faithful answer attribution in question answering applications. We evaluate our proposed approach on a multilingual QA dataset, finding high agreement with human answer attribution.
arXiv Detail & Related papers (2024-06-19T16:10:26Z)
RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Generation [5.10832476049103]
We propose a relevance estimator (RE) that provides relative relevance between contexts as previous rerankers did. We show that RE trained with a small generator (sLM) can not only improve the sLM fine-tuned together with RE but also improve previously unreferenced large language models.
arXiv Detail & Related papers (2024-06-09T14:11:19Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Mitigating LLM Hallucinations via Conformal Abstention [70.83870602967625]
We develop a principled procedure for determining when a large language model should abstain from responding in a general domain. We leverage conformal prediction techniques to develop an abstention procedure that benefits from rigorous theoretical guarantees on the hallucination rate (error rate) Experimentally, our resulting conformal abstention method reliably bounds the hallucination rate on various closed-book, open-domain generative question answering datasets.
arXiv Detail & Related papers (2024-04-04T11:32:03Z)
Query Performance Prediction using Relevance Judgments Generated by Large Language Models [53.97064615557883]
We propose a QPP framework using automatically generated relevance judgments (QPP-GenRE) QPP-GenRE decomposes QPP into independent subtasks of predicting relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels.
arXiv Detail & Related papers (2024-04-01T09:33:05Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.