NoMIRACL: Knowing When You Don't Know for Robust Multilingual
Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2312.11361v2
- Date: Mon, 4 Mar 2024 16:32:10 GMT
- Title: NoMIRACL: Knowing When You Don't Know for Robust Multilingual
Retrieval-Augmented Generation
- Authors: Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan
Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi
Rezagholizadeh, Jimmy Lin
- Abstract summary: Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure robustness using two metrics: (i) hallucination rate, the tendency of a model to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the tendency of a model to fail to recognize relevant passages in the relevant subset.
- Score: 92.5132418788568
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) grounds large language model (LLM)
output by leveraging external knowledge sources to reduce factual
hallucinations. However, prior works lack a comprehensive evaluation of
different language families, making it challenging to evaluate LLM robustness
against errors in external retrieved knowledge. To overcome this, we establish
NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across
18 typologically diverse languages. NoMIRACL includes both a non-relevant and a
relevant subset. Queries in the non-relevant subset contain passages judged as
non-relevant, whereas queries in the relevant subset include at least a single
judged relevant passage. We measure LLM robustness using two metrics: (i)
hallucination rate, the tendency of a model to hallucinate an answer when no
answer is present in the passages of the non-relevant subset, and (ii) error
rate, the tendency of a model to fail to recognize relevant passages in the
relevant subset. In our work, we measure robustness for a wide variety of
multilingual-focused LLMs and observe that most of the models struggle to
balance the two capacities. Models such as LLAMA-2, Orca-2, and FLAN-T5 exhibit
hallucination rates above 88% on the non-relevant subset, whereas Mistral
hallucinates less overall but reaches an error rate of up to 74.9% on the
relevant subset. Overall, GPT-4 provides the best tradeoff on both subsets,
highlighting the future work needed to improve LLM robustness.
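
Both metrics reduce to simple proportions over the two NoMIRACL subsets. Below is a minimal Python sketch, assuming each model response has already been classified as either an attempted answer or an explicit "no answer" statement; the label names and function signatures are illustrative assumptions, not the paper's released code.

```python
from typing import List

# Minimal sketch of the two NoMIRACL robustness metrics, assuming each model
# response has already been mapped to one of two labels:
#   "no_answer" -> the model states that none of the passages answer the query
#   "answer"    -> the model attempts to answer from the passages
# Label names and signatures are illustrative, not the paper's released code.


def hallucination_rate(non_relevant_labels: List[str]) -> float:
    """Share of non-relevant-subset queries where the model hallucinates an
    answer even though no relevant passage was retrieved."""
    if not non_relevant_labels:
        return 0.0
    return sum(label == "answer" for label in non_relevant_labels) / len(non_relevant_labels)


def error_rate(relevant_labels: List[str]) -> float:
    """Share of relevant-subset queries where the model fails to recognize
    that at least one retrieved passage is relevant."""
    if not relevant_labels:
        return 0.0
    return sum(label == "no_answer" for label in relevant_labels) / len(relevant_labels)


# Toy example: a model that answers 9 of 10 non-relevant queries and rejects
# 3 of 10 relevant queries.
print(hallucination_rate(["answer"] * 9 + ["no_answer"]))  # 0.9
print(error_rate(["no_answer"] * 3 + ["answer"] * 7))      # 0.3
```

A robust model keeps both numbers low, declining to answer on the non-relevant subset while still recognizing the judged relevant passages on the relevant subset.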
Related papers
- Preference Leakage: A Contamination Problem in LLM-as-a-judge [69.96778498636071]
Large Language Models (LLMs) as judges and LLM-based data synthesis have emerged as two fundamental LLM-driven data annotation methods.
In this work, we expose preference leakage, a contamination problem in LLM-as-a-judge caused by the relatedness between the synthetic data generators and LLM-based evaluators.
arXiv Detail & Related papers (2025-02-03T17:13:03Z)
- Addressing Hallucinations with RAG and NMISS in Italian Healthcare LLM Chatbots [0.0]
I combine detection and mitigation techniques to address hallucinations in Large Language Models (LLMs).
Mitigation is achieved in a question-answering Retrieval-Augmented Generation (RAG) framework, while detection is obtained by introducing the Negative Missing Information Scoring System (NMISS).
This combined approach offers new insights into the reduction and more accurate assessment of hallucinations in LLMs, with applications in real-world healthcare tasks and other domains.
arXiv Detail & Related papers (2024-12-05T15:11:12Z)
- Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization.
Our evaluation reveals subtleties as to what constitutes a hallucination.
We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Detecting Hallucination and Coverage Errors in Retrieval Augmented Generation for Controversial Topics [16.874364446070967]
We explore a strategy to handle controversial topics in LLM-based chatbots based on Wikipedia's Neutral Point of View (NPOV) principle.
We use a deterministic retrieval system and then focus on common LLM failure modes that arise during this approach to text generation, namely hallucination and coverage errors.
We show that our methods still yield good results on hallucination (84.0%) and coverage error (85.2%) detection.
arXiv Detail & Related papers (2024-03-13T18:47:00Z)
- Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models [68.91592125175787]
Hallucinations pose a significant challenge for the practical implementation of large language models (LLMs).
We present Rowen, a novel approach that enhances LLMs with a selective retrieval augmentation process tailored to address hallucinations.
arXiv Detail & Related papers (2024-02-16T11:55:40Z)
- ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful for triggering hallucinations in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.