NoMIRACL: Knowing When You Don't Know for Robust Multilingual
Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2312.11361v2
- Date: Mon, 4 Mar 2024 16:32:10 GMT
- Title: NoMIRACL: Knowing When You Don't Know for Robust Multilingual
Retrieval-Augmented Generation
- Authors: Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan
Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi
Rezagholizadeh, Jimmy Lin
- Abstract summary: Retrieval-augmented generation (RAG) grounds large language model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure robustness using two metrics: (i) hallucination rate, the model's tendency to hallucinate an answer when the answer is not present in the passages of the non-relevant subset, and (ii) error rate, the model's failure to recognize the relevant passages in the relevant subset.
- Score: 92.5132418788568
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) grounds large language model (LLM)
output by leveraging external knowledge sources to reduce factual
hallucinations. However, prior works lack a comprehensive evaluation of
different language families, making it challenging to evaluate LLM robustness
against errors in external retrieved knowledge. To overcome this, we establish
NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across
18 typologically diverse languages. NoMIRACL includes both a non-relevant and a
relevant subset. Queries in the non-relevant subset contain passages judged as
non-relevant, whereas queries in the relevant subset include at least a single
judged relevant passage. We measure LLM robustness using two metrics: (i)
hallucination rate, the model's tendency to hallucinate an answer when the
answer is not present in the passages of the non-relevant subset, and (ii)
error rate, the model's failure to recognize the relevant passages in the
relevant subset. We measure robustness for a wide variety of
multilingual-focused LLMs and observe that most models struggle to balance
the two capabilities. Models such as LLAMA-2, Orca-2, and FLAN-T5 exhibit
hallucination rates above 88% on the non-relevant subset, whereas Mistral
hallucinates less overall but reaches an error rate of up to 74.9% on the
relevant subset. Overall, GPT-4 provides the best tradeoff on both subsets,
highlighting the work still needed to improve LLM robustness.
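
The two robustness metrics reduce to simple ratios over the two NoMIRACL subsets. The sketch below is illustrative only, not the authors' evaluation code: it assumes each model response has already been mapped to a boolean indicating whether the model claimed an answer was present rather than replying "I don't know", and the function and argument names are hypothetical.

    def robustness_metrics(non_relevant_claims, relevant_claims):
        """Compute NoMIRACL-style hallucination and error rates.

        non_relevant_claims: list of bool, one per query in the non-relevant
            subset; True if the model claimed an answer was present (a
            hallucination, since no retrieved passage contains the answer).
        relevant_claims: list of bool, one per query in the relevant subset;
            True if the model claimed an answer was present (correct, since
            at least one judged-relevant passage was retrieved).
        """
        # Hallucination rate: share of non-relevant-subset queries where the
        # model answered instead of abstaining with "I don't know".
        hallucination_rate = sum(non_relevant_claims) / len(non_relevant_claims)

        # Error rate: share of relevant-subset queries where the model failed
        # to recognize the relevant passage and abstained.
        error_rate = sum(1 for c in relevant_claims if not c) / len(relevant_claims)

        return hallucination_rate, error_rate

    # Toy example: the model answers 9 of 10 non-relevant queries and misses
    # 3 of 10 relevant ones.
    h, e = robustness_metrics([True] * 9 + [False], [True] * 7 + [False] * 3)
    print(f"hallucination rate: {h:.1%}, error rate: {e:.1%}")  # 90.0%, 30.0%

A robust model must drive both rates down at the same time, which is the tradeoff the abstract reports GPT-4 handles best.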
Related papers
- What do Large Language Models Need for Machine Translation Evaluation? [12.42394213466485]
Large language models (LLMs) can achieve results comparable to fine-tuned multilingual pre-trained language models.
This paper explores what translation information, such as the source, reference, translation errors and annotation guidelines, is needed for LLMs to evaluate machine translation quality.
arXiv Detail & Related papers (2024-10-04T09:50:45Z)
- Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
They generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z)
- Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends [38.86240794422485]
We evaluate the faithfulness of large language models for dialogue summarization.
Our evaluation reveals subtleties as to what constitutes a hallucination.
We introduce two prompt-based approaches for fine-grained error detection that outperform existing metrics.
arXiv Detail & Related papers (2024-06-05T17:49:47Z)
- Large Language Models are Inconsistent and Biased Evaluators [2.136983452580014]
We show that Large Language Models (LLMs) are biased evaluators as they exhibit familiarity bias and show skewed distributions of ratings.
We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality.
arXiv Detail & Related papers (2024-05-02T20:42:28Z)
- Retrieve Only When It Needs: Adaptive Retrieval Augmentation for Hallucination Mitigation in Large Language Models [68.91592125175787]
Hallucinations pose a significant challenge for the practical implementation of large language models (LLMs).
We present Rowen, a novel approach that enhances LLMs with a selective retrieval augmentation process tailored to address hallucinations.
arXiv Detail & Related papers (2024-02-16T11:55:40Z)
- Using Natural Language Explanations to Improve Robustness of In-context Learning [35.18010811754959]
Large language models (LLMs) can excel in many tasks via in-context learning (ICL).
We investigate whether augmenting ICL with natural language explanations (NLEs) improves the robustness of LLMs on adversarial datasets.
arXiv Detail & Related papers (2023-11-13T18:49:13Z)
- MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks [12.665447518524187]
This study aims to perform a thorough evaluation of the non-English capabilities of SoTA LLMs by comparing them on the same set of multilingual datasets.
Our benchmark comprises 22 datasets covering 83 languages, including low-resource African languages.
We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks.
arXiv Detail & Related papers (2023-11-13T16:45:37Z)
- Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
- Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models.
We analyze the performance of different large language models in 4 fundamental abilities required for RAG.
We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
- Language models are not naysayers: An analysis of language models on negation benchmarks [58.32362243122714]
We evaluate the ability of current-generation auto-regressive language models to handle negation.
We show that LLMs have several limitations including insensitivity to the presence of negation, an inability to capture the lexical semantics of negation, and a failure to reason under negation.
arXiv Detail & Related papers (2023-06-14T01:16:37Z)