Zero-Shot Multi-task Hallucination Detection
- URL: http://arxiv.org/abs/2403.12244v1
- Date: Mon, 18 Mar 2024 20:50:26 GMT
- Title: Zero-Shot Multi-task Hallucination Detection
- Authors: Patanjali Bhamidipati, Advaith Malladi, Manish Shrivastava, Radhika Mamidi
- Abstract summary: Hallucination is an emergent condition in a model where the generated text lacks faithfulness to the source.
We formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting.
In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting.
- Score: 8.539639901976594
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent studies, the extensive utilization of large language models has underscored the importance of robust evaluation methodologies for assessing text generation quality and relevance to specific tasks. This has revealed a prevalent issue known as hallucination, an emergent condition in the model where generated text lacks faithfulness to the source and deviates from the evaluation criteria. In this study, we formally define hallucination and propose a framework for its quantitative detection in a zero-shot setting, leveraging our definition and the assumption that model outputs entail task- and sample-specific inputs. In detecting hallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61 in a model-agnostic setting. Notably, our solution maintains computational efficiency, requiring far fewer computational resources than other SOTA approaches, aligning with the trend towards lightweight and compressed models.
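The framework rests on the assumption that a faithful output should be entailed by its task- and sample-specific input. As a rough, hypothetical illustration of that idea (not the paper's actual implementation), one could score entailment between source and output with an off-the-shelf NLI model and flag low-entailment generations; the model name and threshold below are assumptions.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative sketch only: an off-the-shelf NLI model scores whether the generated
# text is entailed by the source. Model choice and threshold are assumptions, not
# the setup reported in the paper.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def hallucination_score(source: str, generated: str) -> float:
    """Return 1 - P(entailment); higher values suggest a less faithful output."""
    inputs = tokenizer(source, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    return 1.0 - probs[2].item()  # roberta-large-mnli: index 2 = entailment

def is_hallucinated(source: str, generated: str, threshold: float = 0.5) -> bool:
    return hallucination_score(source, generated) > threshold
```

A check of this kind needs only a single encoder pass per example, which is consistent with the abstract's emphasis on lightweight, computationally efficient detection.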
Related papers
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)
- A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z)
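The reverse-validation paper above checks an LLM's own passages without any external resources. Purely as a hypothetical sketch of a zero-resource self-check in that spirit (the paper's actual procedure differs, and `llm` is a placeholder callable, not an API from the paper):

```python
from typing import Callable

# Hypothetical zero-resource self-check: re-ask the same model about a claim drawn
# from its own output and see whether the answers agree. `llm` is a placeholder for
# any text-in/text-out model call; the prompts are illustrative, not from the paper.
def self_check(llm: Callable[[str], str], claim: str) -> bool:
    question = llm(f"Write a question whose correct answer is exactly: {claim}")
    answer = llm(f"Answer concisely from general knowledge.\nQuestion: {question}")
    verdict = llm(
        "Do these two statements agree with each other? Reply YES or NO.\n"
        f"Statement A: {claim}\nStatement B: {answer}"
    )
    return verdict.strip().upper().startswith("NO")  # True -> likely factual error
```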
- Revisiting Sample Size Determination in Natural Language Understanding [18.637079595450366]
Knowing exactly how many data points need to be labeled to achieve a certain model performance is a beneficial step towards reducing the overall annotation budget.
We derive a simple yet effective approach to predict the maximum achievable model performance from a small number of training samples.
arXiv Detail & Related papers (2023-07-01T16:08:52Z)
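One standard way to predict maximum achievable performance from small pilot sets, as in the sample-size paper above (assumed here for illustration; that paper's own estimator may differ), is to fit a learning curve to a few pilot runs and read off its asymptote:

```python
import numpy as np
from scipy.optimize import curve_fit

# Hedged sketch: fit an inverse power-law learning curve to a few (n, accuracy) pilot
# points and treat the asymptote as the predicted maximum achievable performance.
# The functional form and the data points are assumptions, not the paper's results.
def learning_curve(n, a, b, c):
    return a - b * np.power(n, -c)  # a = asymptotic (maximum achievable) performance

n_labeled = np.array([100, 200, 500, 1000, 2000])
accuracy = np.array([0.62, 0.68, 0.74, 0.78, 0.81])  # hypothetical pilot results

params, _ = curve_fit(
    learning_curve, n_labeled, accuracy,
    p0=[0.9, 1.0, 0.5], bounds=([0.0, 0.0, 0.0], [1.0, 10.0, 2.0]),
)
print(f"predicted performance ceiling: {params[0]:.3f}")
print(f"predicted accuracy with 50k labels: {learning_curve(50_000, *params):.3f}")
```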
- Probing of Quantitative Values in Abstractive Summarization Models [0.0]
We evaluate how well abstractive summarization models represent quantitative values found in the input text.
Our results show that in most cases, the encoders of recent SOTA-performing models struggle to provide embeddings that adequately represent quantitative values.
arXiv Detail & Related papers (2022-10-03T00:59:50Z)
- Fake It Till You Make It: Near-Distribution Novelty Detection by Score-Based Generative Models [54.182955830194445]
Existing models either fail or face a dramatic performance drop under the so-called "near-distribution" setting.
We propose to exploit a score-based generative model to produce synthetic near-distribution anomalous data.
Our method improves near-distribution novelty detection by 6% and surpasses the state-of-the-art by 1% to 5% across nine novelty detection benchmarks.
arXiv Detail & Related papers (2022-05-28T02:02:53Z)
- Self-Normalized Importance Sampling for Neural Language Modeling [97.96857871187052]
In this work, we propose self-normalized importance sampling. Compared to our previous work, the sampling criteria considered here are self-normalized, so no additional correction step is needed.
We show that our proposed self-normalized importance sampling is competitive in both research-oriented and production-oriented automatic speech recognition tasks.
arXiv Detail & Related papers (2021-11-11T16:57:53Z)
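For context, the generic self-normalized importance sampling estimator (independent of that paper's LM-specific training setup) divides by the sum of the weights themselves, which is why the unknown normalizer, and hence a separate correction step, drops out; a minimal numerical sketch:

```python
import numpy as np

# Generic self-normalized importance sampling (SNIS): estimate E_p[f(x)] from samples
# x_i ~ q with unnormalized weights w_i proportional to p(x_i) / q(x_i). Normalizing
# by the sum of the weights means p only needs to be known up to a constant.
# Purely illustrative; not code from the paper.
def snis_estimate(f_vals: np.ndarray, log_p_unnorm: np.ndarray, log_q: np.ndarray) -> float:
    log_w = log_p_unnorm - log_q
    log_w -= log_w.max()          # stabilize before exponentiating
    w = np.exp(log_w)
    return float(np.sum(w * f_vals) / np.sum(w))
```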
- Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection [54.38512834521367]
We study contrast candidate generation and selection as a model-agnostic post-processing technique.
We learn a discriminative correction model by generating alternative candidate summaries.
This model is then used to select the best candidate as the final output summary.
arXiv Detail & Related papers (2021-04-19T05:39:24Z)
- How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models [95.8037674226622]
We introduce a 3-dimensional evaluation metric that characterizes the fidelity, diversity and generalization performance of any generative model in a domain-agnostic fashion.
Our metric unifies statistical divergence measures with precision-recall analysis, enabling sample- and distribution-level diagnoses of model fidelity and diversity.
arXiv Detail & Related papers (2021-02-17T18:25:30Z)
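As a generic illustration of sample-level fidelity and diversity diagnostics (a simple k-NN manifold coverage check, not the paper's exact three-dimensional metric), one can measure what fraction of generated samples fall inside the real data's neighborhood structure and vice versa; the data and the value of k below are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Generic sketch of sample-level precision/recall-style diagnostics for generative
# models. The manifold-coverage idea is standard; it is not the paper's exact metric.
def knn_radii(X: np.ndarray, k: int = 5) -> np.ndarray:
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    return dists[:, -1]  # distance to k-th neighbor = local manifold radius

def coverage(candidates: np.ndarray, reference: np.ndarray, k: int = 5) -> float:
    """Fraction of candidate points lying inside the reference set's k-NN manifold."""
    radii = knn_radii(reference, k)
    dists, idx = NearestNeighbors(n_neighbors=1).fit(reference).kneighbors(candidates)
    return float(np.mean(dists[:, 0] <= radii[idx[:, 0]]))

real = np.random.randn(500, 16)   # stand-ins for feature embeddings of real data
fake = np.random.randn(500, 16)   # ... and of generated data
precision = coverage(fake, real)  # fidelity: generated samples lie on the real manifold
recall = coverage(real, fake)     # diversity: real samples are covered by the generator
print(f"precision ~ {precision:.2f}, recall ~ {recall:.2f}")
```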
This list is automatically generated from the titles and abstracts of the papers on this site.