Hallucination Detection in Large Language Models with Metamorphic Relations
- URL: http://arxiv.org/abs/2502.15844v2
- Date: Tue, 11 Mar 2025 18:28:18 GMT
- Title: Hallucination Detection in Large Language Models with Metamorphic Relations
- Authors: Borui Yang, Md Afif Al Mamun, Jie M. Zhang, Gias Uddin
- Abstract summary: Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relations and prompt mutation. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets.
- Score: 7.411154122932113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are prone to hallucinations, e.g., factually incorrect information, in their responses. These hallucinations present challenges for LLM-based applications that demand high factual accuracy. Existing hallucination detection methods primarily depend on external resources, which can suffer from issues such as low availability, incomplete coverage, privacy concerns, high latency, low reliability, and poor scalability. Other methods depend on output probabilities, which are often inaccessible for closed-source LLMs like GPT models. This paper presents MetaQA, a self-contained hallucination detection approach that leverages metamorphic relations and prompt mutation. Unlike existing methods, MetaQA operates without any external resources and is compatible with both open-source and closed-source LLMs. MetaQA is based on the hypothesis that if an LLM's response is a hallucination, the designed metamorphic relations will be violated. We compare MetaQA with the state-of-the-art zero-resource hallucination detection method, SelfCheckGPT, across multiple datasets and on two open-source and two closed-source LLMs. Our results reveal that MetaQA outperforms SelfCheckGPT in terms of precision, recall, and F1-score. For the four LLMs we study, MetaQA outperforms SelfCheckGPT with a superiority margin ranging from 0.041 to 0.113 (precision), 0.143 to 0.430 (recall), and 0.154 to 0.368 (F1-score). For instance, with Mistral-7B, MetaQA achieves an average F1-score of 0.435, compared to SelfCheckGPT's 0.205, an improvement of 112.2%. MetaQA also demonstrates superiority across all categories of questions.
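To make the abstract's core idea concrete, the Python sketch below shows a generic metamorphic-relation consistency check: mutate the prompt into semantically equivalent variants, re-query the LLM, and flag the original answer when equivalent prompts yield inconsistent answers. The mutation templates, the `token_overlap` similarity, and the `threshold` are illustrative assumptions, not the paper's exact MetaQA design; `ask` stands in for any LLM client.

```python
# Minimal sketch of a metamorphic-relation hallucination check. It follows the
# abstract's hypothesis (a hallucinated answer tends to violate relations that
# hold between semantically equivalent prompts); concrete details are assumed.
from typing import Callable, List


def mutate_prompt(question: str) -> List[str]:
    """Produce semantically equivalent rephrasings (hypothetical templates)."""
    return [
        f"Please answer concisely: {question}",
        f"{question} Respond with only the answer.",
        f"In one short sentence, {question}",
    ]


def token_overlap(a: str, b: str) -> float:
    """Crude symmetric token-overlap similarity in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_hallucination(
    question: str,
    ask: Callable[[str], str],  # any LLM client: prompt in, text answer out
    threshold: float = 0.5,     # assumed consistency threshold
) -> bool:
    """Flag the original answer if mutated prompts yield inconsistent answers."""
    original = ask(question)
    mutated_answers = [ask(p) for p in mutate_prompt(question)]
    consistency = sum(token_overlap(original, m) for m in mutated_answers) / len(mutated_answers)
    # Metamorphic relation: equivalent prompts should yield consistent answers;
    # a violation (low consistency) is treated as a hallucination signal.
    return consistency < threshold
```

In practice, `ask` would wrap an open-source or closed-source model's chat endpoint, and a stronger semantic similarity measure than token overlap would likely be used; the point of the sketch is only the mutate-then-compare structure, which needs neither external knowledge bases nor output probabilities.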
Related papers
- Lie to Me: Knowledge Graphs for Robust Hallucination Self-Detection in LLMs [0.0]
We examine the use of structured knowledge representations, namely knowledge graphs, to improve hallucination self-detection. Our results show that LLMs can better analyse atomic facts when they are structured as knowledge graphs. This low-cost, model-agnostic approach paves the way toward safer and more trustworthy language models.
arXiv Detail & Related papers (2025-12-29T15:41:13Z) - Optimizing Medical Question-Answering Systems: A Comparative Study of Fine-Tuned and Zero-Shot Large Language Models with RAG Framework [0.0]
We present a retrieval-augmented generation (RAG) based medical QA system that combines domain-specific knowledge retrieval with open-source LLMs to answer medical questions. We fine-tune two state-of-the-art open LLMs (LLaMA2 and Falcon) using Low-Rank Adaptation (LoRA) for efficient domain specialization. Our fine-tuned LLaMA2 model achieves 71.8% accuracy on PubMedQA, substantially improving over the 55.4% zero-shot baseline.
arXiv Detail & Related papers (2025-12-05T16:38:47Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - Consistency Is the Key: Detecting Hallucinations in LLM Generated Text By Checking Inconsistencies About Key Facts [21.081815261690444]
Large language models (LLMs) often hallucinate and generate text that is factually incorrect and not grounded in real-world knowledge. This poses serious risks in domains like healthcare, finance, and customer support. We introduce CONFACTCHECK, an efficient detection approach that does not leverage any external knowledge base.
arXiv Detail & Related papers (2025-11-15T14:33:02Z) - CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations. CLUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z) - Fine-Grained Detection of Context-Grounded Hallucinations Using LLMs [16.173245551933178]
Context-grounded hallucinations are cases where model outputs contain information not verifiable against the source text. We study the applicability of LLMs for localizing such hallucinations, as a more practical alternative to existing complex evaluation pipelines.
arXiv Detail & Related papers (2025-09-26T17:03:24Z) - Can LLMs Infer Personality from Real World Conversations? [5.705775078773656]
Large Language Models (LLMs) offer a promising approach for scalable personality assessment from open-ended language. Three state-of-the-art LLMs were tested using zero-shot prompting for BFI-10 item prediction and both zero-shot and chain-of-thought prompting for Big Five trait inference. All models showed high test-retest reliability, but construct validity was limited.
arXiv Detail & Related papers (2025-07-18T20:22:47Z) - Meta-Fair: AI-Assisted Fairness Testing of Large Language Models [2.9632404823837777]
Fairness is a core principle in the development of Artificial Intelligence (AI) systems. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministic heuristics, and curated datasets. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs.
arXiv Detail & Related papers (2025-07-03T11:20:59Z) - Can Multimodal Large Language Models Understand Spatial Relations? [16.76001474065412]
We introduce SpatialMQA, a human-annotated spatial relation reasoning benchmark based on COCO 2017. Results indicate that the current state-of-the-art MLLM achieves only 48.14% accuracy, far below the human-level accuracy of 98.40%.
arXiv Detail & Related papers (2025-05-25T07:37:34Z) - Seeing What's Not There: Spurious Correlation in Multimodal LLMs [47.651861502104715]
We introduce SpurLens, a pipeline that automatically identifies spurious visual cues without human supervision.
Our findings reveal that spurious correlations cause two major failure modes in Multimodal Large Language Models (MLLMs).
By exposing the persistence of spurious correlations, our study calls for more rigorous evaluation methods and mitigation strategies to enhance the reliability of MLLMs.
arXiv Detail & Related papers (2025-03-11T20:53:00Z) - Uncertainty-Aware Fusion: An Ensemble Framework for Mitigating Hallucinations in Large Language Models [2.98260857963929]
Large Language Models (LLMs) are known to hallucinate and generate non-factual outputs which can undermine user trust.
Traditional methods to directly mitigate hallucinations, such as representation editing and contrastive decoding, often require additional training data and involve high implementation complexity.
We propose Uncertainty-Aware Fusion (UAF), an ensemble framework to reduce hallucinations by strategically combining multiple LLMs based on their accuracy and self-assessment abilities.
arXiv Detail & Related papers (2025-02-22T10:48:18Z) - Verbosity $\neq$ Veracity: Demystify Verbosity Compensation Behavior of Large Language Models [8.846200844870767]
We discover an understudied type of undesirable behavior of Large Language Models (LLMs), which we term Verbosity Compensation (VC), similar to the hesitation behavior of humans under uncertainty. We propose a simple yet effective cascade algorithm that replaces verbose responses with responses generated by other models.
arXiv Detail & Related papers (2024-11-12T15:15:20Z) - LLM Robustness Against Misinformation in Biomedical Question Answering [50.98256373698759]
The retrieval-augmented generation (RAG) approach is used to reduce the confabulation of large language models (LLMs) for question answering.
We evaluate the effectiveness and robustness of four LLMs against misinformation in answering biomedical questions.
arXiv Detail & Related papers (2024-10-27T16:23:26Z) - HypoTermQA: Hypothetical Terms Dataset for Benchmarking Hallucination Tendency of LLMs [0.0]
Hallucinations pose a significant challenge to the reliability and alignment of Large Language Models (LLMs).
This paper introduces an automated scalable framework that combines benchmarking LLMs' hallucination tendencies with efficient hallucination detection.
The framework is domain-agnostic, allowing the use of any language model for benchmark creation or evaluation in any domain.
arXiv Detail & Related papers (2024-02-25T22:23:37Z) - "Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations.
NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages.
We measure relevance assessment using: (i) hallucination rate, the model's tendency to hallucinate an answer when no answer is present in the passages of the non-relevant subset, and (ii) error rate, the model's inability to recognize relevant passages in the relevant subset.
arXiv Detail & Related papers (2023-12-18T17:18:04Z) - ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks [91.55895047448249]
This paper presents ReEval, an LLM-based framework using prompt chaining to perturb the original evidence for generating new test cases.
We implement ReEval using ChatGPT and evaluate the resulting variants of two popular open-domain QA datasets.
Our generated data is human-readable and useful to trigger hallucination in large language models.
arXiv Detail & Related papers (2023-10-19T06:37:32Z) - A New Benchmark and Reverse Validation Method for Passage-level Hallucination Detection [63.56136319976554]
Large Language Models (LLMs) generate hallucinations, which can cause significant damage when deployed for mission-critical tasks.
We propose a self-check approach based on reverse validation to detect factual errors automatically in a zero-resource fashion.
We empirically evaluate our method and existing zero-resource detection methods on two datasets.
arXiv Detail & Related papers (2023-10-10T10:14:59Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)