MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
- URL: http://arxiv.org/abs/2409.19492v2
- Date: Thu, 07 Aug 2025 02:13:10 GMT
- Title: MedHalu: Hallucinations in Responses to Healthcare Queries by Large Language Models
- Authors: Vibhor Agarwal, Yiqiao Jin, Mohit Chandra, Munmun De Choudhury, Srijan Kumar, Nishanth Sastry
- Abstract summary: Large language models (LLMs) are starting to complement traditional information seeking mechanisms such as web search. LLMs are prone to hallucinations, generating plausible yet factually incorrect or fabricated information. This work conducts a pioneering study on hallucinations in LLM-generated responses to real-world healthcare queries from patients.
- Score: 26.464489158584463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are starting to complement traditional information seeking mechanisms such as web search. LLM-powered chatbots like ChatGPT are gaining prominence among the general public. AI chatbots are also increasingly producing content on social media platforms. However, LLMs are also prone to hallucinations, generating plausible yet factually incorrect or fabricated information. This becomes a critical problem when laypeople start seeking information about sensitive issues such as healthcare. Existing works in LLM hallucinations in the medical domain mainly focus on testing the medical knowledge of LLMs through standardized medical exam questions which are often well-defined and clear-cut with definitive answers. However, these approaches may not fully capture how these LLMs perform during real-world interactions with patients. This work conducts a pioneering study on hallucinations in LLM-generated responses to real-world healthcare queries from patients. We introduce MedHalu, a novel medical hallucination benchmark featuring diverse health-related topics and hallucinated responses from LLMs, with detailed annotation of the hallucination types and text spans. We also propose MedHaluDetect, a comprehensive framework for evaluating LLMs' abilities to detect hallucinations. Furthermore, we study the vulnerability to medical hallucinations among three groups -- medical experts, LLMs, and laypeople. Notably, LLMs significantly underperform human experts and, in some cases, even laypeople in detecting medical hallucinations. To improve hallucination detection, we propose an expert-in-the-loop approach that integrates expert reasoning into LLM inputs, significantly improving hallucination detection for all LLMs, including a 6.3% macro-F1 improvement for GPT-4.
Related papers
- Hallucinations Can Improve Large Language Models in Drug Discovery [10.573861741540853]
Concerns about hallucinations in Large Language Models (LLMs) have been raised by researchers, yet their potential in areas where creativity is vital, such as drug discovery, merits exploration.
In this paper, we come up with the hypothesis that hallucinations can improve LLMs in drug discovery.
arXiv Detail & Related papers (2025-01-23T16:45:51Z) - Look Within, Why LLMs Hallucinate: A Causal Perspective [16.874588396996764]
Large language models (LLMs) are a milestone in generative artificial intelligence, achieving significant success in text comprehension and generation tasks.
LLMs suffer from severe hallucination problems, posing significant challenges to the practical applications of LLMs.
We propose a method to intervene in LLMs' self-attention layers and maintain their structures and sizes intact.
arXiv Detail & Related papers (2024-07-14T10:47:44Z) - Hallucination Detection: Robustly Discerning Reliable Answers in Large Language Models [70.19081534515371]
Large Language Models (LLMs) have gained widespread adoption in various natural language processing tasks.
They generate unfaithful or inconsistent content that deviates from the input source, leading to severe consequences.
We propose a robust discriminator named RelD to effectively detect hallucination in LLMs' generated answers.
arXiv Detail & Related papers (2024-07-04T18:47:42Z) - Mitigating Object Hallucination in MLLMs via Data-augmented Phrase-level Alignment [52.43197107069751]
Multimodal Large Language Models (MLLMs) often generate factually inaccurate information, referred to as hallucination.
We introduce Data-augmented Phrase-level Alignment (DPA), a novel loss which can be applied to instruction-tuned off-the-shelf MLLMs to mitigate hallucinations.
arXiv Detail & Related papers (2024-05-28T23:36:00Z) - Do LLMs Know about Hallucination? An Empirical Investigation of LLM's Hidden States [19.343629282494774]
Large Language Models (LLMs) can make up answers that are not real, and this is known as hallucination.
This research aims to see if, how, and to what extent LLMs are aware of hallucination.
arXiv Detail & Related papers (2024-02-15T06:14:55Z) - Hallucination Detection and Hallucination Mitigation: An Investigation [13.941799495842776]
Large language models (LLMs) have achieved remarkable successes over the last two years in a range of different applications.
This report aims to present a comprehensive review of the current literature on both hallucination detection and hallucination mitigation.
arXiv Detail & Related papers (2024-01-16T13:36:07Z) - The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models [134.6697160940223]
Hallucination poses a great challenge to the trustworthy and reliable deployment of large language models.
Three key questions should be well studied: how to detect hallucinations (detection), why do LLMs hallucinate (source), and what can be done to mitigate them.
This work presents a systematic empirical study on LLM hallucination, focused on the three aspects of hallucination detection, source, and mitigation.
arXiv Detail & Related papers (2024-01-06T12:40:45Z) - A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions [40.79317187623401]
The emergence of large language models (LLMs) has marked a significant breakthrough in natural language processing (NLP).
LLMs are prone to hallucination, generating plausible yet nonfactual content.
This phenomenon raises significant concerns over the reliability of LLMs in real-world information retrieval systems.
arXiv Detail & Related papers (2023-11-09T09:25:37Z) - Analyzing and Mitigating Object Hallucination in Large Vision-Language Models [110.12460299261531]
Large vision-language models (LVLMs) have shown remarkable abilities in understanding visual information with human languages.
LVLMs still suffer from object hallucination, which is the problem of generating descriptions that include objects that do not actually exist in the images.
We propose a powerful algorithm, LVLM Hallucination Revisor (LURE), to rectify object hallucination in LVLMs by reconstructing less hallucinatory descriptions.
arXiv Detail & Related papers (2023-10-01T18:10:53Z) - Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models [116.01843550398183]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks.
LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge.
arXiv Detail & Related papers (2023-09-03T16:56:48Z) - Evaluation and Analysis of Hallucination in Large Vision-Language Models [49.19829480199372]
Large Vision-Language Models (LVLMs) have recently achieved remarkable success.
LVLMs are still plagued by the hallucination problem.
Hallucination refers to information in LVLMs' responses that does not exist in the visual input.
arXiv Detail & Related papers (2023-08-29T08:51:24Z) - Halo: Estimation and Reduction of Hallucinations in Open-Source Weak Large Language Models [11.497989461290793]
Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP).
Open-source LLMs with fewer parameters often suffer from severe hallucinations compared to their larger counterparts.
This paper focuses on measuring and reducing hallucinations in BLOOM 7B, a representative of such weaker open-source LLMs.
arXiv Detail & Related papers (2023-08-22T20:12:49Z) - HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models [146.87696738011712]
Large language models (LLMs) are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge.
To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation benchmark for Large Language Models (HaluEval).
arXiv Detail & Related papers (2023-05-19T15:36:27Z) - Evaluating Object Hallucination in Large Vision-Language Models [122.40337582958453]
This work presents the first systematic study on object hallucination of large vision-language models (LVLMs).
We find that LVLMs tend to generate objects that are inconsistent with the target images in the descriptions.
We propose a polling-based query method called POPE to evaluate the object hallucination.
arXiv Detail & Related papers (2023-05-17T16:34:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.