Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews
- URL: http://arxiv.org/abs/2305.11828v3
- Date: Wed, 18 Oct 2023 13:54:15 GMT
- Title: Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews
- Authors: Hye Sun Yun, Iain J. Marshall, Thomas A. Trikalinos, Byron C. Wallace
- Abstract summary: Large language models (LLMs) offer potential to automatically generate literature reviews on demand.
LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission.
- Score: 21.546144601311187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical systematic reviews play a vital role in healthcare decision making
and policy. However, their production is time-consuming, limiting the
availability of high-quality and up-to-date evidence summaries. Recent
advancements in large language models (LLMs) offer the potential to
automatically generate literature reviews on demand, addressing this issue.
However, LLMs sometimes generate inaccurate (and potentially misleading) texts
by hallucination or omission. In healthcare, this can make LLMs unusable at
best and dangerous at worst. We conducted 16 interviews with international
systematic review experts to characterize the perceived utility and risks of
LLMs in the specific context of medical evidence reviews. Experts indicated
that LLMs can assist in the writing process by drafting summaries, generating
templates, distilling information, and cross-checking information. They also
raised concerns regarding confidently composed but inaccurate LLM outputs and
other potential downstream harms, including decreased accountability and
proliferation of low-quality reviews. Informed by this qualitative analysis, we
identify criteria for rigorous evaluation of biomedical LLMs aligned with
domain expert views.
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
- Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation [0.5070610131852027]
Large language models (LLMs) can be effectively misused for generating disinformation news articles.
This study fills this gap by evaluating the vulnerabilities of recent open and closed LLMs.
Our results demonstrate the need for stronger safety-filters and disclaimers.
arXiv Detail & Related papers (2024-12-18T09:48:53Z)
- Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Advances in large language models (LLMs) have led to their integration into peer review.
The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.
We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
- Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
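A minimal sketch of that construction recipe follows; the extract_claims and falsify helpers are hypothetical placeholders for the LLM-backed steps the paper describes, not its actual implementation.
```python
from dataclasses import dataclass
import random

@dataclass
class FactualityPair:
    source_title: str
    factual_text: str      # text grounded in the original claims
    unfactual_text: str    # text containing one falsified claim

def extract_claims(passage: str) -> list[str]:
    # Hypothetical stand-in: a real pipeline would prompt an LLM to
    # decompose the passage into atomic claims.
    return [s.strip() for s in passage.split(".") if s.strip()]

def falsify(claim: str) -> str:
    # Hypothetical stand-in: a real pipeline would prompt an LLM to
    # minimally alter the claim so that it becomes false.
    return claim + " (falsified)"

def build_pair(title: str, passage: str, rng: random.Random) -> FactualityPair:
    claims = extract_claims(passage)
    target = rng.randrange(len(claims))  # pick one claim to falsify
    corrupted = [falsify(c) if i == target else c for i, c in enumerate(claims)]
    return FactualityPair(
        source_title=title,
        factual_text=". ".join(claims) + ".",
        unfactual_text=". ".join(corrupted) + ".",
    )

if __name__ == "__main__":
    rng = random.Random(0)
    pair = build_pair(
        "Aspirin",
        "Aspirin is a nonsteroidal anti-inflammatory drug. It was first synthesized in 1897",
        rng,
    )
    print(pair.factual_text)
    print(pair.unfactual_text)
```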
arXiv Detail & Related papers (2024-11-29T12:21:15Z)
- Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track [18.3893773380282]
Hallucinations or confabulations remain one of the key challenges when using large language models (LLMs) in the biomedical domain.
Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research.
arXiv Detail & Related papers (2024-11-27T05:43:00Z)
- Reliable and diverse evaluation of LLM medical knowledge mastery [6.825565574784612]
We propose a novel framework that generates reliable and diverse test samples to evaluate medical-specific LLMs.
We use our proposed framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs.
arXiv Detail & Related papers (2024-09-22T03:13:38Z)
- LLM Internal States Reveal Hallucination Risk Faced With a Query [62.29558761326031]
Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries.
This paper investigates whether Large Language Models can estimate their own hallucination risk before response generation.
Using a probing estimator, we leverage the LLM's self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
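A minimal sketch of one way such a probe could work follows; it fits a logistic-regression classifier on stand-in hidden-state features with synthetic labels, assuming real activations and hallucination labels would be collected from the LLM, and is not the paper's actual estimator.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_queries, hidden_dim = 2000, 64

# Stand-in features and labels: a real probe would use hidden states extracted
# from the LLM for each query, labeled by whether its answer was hallucinated.
X = rng.normal(size=(n_queries, hidden_dim))
w_true = rng.normal(size=hidden_dim)
y = (X @ w_true + rng.normal(scale=2.0, size=n_queries) > 0).astype(int)  # 1 = hallucination

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# At run time, the probe scores a query's hidden state before any text is generated.
risk = probe.predict_proba(X_te)[:, 1]
print(f"held-out accuracy: {probe.score(X_te, y_te):.2f}")
print(f"estimated hallucination risk for first held-out query: {risk[0]:.2f}")
```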
arXiv Detail & Related papers (2024-07-03T17:08:52Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers with paper (meta-)reviewing.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
- FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity [20.510512358961517]
The widespread use of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts.
Previous researchers have invested much effort in assessing the harmlessness of generative language models.
arXiv Detail & Related papers (2023-11-30T14:18:47Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
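A minimal sketch of this extract-then-verify pattern follows; the complete function is a hypothetical stand-in for an LLM call (its canned reply only keeps the example runnable), and the string-matching provenance check is just one simple way the verification step could be realized.
```python
import json

def complete(prompt: str) -> str:
    # Hypothetical stand-in for an LLM completion call; returns a canned reply.
    return '[{"medication": "metformin", "evidence": "started metformin 500 mg"}]'

def extract_with_self_verification(note: str) -> list[dict]:
    # Pass 1: few-shot extraction of structured fields from the clinical note.
    draft = json.loads(complete(
        f"Extract medications as JSON with an 'evidence' span for each:\n{note}"))

    # Pass 2: self-verification. Keep an item only if its cited evidence span
    # actually appears in the source note (a provenance check); a fuller version
    # would also ask the LLM to re-check each extracted item against the note.
    return [item for item in draft if item.get("evidence", "") in note]

note = "Patient started metformin 500 mg daily for type 2 diabetes."
print(extract_with_self_verification(note))
```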
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.