Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews
- URL: http://arxiv.org/abs/2305.11828v3
- Date: Wed, 18 Oct 2023 13:54:15 GMT
- Title: Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews
- Authors: Hye Sun Yun, Iain J. Marshall, Thomas A. Trikalinos, Byron C. Wallace
- Abstract summary: Large language models (LLMs) offer potential to automatically generate literature reviews on demand.
LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission.
- Score: 21.546144601311187
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Medical systematic reviews play a vital role in healthcare decision making
and policy. However, their production is time-consuming, limiting the
availability of high-quality and up-to-date evidence summaries. Recent
advancements in large language models (LLMs) offer the potential to
automatically generate literature reviews on demand, addressing this issue.
However, LLMs sometimes generate inaccurate (and potentially misleading) texts
by hallucination or omission. In healthcare, this can make LLMs unusable at
best and dangerous at worst. We conducted 16 interviews with international
systematic review experts to characterize the perceived utility and risks of
LLMs in the specific context of medical evidence reviews. Experts indicated
that LLMs can assist in the writing process by drafting summaries, generating
templates, distilling information, and cross-checking information. They also
raised concerns regarding confidently composed but inaccurate LLM outputs and
other potential downstream harms, including decreased accountability and
proliferation of low-quality reviews. Informed by this qualitative analysis, we
identify criteria for rigorous evaluation of biomedical LLMs aligned with
domain expert views.
Related papers
- Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored.
Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities.
We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
- Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation [0.5070610131852027]
Large language models (LLMs) can be effectively misused for generating disinformation news articles.
This study fills this gap by evaluating the vulnerabilities of recent open and closed LLMs.
Our results demonstrate the need for stronger safety-filters and disclaimers.
arXiv Detail & Related papers (2024-12-18T09:48:53Z)
- Are We There Yet? Revealing the Risks of Utilizing Large Language Models in Scholarly Peer Review [66.73247554182376]
Advances in large language models (LLMs) have led to their integration into peer review.
The unchecked adoption of LLMs poses significant risks to the integrity of the peer review system.
We show that manipulating 5% of the reviews could potentially cause 12% of the papers to lose their position in the top 30% rankings.
arXiv Detail & Related papers (2024-12-02T16:55:03Z)
- Truth or Mirage? Towards End-to-End Factuality Evaluation with LLM-Oasis [78.07225438556203]
We introduce LLM-Oasis, the largest resource for training end-to-end factuality evaluators.
It is constructed by extracting claims from Wikipedia, falsifying a subset of these claims, and generating pairs of factual and unfactual texts.
We then rely on human annotators to both validate the quality of our dataset and to create a gold standard test set for factuality evaluation systems.
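A minimal sketch of that construction recipe follows; the extract_claims and falsify helpers are hypothetical placeholders for the LLM-backed steps the paper describes, not its actual implementation.
```python
from dataclasses import dataclass
import random

@dataclass
class FactualityPair:
    source_title: str
    factual_text: str      # text grounded in the original claims
    unfactual_text: str    # text containing one falsified claim

def extract_claims(passage: str) -> list[str]:
    # Hypothetical stand-in: a real pipeline would prompt an LLM to
    # decompose the passage into atomic claims.
    return [s.strip() for s in passage.split(".") if s.strip()]

def falsify(claim: str) -> str:
    # Hypothetical stand-in: a real pipeline would prompt an LLM to
    # minimally alter the claim so that it becomes false.
    return claim + " (falsified)"

def build_pair(title: str, passage: str, rng: random.Random) -> FactualityPair:
    claims = extract_claims(passage)
    target = rng.randrange(len(claims))  # pick one claim to falsify
    corrupted = [falsify(c) if i == target else c for i, c in enumerate(claims)]
    return FactualityPair(
        source_title=title,
        factual_text=". ".join(claims) + ".",
        unfactual_text=". ".join(corrupted) + ".",
    )

if __name__ == "__main__":
    rng = random.Random(0)
    pair = build_pair(
        "Aspirin",
        "Aspirin is a nonsteroidal anti-inflammatory drug. It was first synthesized in 1897",
        rng,
    )
    print(pair.factual_text)
    print(pair.unfactual_text)
```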
arXiv Detail & Related papers (2024-11-29T12:21:15Z)
- Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track [18.3893773380282]
Hallucinations or confabulations remain one of the key challenges when using large language models (LLMs) in the biomedical domain.
Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research.
arXiv Detail & Related papers (2024-11-27T05:43:00Z)
- Reliable and diverse evaluation of LLM medical knowledge mastery [6.825565574784612]
We propose a novel framework that generates reliable and diverse test samples to evaluate medical-specific LLMs.
We use our proposed framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs.
arXiv Detail & Related papers (2024-09-22T03:13:38Z)
- LLM Internal States Reveal Hallucination Risk Faced With a Query [62.29558761326031]
Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries.
This paper investigates whether Large Language Models can estimate their own hallucination risk before response generation.
Using a probing estimator, we leverage the LLM's self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
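A minimal sketch of one way such a probe could work follows; it fits a logistic-regression classifier on stand-in hidden-state features with synthetic labels, assuming real activations and hallucination labels would be collected from the LLM, and is not the paper's actual estimator.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_queries, hidden_dim = 2000, 64

# Stand-in features and labels: a real probe would use hidden states extracted
# from the LLM for each query, labeled by whether its answer was hallucinated.
X = rng.normal(size=(n_queries, hidden_dim))
w_true = rng.normal(size=hidden_dim)
y = (X @ w_true + rng.normal(scale=2.0, size=n_queries) > 0).astype(int)  # 1 = hallucination

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# At run time, the probe scores a query's hidden state before any text is generated.
risk = probe.predict_proba(X_te)[:, 1]
print(f"held-out accuracy: {probe.score(X_te, y_te):.2f}")
print(f"estimated hallucination risk for first held-out query: {risk[0]:.2f}")
```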
arXiv Detail & Related papers (2024-07-03T17:08:52Z)
- LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing [106.45895712717612]
Large language models (LLMs) have shown remarkable versatility in various generative tasks.
This study focuses on how LLMs can assist NLP researchers with paper (meta-)reviewing.
To our knowledge, this is the first work to provide such a comprehensive analysis.
arXiv Detail & Related papers (2024-06-24T01:30:22Z)
- FFT: Towards Harmlessness Evaluation and Analysis for LLMs with Factuality, Fairness, Toxicity [20.510512358961517]
The widespread use of generative artificial intelligence has heightened concerns about the potential harms posed by AI-generated texts.
Previous researchers have invested much effort in assessing the harmlessness of generative language models.
arXiv Detail & Related papers (2023-11-30T14:18:47Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
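A minimal sketch of this extract-then-verify pattern follows; the complete function is a hypothetical stand-in for an LLM call (its canned reply only keeps the example runnable), and the string-matching provenance check is just one simple way the verification step could be realized.
```python
import json

def complete(prompt: str) -> str:
    # Hypothetical stand-in for an LLM completion call; returns a canned reply.
    return '[{"medication": "metformin", "evidence": "started metformin 500 mg"}]'

def extract_with_self_verification(note: str) -> list[dict]:
    # Pass 1: few-shot extraction of structured fields from the clinical note.
    draft = json.loads(complete(
        f"Extract medications as JSON with an 'evidence' span for each:\n{note}"))

    # Pass 2: self-verification. Keep an item only if its cited evidence span
    # actually appears in the source note (a provenance check); a fuller version
    # would also ask the LLM to re-check each extracted item against the note.
    return [item for item in draft if item.get("evidence", "") in note]

note = "Patient started metformin 500 mg daily for type 2 diabetes."
print(extract_with_self_verification(note))
```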
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.