Related papers: Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track

Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track

URL: http://arxiv.org/abs/2411.18069v2
Date: Sat, 14 Dec 2024 05:56:10 GMT
Title: Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track
Authors: Deepak Gupta, Dina Demner-Fushman, William Hersh, Steven Bedrick, Kirk Roberts,
Abstract summary: hallucinations or confabulations remain one of the key challenges when using large language models (LLMs) in the biomedical domain.<n>Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research.
Score: 18.3893773380282
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: With the advancement of large language models (LLMs), the biomedical domain has seen significant progress and improvement in multiple tasks such as biomedical question answering, lay language summarization of the biomedical literature, clinical note summarization, etc. However, hallucinations or confabulations remain one of the key challenges when using LLMs in the biomedical and other domains. Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research. Studies on the evaluation of the LLMs abilities to ground generated statements in verifiable sources have shown that models perform significantly worse on lay-user-generated questions, and often fail to reference relevant sources. This can be problematic when those seeking information want evidence from studies to back up the claims from LLMs. Unsupported statements are a major barrier to using LLMs in any applications that may affect health. Methods for grounding generated statements in reliable sources along with practical evaluation approaches are needed to overcome this barrier. Towards this, in our pilot task organized at TREC 2024, we introduced the task of reference attribution as a means to mitigate the generation of false statements by LLMs answering biomedical questions.

Related papers

Dr. GPT Will See You Now, but Should It? Exploring the Benefits and Harms of Large Language Models in Medical Diagnosis using Crowdsourced Clinical Cases [7.894865736540358]
Large Language Models (LLMs) are used in high-stakes applications such as medical (self-diagnosis) and preliminary triage.<n>This paper presents the findings from a university-level competition that leveraged a novel, crowdsourced approach for evaluating the effectiveness of LLMs.
arXiv Detail & Related papers (2025-06-13T17:12:47Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it.<n>We propose a guided hallucination-based approach ELOQ to automatically generate a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
A Survey for Large Language Models in Biomedicine [31.719451674137844]
This review is based on an analysis of 484 publications sourced from databases including PubMed, Web of Science, and arXiv. We explore the capabilities of LLMs in zero-shot learning across a broad spectrum of biomedical tasks, including diagnostic assistance, drug discovery, and personalized medicine. We discuss the challenges that LLMs face in the biomedicine domain including data privacy concerns, limited model interpretability, issues with dataset quality, and ethics.
arXiv Detail & Related papers (2024-08-29T12:39:16Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering [5.065947993017158]
Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. We examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews.
arXiv Detail & Related papers (2024-06-09T16:33:28Z)
An Evaluation of Large Language Models in Bioinformatics Research [52.100233156012756]
We study the performance of large language models (LLMs) on a wide spectrum of crucial bioinformatics tasks. These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems. Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks.
arXiv Detail & Related papers (2024-02-21T11:27:31Z)
Insights into Classifying and Mitigating LLMs' Hallucinations [48.04565928175536]
This paper delves into the underlying causes of AI hallucination and elucidates its significance in artificial intelligence. We explore potential strategies to mitigate hallucinations, aiming to enhance the overall reliability of large language models.
arXiv Detail & Related papers (2023-11-14T12:30:28Z)
Don't Ignore Dual Logic Ability of LLMs while Privatizing: A Data-Intensive Analysis in Medical Domain [19.46334739319516]
We study how the dual logic ability of LLMs is affected during the privatization process in the medical domain. Our results indicate that incorporating general domain dual logic data into LLMs not only enhances LLMs' dual logic ability but also improves their accuracy.
arXiv Detail & Related papers (2023-09-08T08:20:46Z)
Opportunities and Challenges for ChatGPT and Large Language Models in Biomedicine and Health [22.858424132819795]
ChatGPT has led to the emergence of diverse applications in the field of biomedicine and health. We explore the areas of biomedical information retrieval, question answering, medical text summarization, and medical education. We find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods.
arXiv Detail & Related papers (2023-06-15T20:19:08Z)
Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning. They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health. Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
On the Risk of Misinformation Pollution with Large Language Models [127.1107824751703]
We investigate the potential misuse of modern Large Language Models (LLMs) for generating credible-sounding misinformation. Our study reveals that LLMs can act as effective misinformation generators, leading to a significant degradation in the performance of Open-Domain Question Answering (ODQA) systems.
arXiv Detail & Related papers (2023-05-23T04:10:26Z)
Appraising the Potential Uses and Harms of LLMs for Medical Systematic Reviews [21.546144601311187]
Large language models (LLMs) offer potential to automatically generate literature reviews on demand. LLMs sometimes generate inaccurate (and potentially misleading) texts by hallucination or omission.
arXiv Detail & Related papers (2023-05-19T17:09:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.