Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence
- URL: http://arxiv.org/abs/2507.02975v1
- Date: Mon, 30 Jun 2025 18:00:52 GMT
- Title: Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence
- Authors: Julian D Baldwin, Christina Dinh, Arjun Mukerji, Neil Sanghavi, Saurabh Gombar
- Abstract summary: Large language models (LLMs) for biomedical question answering raise concerns about the accuracy and evidentiary support of their responses. We analyzed thousands of physician-submitted questions using a comparative pipeline that included: (1) Alexandria, formerly known as the Atropos Evidence Library, a retrieval-augmented generation (RAG) system based on novel observational studies, and (2) two PubMed-based retrieval-augmented systems (System and Perplexity). We found that PubMed-based systems provided evidence-supported answers for approximately 44% of questions, while the novel evidence source did so for about 50%.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing use of large language models (LLMs) for biomedical question answering raises concerns about the accuracy and evidentiary support of their responses. To address this, we present Answered with Evidence, a framework for evaluating whether LLM-generated answers are grounded in scientific literature. We analyzed thousands of physician-submitted questions using a comparative pipeline that included: (1) Alexandria, formerly known as the Atropos Evidence Library, a retrieval-augmented generation (RAG) system based on novel observational studies, and (2) two PubMed-based retrieval-augmented systems (System and Perplexity). We found that PubMed-based systems provided evidence-supported answers for approximately 44% of questions, while the novel evidence source did so for about 50%. Combined, these sources enabled reliable answers to over 70% of biomedical queries. As LLMs become increasingly capable of summarizing scientific content, maximizing their value will require systems that can accurately retrieve both published and custom-generated evidence or generate such evidence in real time.
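The coverage comparison in the abstract can be sketched as a small illustrative script. This is not the authors' code; the questions, hit sets, and function names are hypothetical, chosen only to mirror the reported proportions (~44% PubMed, ~50% novel evidence, >70% combined):

```python
# Illustrative sketch of the evidence-coverage logic: a question counts as
# "answered with evidence" if at least one retrieval source returns
# supporting literature for it.

def coverage(questions, pubmed_hits, novel_hits):
    """Return the fraction of questions supported by the PubMed-based
    systems, by the novel evidence source, and by either source combined."""
    n = len(questions)
    pubmed = sum(q in pubmed_hits for q in questions) / n
    novel = sum(q in novel_hits for q in questions) / n
    combined = sum(q in pubmed_hits or q in novel_hits for q in questions) / n
    return pubmed, novel, combined

# Toy data loosely mirroring the reported proportions:
questions = [f"q{i}" for i in range(10)]
pubmed_hits = {"q0", "q1", "q2", "q3"}        # 40% supported via PubMed
novel_hits = {"q3", "q4", "q5", "q6", "q7"}   # 50% supported via novel evidence
print(coverage(questions, pubmed_hits, novel_hits))  # (0.4, 0.5, 0.8)
```

The combined figure exceeds either source alone because the two sources only partially overlap, which is the paper's central observation.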
Related papers
- Enhancing LLM Generation with Knowledge Hypergraph for Evidence-Based Medicine [22.983780823136925]
Evidence-based medicine (EBM) plays a crucial role in the application of large language models (LLMs) in healthcare. We propose using LLMs to gather scattered evidence from multiple sources and present a knowledge hypergraph-based evidence management model. Our approach outperforms existing RAG techniques in application domains of interest to EBM, such as medical quizzing, hallucination detection, and decision support.
arXiv Detail & Related papers (2025-03-18T09:17:31Z)
- Overview of TREC 2024 Biomedical Generative Retrieval (BioGen) Track [18.3893773380282]
Hallucinations or confabulations remain one of the key challenges when using large language models (LLMs) in the biomedical domain. Inaccuracies may be particularly harmful in high-risk situations, such as medical question answering, making clinical decisions, or appraising biomedical research.
arXiv Detail & Related papers (2024-11-27T05:43:00Z)
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- ELOQ: Resources for Enhancing LLM Detection of Out-of-Scope Questions [52.33835101586687]
We study out-of-scope questions, where the retrieved document appears semantically similar to the question but lacks the necessary information to answer it. We propose a guided hallucination-based approach, ELOQ, to automatically generate a diverse set of out-of-scope questions from post-cutoff documents.
arXiv Detail & Related papers (2024-10-18T16:11:29Z)
- Explainable Biomedical Hypothesis Generation via Retrieval Augmented Generation enabled Large Language Models [46.05020842978823]
Large Language Models (LLMs) have emerged as powerful tools to navigate this complex data landscape.
RAGGED is a comprehensive workflow designed to support investigators with knowledge integration and hypothesis generation.
arXiv Detail & Related papers (2024-07-17T07:44:18Z)
- How do you know that? Teaching Generative Language Models to Reference Answers to Biomedical Questions [0.0]
Large language models (LLMs) have recently become the leading source of answers for users' questions online.
Despite their ability to offer eloquent answers, their accuracy and reliability can pose a significant challenge.
This paper introduces a biomedical retrieval-augmented generation (RAG) system designed to enhance the reliability of generated responses.
arXiv Detail & Related papers (2024-07-06T09:10:05Z)
- Answering real-world clinical questions using large language model based systems [2.2605659089865355]
Large language models (LLMs) could potentially address both challenges by either summarizing published literature or generating new studies based on real-world data (RWD).
We evaluated the ability of five LLM-based systems in answering 50 clinical questions and had nine independent physicians review the responses for relevance, reliability, and actionability.
arXiv Detail & Related papers (2024-06-29T22:39:20Z)
- De-identification of clinical free text using natural language processing: A systematic review of current approaches [48.343430343213896]
Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.
Our study aims to provide systematic evidence on how the de-identification of clinical free text has evolved in the last thirteen years.
arXiv Detail & Related papers (2023-11-28T13:20:41Z)
- Clinfo.ai: An Open-Source Retrieval-Augmented Large Language Model System for Answering Medical Questions using Scientific Literature [44.715854387549605]
We release Clinfo.ai, an open-source WebApp that answers clinical questions based on dynamically retrieved scientific literature.
We report benchmark results for Clinfo.ai and other publicly available OpenQA systems on PubMedRS-200.
arXiv Detail & Related papers (2023-10-24T19:43:39Z)
- Generating Explanations in Medical Question-Answering by Expectation Maximization Inference over Evidence [33.018873142559286]
We propose a novel approach for generating natural language explanations for answers predicted by medical QA systems.
Our system extracts knowledge from medical textbooks to enhance the quality of explanations during the explanation generation process.
arXiv Detail & Related papers (2023-10-02T16:00:37Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT). LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- Medical Question Understanding and Answering with Knowledge Grounding and Semantic Self-Supervision [53.692793122749414]
We introduce a medical question understanding and answering system with knowledge grounding and semantic self-supervision.
Our system is a pipeline that first summarizes a long, medical, user-written question using a supervised summarization loss. It then matches the summarized question with an FAQ from a trusted medical knowledge base, and finally retrieves a fixed number of relevant sentences from the corresponding answer document.
arXiv Detail & Related papers (2022-09-30T08:20:32Z)
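The three-step pipeline in the entry above (summarize the question, match it to a trusted FAQ, retrieve a fixed number of evidence sentences) can be sketched as follows. This is a minimal illustration under assumed interfaces, with simple token overlap standing in for the paper's supervised summarization and matching models; all names and data are hypothetical:

```python
def similarity(a: str, b: str) -> float:
    """Jaccard token overlap; a crude stand-in for a learned matcher."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def answer(user_question: str, summarizer, faq: dict, k: int = 2):
    summary = summarizer(user_question)                        # step 1: summarize the question
    best_faq = max(faq, key=lambda q: similarity(summary, q))  # step 2: match against a trusted FAQ
    sentences = faq[best_faq].split(". ")
    ranked = sorted(sentences, key=lambda s: similarity(summary, s), reverse=True)
    return best_faq, ranked[:k]                                # step 3: fixed number of sentences

# Hypothetical FAQ knowledge base:
faq = {
    "what is aspirin used for": (
        "Aspirin is used for pain relief. It also reduces fever. It thins the blood."
    ),
    "how does insulin work": (
        "Insulin lowers blood glucose. It signals cells to absorb sugar."
    ),
}
matched, evidence = answer("What is aspirin used for exactly?", lambda q: q, faq)
```

Here the summarizer is an identity function for brevity; in the actual system both the summarizer and the matcher are trained components.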
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.