Related papers: Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering

Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering

URL: http://arxiv.org/abs/2510.18691v1
Date: Tue, 21 Oct 2025 14:50:24 GMT
Title: Investigating LLM Capabilities on Long Context Comprehension for Medical Question Answering
Authors: Feras AlMannaa, Talia Tseriotou, Jenny Chim, Maria Liakata,
Abstract summary: This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance.<n>Our comprehensive assessment spans a range of content-inclusion settings based on relevance, LLM models of varying capabilities and datasets across task formulations.<n>We examine the effect of RAG on medical LC comprehension, uncover best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC.
Score: 11.557033367530053
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This study is the first to investigate LLM comprehension capabilities over long-context (LC) medical QA of clinical relevance. Our comprehensive assessment spans a range of content-inclusion settings based on their relevance, LLM models of varying capabilities and datasets across task formulations, revealing insights on model size effects, limitations, underlying memorization issues and the benefits of reasoning models. Importantly, we examine the effect of RAG on medical LC comprehension, uncover best settings in single versus multi-document reasoning datasets and showcase RAG strategies for improvements over LC. We shed light into some of the evaluation aspects using a multi-faceted approach. Our qualitative and error analyses address open questions on when RAG is beneficial over LC, revealing common failure cases.

Related papers

Demo: Guide-RAG: Evidence-Driven Corpus Curation for Retrieval-Augmented Generation in Long COVID [3.716027375029187]
We developed and evaluated six Retrieval-Augmented Generation (RAG) corpus configurations for Long COVID (LC) clinical question answering.<n>Our RAG corpus configuration combining clinical guidelines with high-quality systematic reviews consistently outperformed both narrow single-guideline approaches and large-scale literature databases.
arXiv Detail & Related papers (2025-10-17T16:05:52Z)
Evaluating LLMs Across Multi-Cognitive Levels: From Medical Knowledge Mastery to Scenario-Based Problem Solving [18.815592287807394]
Large language models (LLMs) have demonstrated remarkable performance on various medical benchmarks.<n>But their capabilities across different cognitive levels remain underexplored.<n>We propose a multi-cognitive-level evaluation framework for assessing LLMs in the medical domain.
arXiv Detail & Related papers (2025-06-10T02:07:33Z)
How do Large Language Models Understand Relevance? A Mechanistic Interpretability Perspective [64.00022624183781]
Large language models (LLMs) can assess relevance and support information retrieval (IR) tasks.<n>We investigate how different LLM modules contribute to relevance judgment through the lens of mechanistic interpretability.
arXiv Detail & Related papers (2025-04-10T16:14:55Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
Mitigating Hallucinations of Large Language Models in Medical Information Extraction via Contrastive Decoding [92.32881381717594]
We introduce ALternate Contrastive Decoding (ALCD) to solve hallucination issues in medical information extraction tasks. ALCD demonstrates significant improvements in resolving hallucination issues compared to conventional decoding methods.
arXiv Detail & Related papers (2024-10-21T07:19:19Z)
Understanding the Role of LLMs in Multimodal Evaluation Benchmarks [77.59035801244278]
This paper investigates the role of the Large Language Model (LLM) backbone in Multimodal Large Language Models (MLLMs) evaluation. Our study encompasses four diverse MLLM benchmarks and eight state-of-the-art MLLMs. Key findings reveal that some benchmarks allow high performance even without visual inputs and up to 50% of error rates can be attributed to insufficient world knowledge in the LLM backbone.
arXiv Detail & Related papers (2024-10-16T07:49:13Z)
Large Language Models for Disease Diagnosis: A Scoping Review [29.32421735974035]
Automatic disease diagnosis has become increasingly valuable in clinical practice.<n>The advent of large language models (LLMs) has catalyzed a paradigm shift in artificial intelligence.<n>Despite the increasing attention in this field, a holistic view is still lacking.
arXiv Detail & Related papers (2024-08-27T02:06:45Z)
Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases. We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z)
XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare [16.79952669254101]
We introduce a knowledge-guided in-context learning framework to enable large language models to process structured clinical data.<n>Our approach integrates domain-specific feature groupings, carefully balanced few-shot examples, and task-specific prompting strategies.
arXiv Detail & Related papers (2024-05-10T06:52:44Z)
EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data. We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.