BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
- URL: http://arxiv.org/abs/2504.19467v2
- Date: Thu, 01 May 2025 02:21:09 GMT
- Title: BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text
- Authors: Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H. Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, Jie Yang,
- Abstract summary: Large language models (LLMs) hold great promise for medical applications and are evolving rapidly. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text. We present BRIDGE, a comprehensive benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages.
- Score: 10.071956824618418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. Across a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform more recent general-purpose models. BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding. The BRIDGE leaderboard: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
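As a rough illustration of the kind of evaluation the abstract describes (running an LLM over a clinical text task under different inference strategies), the sketch below shows a minimal zero-shot vs. chain-of-thought harness. The task, example notes, prompt templates, model name, and scoring rule are illustrative assumptions for this sketch, not BRIDGE's actual tasks, prompts, or metrics.

```python
# Minimal sketch (not the authors' code) of evaluating one LLM on one clinical
# classification task under two inference strategies. All task data and prompts
# below are invented for illustration.
from transformers import pipeline

# Hypothetical task: binary classification of clinical notes.
examples = [
    {"text": "Patient admitted with chest pain; troponin elevated.", "label": "positive"},
    {"text": "Routine follow-up visit, no acute findings.", "label": "negative"},
]

PROMPTS = {
    "zero_shot": (
        "Classify the clinical note as 'positive' or 'negative' for an acute cardiac event.\n"
        "Note: {text}\nAnswer:"
    ),
    "chain_of_thought": (
        "Classify the clinical note as 'positive' or 'negative' for an acute cardiac event. "
        "Think step by step, then state the final answer.\n"
        "Note: {text}\nReasoning and answer:"
    ),
}

# Any open instruction-tuned LLM; this small model is only a placeholder.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

def evaluate(strategy: str) -> float:
    """Run one inference strategy over the task and return simple accuracy."""
    correct = 0
    for ex in examples:
        prompt = PROMPTS[strategy].format(text=ex["text"])
        out = generator(prompt, max_new_tokens=128, do_sample=False)[0]["generated_text"]
        completion = out[len(prompt):].lower()  # keep only the generated continuation
        # Naive answer parsing; a real benchmark would use task-specific parsing and metrics.
        pred = "positive" if "positive" in completion else "negative"
        correct += int(pred == ex["label"])
    return correct / len(examples)

for strategy in PROMPTS:
    print(strategy, evaluate(strategy))
```

Scaling this sketch to BRIDGE-style coverage would mean looping over many tasks, languages, and models and aggregating task-appropriate metrics (e.g., F1 for extraction, ROUGE for summarization) rather than simple accuracy.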
Related papers
- Large Language Models are Powerful EHR Encoders [4.520903886487343]
Domain-specific EHR foundation models have demonstrated promising improvements in predictive accuracy and generalization. We explore the possibility of using general-purpose Large Language Model (LLM)-based embedding methods as EHR encoders. We evaluate two state-of-the-art LLM-embedding models, GTE-Qwen2-7B-Instruct and LLM2Vec-Llama3.1-8B-Instruct, across 15 diverse clinical prediction tasks.
arXiv Detail & Related papers (2025-02-24T18:30:36Z) - Leveraging Large Language Models for Medical Information Extraction and Query Generation [2.1793134762413433]
This paper introduces a system that integrates large language models (LLMs) into the clinical trial retrieval process.
We evaluate six LLMs for query generation, focusing on open-source and relatively small models that require minimal computational resources.
arXiv Detail & Related papers (2024-10-31T12:01:51Z) - MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications [2.838746648891565]
We introduce MEDIC, a framework assessing Large Language Models (LLMs) across five critical dimensions of clinical competence.
We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks.
Results show performance disparities across model sizes and between baseline and medically fine-tuned models, with implications for model selection in applications requiring specific model strengths.
arXiv Detail & Related papers (2024-09-11T14:44:51Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The accompanying instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is, to date, the most comprehensive general medical AI benchmark, with a well-categorized data structure and multiple perceptual granularities.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build ClinicBench, a benchmark to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - Does Biomedical Training Lead to Better Medical Performance? [2.3814275542331385]
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes.
This study investigates the effect of biomedical training in the context of six practical medical tasks, evaluating 25 models.
arXiv Detail & Related papers (2024-04-05T12:51:37Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
Inference with LLaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor (as the player) and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
However, these models often require substantial amounts of medical text data and generalize poorly.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - Do We Still Need Clinical Language Models? [15.023633270864675]
We show that relatively small specialized clinical models substantially outperform all in-context learning approaches.
We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.
arXiv Detail & Related papers (2023-02-16T05:08:34Z)