A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry
- URL: http://arxiv.org/abs/2404.15777v4
- Date: Wed, 29 May 2024 15:50:43 GMT
- Title: A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry
- Authors: Yining Huang, Keke Tang, Meilian Chen, Boyuan Wang,
- Abstract summary: Large Language Models (LLMs) have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation.
This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare.
Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness.
- Score: 2.1717945745027425
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the inception of the Transformer architecture in 2017, Large Language Models (LLMs) such as GPT and BERT have evolved significantly, impacting various industries with their advanced capabilities in language understanding and generation. These models have shown potential to transform the medical field, highlighting the necessity for specialized evaluation frameworks to ensure their effective and ethical deployment. This comprehensive survey delineates the extensive application and requisite evaluation of LLMs within healthcare, emphasizing the critical need for empirical validation to fully exploit their capabilities in enhancing healthcare outcomes. Our survey is structured to provide an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness. We begin by exploring the roles of LLMs in various medical applications, detailing their evaluation based on performance in tasks such as clinical diagnosis, medical text data processing, information retrieval, data analysis, and educational content generation. The subsequent sections offer a comprehensive discussion on the evaluation methods and metrics employed, including models, evaluators, and comparative experiments. We further examine the benchmarks and datasets utilized in these evaluations, providing a categorized description of benchmarks for tasks like question answering, summarization, information extraction, bioinformatics, information retrieval and general comprehensive benchmarks. This structure ensures a thorough understanding of how LLMs are assessed for their effectiveness, accuracy, usability, and ethical alignment in the medical domain. ...
Related papers
- A Survey of Medical Vision-and-Language Applications and Their Techniques [48.268198631277315]
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data.
Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied.
We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics.
arXiv Detail & Related papers (2024-11-19T03:27:05Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - Large Language Model Benchmarks in Medical Tasks [11.196196955468992]
This paper presents a survey of various benchmark datasets employed in medical large language models (LLMs) tasks.
The survey categorizes the datasets by modality, discussing their significance, data structure, and impact on the development of LLMs.
The paper emphasizes the need for datasets with a greater degree of language diversity, structured omics data, and innovative approaches to synthesis.
arXiv Detail & Related papers (2024-10-28T11:07:33Z) - Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z) - MEDIC: Towards a Comprehensive Framework for Evaluating LLMs in Clinical Applications [2.838746648891565]
We introduce MEDIC, a framework assessing Large Language Models (LLMs) across five critical dimensions of clinical competence.
We apply MEDIC to evaluate LLMs on medical question-answering, safety, summarization, note generation, and other tasks.
Results show performance disparities across model sizes, baseline vs medically finetuned models, and have implications on model selection for applications requiring specific model strengths.
arXiv Detail & Related papers (2024-09-11T14:44:51Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - Evaluating large language models in medical applications: a survey [1.5923327069574245]
Large language models (LLMs) have emerged as powerful tools with transformative potential across numerous domains.
evaluating the performance of LLMs in medical contexts presents unique challenges due to the complex and critical nature of medical information.
arXiv Detail & Related papers (2024-05-13T05:08:33Z) - Attribute Structuring Improves LLM-Based Evaluation of Clinical Text
Summaries [62.32403630651586]
Large language models (LLMs) have shown the potential to generate accurate clinical text summaries, but still struggle with issues regarding grounding and evaluation.
Here, we explore a general mitigation framework using Attribute Structuring (AS), which structures the summary evaluation process.
AS consistently improves the correspondence between human annotations and automated metrics in clinical text summarization.
arXiv Detail & Related papers (2024-03-01T21:59:03Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large
Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - Large Language Models Illuminate a Progressive Pathway to Artificial
Healthcare Assistant: A Review [16.008511195589925]
Large language models (LLMs) have shown promising capabilities in mimicking human-level language comprehension and reasoning.
This paper provides a comprehensive review on the applications and implications of LLMs in medicine.
arXiv Detail & Related papers (2023-11-03T13:51:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.