Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis
- URL: http://arxiv.org/abs/2402.01730v1
- Date: Sun, 28 Jan 2024 09:25:12 GMT
- Title: Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis
- Authors: Dimitrios P. Panagoulias, Maria Virvou and George A. Tsihrintzis
- Abstract summary: Large language models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence technology.
We evaluate the correctness and accuracy of LLM-generated medical diagnosis with publicly available multimodal multiple-choice questions.
We explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology.
- Score: 2.4554686192257424
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) constitute a breakthrough state-of-the-art
Artificial Intelligence technology which is rapidly evolving and promises to
aid in medical diagnosis. However, the correctness and accuracy of their
outputs have not yet been properly evaluated. In this work, we propose an LLM
evaluation paradigm that incorporates two independent steps of a novel
methodology, namely (1) multimodal LLM evaluation via structured interactions
and (2) follow-up, domain-specific analysis based on data extracted via the
previous interactions. Using this paradigm, (1) we evaluate the correctness and
accuracy of LLM-generated medical diagnosis with publicly available multimodal
multiple-choice questions (MCQs) in the domain of Pathology and (2) proceed to a
systematic and comprehensive analysis of the extracted results. We used
GPT-4-Vision-Preview as the LLM to respond to complex medical questions
consisting of both images and text, and we explored a wide range of diseases,
conditions, chemical compounds, and related entity types that are included in
the vast knowledge domain of Pathology. GPT-4-Vision-Preview performed quite
well, achieving approximately 84% correct diagnoses. Next, we further
analyzed the findings of our work, following an analytical approach which
included Image Metadata Analysis, Named Entity Recognition and Knowledge
Graphs. Weaknesses of GPT-4-Vision-Preview were revealed on specific knowledge
paths, leading to a further understanding of its shortcomings in specific
areas. Our methodology and findings are not limited to GPT-4-Vision-Preview; a
similar approach can be followed to evaluate the usefulness and accuracy of
other LLMs and, thus, improve their use through further optimization.
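The two steps of this paradigm can be illustrated with short, self-contained sketches. The first sketch below shows step (1): posing one multimodal MCQ to GPT-4-Vision-Preview and scoring accuracy over a question set. It is a minimal sketch, not the authors' code; it assumes the openai Python SDK (v1.x), locally stored question images, and a simple letter-answer prompt, none of which are specified in the abstract.

```python
# Minimal sketch of step (1): one image+text MCQ sent to GPT-4-Vision-Preview,
# plus an accuracy score over a question set. Prompt format and data layout are
# assumptions for illustration.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def encode_image(path: str) -> str:
    """Base64-encode a local image for the data-URL payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


def ask_mcq(image_path: str, question: str, choices: list[str]) -> str:
    """Send one multimodal multiple-choice question and return the model's letter answer."""
    option_text = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(choices))
    prompt = (
        f"{question}\n{option_text}\n"
        "Answer with the single letter of the most likely diagnosis."
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
        max_tokens=5,
    )
    return response.choices[0].message.content.strip()[0]  # e.g. "B"


def score(questions: list[tuple[str, str, list[str], str]]) -> float:
    """Accuracy over (image_path, question, choices, correct_letter) records."""
    correct = sum(ask_mcq(p, q, c) == ans for p, q, c, ans in questions)
    return correct / len(questions)
```

Under these assumptions, the roughly 84% figure reported above corresponds to the value score() would return over the full Pathology MCQ set.

Step (2), the follow-up domain-specific analysis, names Named Entity Recognition and Knowledge Graphs but not the tooling. The second sketch assumes spaCy (a general English model standing in for a biomedical one), networkx, and a hypothetical failed_items record format, to show how error-prone knowledge paths could be surfaced from incorrectly answered questions.

```python
# Minimal sketch of step (2): extract entities from wrongly answered questions
# and link them in a small knowledge graph so that weak knowledge paths cluster
# visibly. Library and data-format choices are assumptions for illustration.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")  # a biomedical NER model would fit better


def build_error_graph(failed_items: list[dict]) -> nx.Graph:
    """failed_items: [{"question": str, "topic": str}, ...] for incorrectly answered MCQs."""
    graph = nx.Graph()
    for item in failed_items:
        entities = {ent.text.lower() for ent in nlp(item["question"]).ents}
        # Connect each extracted entity to the item's topic node (e.g. an organ
        # system), and entities to each other, so dense neighbourhoods mark weak areas.
        for ent in entities:
            graph.add_edge(item["topic"], ent)
        ents = sorted(entities)
        for i, a in enumerate(ents):
            for b in ents[i + 1:]:
                graph.add_edge(a, b)
    return graph


def weakest_concepts(graph: nx.Graph, k: int = 10) -> list[tuple[str, int]]:
    """Nodes with the highest degree are the concepts most often involved in errors."""
    return sorted(graph.degree, key=lambda kv: kv[1], reverse=True)[:k]
```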
Related papers
- Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.
Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.
We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language model (LLM) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
- Large Language Models for Disease Diagnosis: A Scoping Review [29.498658795329977]
The advent of large language models (LLMs) has catalyzed a paradigm shift in artificial intelligence.
Despite the growing attention in this field, many critical research questions remain under-explored.
This scoping review examined the types of diseases, associated organ systems, relevant clinical data, LLM techniques, and evaluation methods reported in existing studies.
arXiv Detail & Related papers (2024-08-27T02:06:45Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- MedExQA: Medical Question Answering Benchmark with Multiple Explanations [2.2246416434538308]
This paper introduces MedExQA, a novel benchmark in medical question-answering to evaluate large language models' (LLMs) understanding of medical knowledge through explanations.
By constructing datasets across five distinct medical specialties, we address a major gap in current medical QA benchmarks.
Our work highlights the importance of explainability in medical LLMs, proposes an effective methodology for evaluating models beyond classification accuracy, and sheds light on one specific domain, speech-language pathology.
arXiv Detail & Related papers (2024-06-10T14:47:04Z)
- Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes [17.648021186810663]
The purpose of this study was to evaluate the performance of Large Language Models (LLMs) in understanding and processing real-world clinical notes.
The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities.
arXiv Detail & Related papers (2024-01-24T16:52:37Z)
- Large Language Models in Medical Term Classification and Unexpected Misalignment Between Response and Reasoning [28.355000184014084]
This study assesses the ability of state-of-the-art large language models (LLMs) to identify patients with mild cognitive impairment (MCI) from discharge summaries.
The data was partitioned into training, validation, and testing sets in a 7:2:1 ratio for model fine-tuning and evaluation.
Open-source models like Falcon and LLaMA 2 achieved high accuracy but lacked explanatory reasoning.
arXiv Detail & Related papers (2023-12-19T17:36:48Z)
- Ophtha-LLaMA2: A Large Language Model for Ophthalmology [31.39653268440651]
Large language models (LLMs) have achieved tremendous success in the field of Natural Language Processing (NLP).
In this study, we build an LLM termed the "Ophtha-LLaMA2" specifically tailored for ophthalmic disease diagnosis.
Inference test results show that even with a smaller fine-tuning dataset, Ophtha-LLaMA2 performs significantly better in ophthalmic diagnosis.
arXiv Detail & Related papers (2023-12-08T08:43:46Z)
- Validating polyp and instrument segmentation methods in colonoscopy through Medico 2020 and MedAI 2021 Challenges [58.32937972322058]
We report on the "Medico automatic polyp segmentation (Medico 2020)" and "MedAI: Transparency in Medical Image (MedAI 2021)" competitions.
We present a comprehensive summary, analyze each contribution, highlight the strengths of the best-performing methods, and discuss the potential for clinical translation of such methods.
arXiv Detail & Related papers (2023-07-30T16:08:45Z)
- MIMO: Mutual Integration of Patient Journey and Medical Ontology for Healthcare Representation Learning [49.57261599776167]
We propose an end-to-end robust Transformer-based solution, Mutual Integration of patient journey and Medical Ontology (MIMO) for healthcare representation learning and predictive analytics.
arXiv Detail & Related papers (2021-07-20T07:04:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.