Large Language Models in Medical Term Classification and Unexpected
Misalignment Between Response and Reasoning
- URL: http://arxiv.org/abs/2312.14184v1
- Date: Tue, 19 Dec 2023 17:36:48 GMT
- Title: Large Language Models in Medical Term Classification and Unexpected
Misalignment Between Response and Reasoning
- Authors: Xiaodan Zhang, Sandeep Vemulapalli, Nabasmita Talukdar, Sumyeong Ahn,
Jiankun Wang, Han Meng, Sardar Mehtab Bin Murtaza, Aakash Ajay Dave, Dmitry
Leshchiner, Dimitri F. Joseph, Martin Witteveen-Lane, Dave Chesla, Jiayu
Zhou, and Bin Chen
- Abstract summary: This study assesses the ability of state-of-the-art large language models (LLMs) to identify patients with mild cognitive impairment (MCI) from discharge summaries.
The data was partitioned into training, validation, and testing sets in a 7:2:1 ratio for model fine-tuning and evaluation.
Open-source models like Falcon and LLaMA 2 achieved high accuracy but lacked explanatory reasoning.
- Score: 28.355000184014084
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This study assesses the ability of state-of-the-art large language models
(LLMs) including GPT-3.5, GPT-4, Falcon, and LLaMA 2 to identify patients with
mild cognitive impairment (MCI) from discharge summaries and examines instances
where the models' responses were misaligned with their reasoning. Utilizing the
MIMIC-IV v2.2 database, we focused on a cohort aged 65 and older, verifying MCI
diagnoses against ICD codes and expert evaluations. The data was partitioned
into training, validation, and testing sets in a 7:2:1 ratio for model
fine-tuning and evaluation, with an additional metastatic cancer dataset from
MIMIC-III used to further assess reasoning consistency. GPT-4 demonstrated
superior interpretative capabilities, particularly in response to complex
prompts, yet displayed notable response-reasoning inconsistencies. In contrast,
open-source models like Falcon and LLaMA 2 achieved high accuracy but lacked
explanatory reasoning, underscoring the necessity for further research to
optimize both performance and interpretability. The study emphasizes the
significance of prompt engineering and the need for further exploration into
the unexpected reasoning-response misalignment observed in GPT-4. The results
underscore the promise of incorporating LLMs into healthcare diagnostics,
contingent upon methodological advancements to ensure accuracy and clinical
coherence of AI-generated outputs, thereby improving the trustworthiness of
LLMs for medical decision-making.
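
The response-reasoning misalignment at the core of the abstract can be screened for automatically: parse the model's final label and its free-text rationale separately, infer which label the rationale supports, and flag disagreements. The sketch below is a minimal illustration of that idea, not the paper's actual protocol; the Answer/Reasoning reply format, the `parse_output`/`is_misaligned` helpers, and the keyword heuristic are all assumptions introduced here.

```python
# Minimal sketch of a response-reasoning consistency check for MCI
# classification. Assumes the model was prompted to reply in the form
# "Answer: yes|no" followed by "Reasoning: ...". The keyword heuristic
# is a deliberately crude stand-in for a real rationale classifier.
import re

def parse_output(text: str) -> tuple[str, str]:
    """Split a model reply into its final label and its rationale."""
    answer = re.search(r"Answer:\s*(yes|no)", text, re.I)
    reasoning = re.search(r"Reasoning:\s*(.+)", text, re.I | re.S)
    if not (answer and reasoning):
        raise ValueError("unparseable model output")
    return answer.group(1).lower(), reasoning.group(1).strip()

def label_implied_by_reasoning(reasoning: str) -> str:
    """Crude heuristic: which label does the rationale actually support?"""
    negated = re.search(
        r"\b(no|denies|without|absent)\b[^.]*\b(mci|cognitive|memory)",
        reasoning, re.I,
    )
    mentioned = re.search(
        r"\b(mci|cognitive impairment|memory deficit)", reasoning, re.I
    )
    if negated:
        return "no"
    return "yes" if mentioned else "no"

def is_misaligned(model_output: str) -> bool:
    """True when the stated answer contradicts its own rationale."""
    answer, reasoning = parse_output(model_output)
    return answer != label_implied_by_reasoning(reasoning)

# Example of the failure mode the paper reports: a rationale that
# supports MCI paired with a final answer of "no".
reply = ("Answer: no\nReasoning: The note documents progressive "
         "memory deficits consistent with MCI.")
print(is_misaligned(reply))  # True
```

In practice the keyword heuristic would be replaced by a second model call or a trained rationale classifier; the point is only that the label and the rationale are judged independently and then compared.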
Related papers
- A Comparative Study of Recent Large Language Models on Generating Hospital Discharge Summaries for Lung Cancer Patients [19.777109737517996]
This research aims to explore how large language models (LLMs) can alleviate the burden of manual summarization.
This study evaluates the performance of multiple LLMs, including GPT-3.5, GPT-4, GPT-4o, and LLaMA 3 8B, in generating discharge summaries.
arXiv Detail & Related papers (2024-11-06T10:02:50Z)
- Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety.
This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z)
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language model (LLM) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated the performance of Gemini, GPT-4, and four other popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
- SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge.
We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
- Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis [2.4554686192257424]
Large language models (LLMs) constitute a breakthrough, state-of-the-art artificial intelligence technology.
We evaluate the correctness and accuracy of LLM-generated medical diagnosis with publicly available multimodal multiple-choice questions.
We explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology.
arXiv Detail & Related papers (2024-01-28T09:25:12Z)
- Evaluating the Fairness of the MIMIC-IV Dataset and a Baseline Algorithm: Application to the ICU Length of Stay Prediction [65.268245109828]
This paper uses the MIMIC-IV dataset to examine the fairness and bias in an XGBoost binary classification model predicting the ICU length of stay.
The research reveals class imbalances in the dataset across demographic attributes and employs data preprocessing and feature extraction.
The paper concludes with recommendations for fairness-aware machine learning techniques for mitigating biases and the need for collaborative efforts among healthcare professionals and data scientists.
arXiv Detail & Related papers (2023-12-31T16:01:48Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and to check its own outputs (a minimal sketch follows this list).
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
- MIMO: Mutual Integration of Patient Journey and Medical Ontology for Healthcare Representation Learning [49.57261599776167]
We propose an end-to-end robust Transformer-based solution, Mutual Integration of patient journey and Medical Ontology (MIMO) for healthcare representation learning and predictive analytics.
arXiv Detail & Related papers (2021-07-20T07:04:52Z)
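
As flagged in the self-verification entry above, the two-pass extract-then-verify pattern lends itself to a short sketch. Everything below is an assumption for illustration: `query_llm` stands in for whichever chat-completion API is used, the medication-extraction task is an arbitrary example, and the verbatim-quote check is a simplification of that paper's provenance step.

```python
# Minimal sketch of two-pass self-verification for clinical information
# extraction. Pass 1 extracts candidate items; pass 2 asks the same model
# to quote supporting evidence, and items whose quoted evidence does not
# appear verbatim in the note are dropped.
from typing import Callable, List

def extract_with_verification(note: str,
                              query_llm: Callable[[str], str]) -> List[str]:
    # Pass 1: few-shot extraction (few-shot examples elided for brevity).
    raw = query_llm(
        "List every medication mentioned in the clinical note, "
        "one per line:\n" + note
    )
    candidates = [line.lstrip("- ").strip()
                  for line in raw.splitlines() if line.strip()]

    verified = []
    for item in candidates:
        # Pass 2: demand provenance for each extracted item.
        evidence = query_llm(
            f'Quote the exact sentence from the note that mentions "{item}". '
            f"Reply NONE if it is not mentioned.\n\nNote:\n{note}"
        )
        # Grounding check: keep the item only if the quoted evidence
        # really occurs in the source text.
        if evidence.strip() != "NONE" and evidence.strip() in note:
            verified.append(item)
    return verified
```

The verbatim-containment test is the strictest possible grounding check; a real pipeline would typically relax it to fuzzy or span-level matching.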