Beyond the Hype: Assessing the Performance, Trustworthiness, and
Clinical Suitability of GPT3.5
- URL: http://arxiv.org/abs/2306.15887v1
- Date: Wed, 28 Jun 2023 03:03:51 GMT
- Title: Beyond the Hype: Assessing the Performance, Trustworthiness, and
Clinical Suitability of GPT3.5
- Authors: Salmonn Talebi, Elizabeth Tong and Mohammad R. K. Mofrad
- Abstract summary: We present an approach to evaluate the performance and trustworthiness of a GPT3.5 model for medical image protocol assignment.
We compare it with a fine-tuned BERT model and a radiologist.
Our findings suggest that the GPT3.5 performance falls behind BERT and a radiologist.
- Score: 0.37501702548174976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The use of large language models (LLMs) in healthcare is gaining popularity,
but their practicality and safety in clinical settings have not been thoroughly
assessed. In high-stakes environments like medical settings, trust and safety
are critical issues for LLMs. To address these concerns, we present an approach
to evaluate the performance and trustworthiness of a GPT3.5 model for medical
image protocol assignment. We compare it with a fine-tuned BERT model and a
radiologist. In addition, we have a radiologist review the GPT3.5 output to
evaluate its decision-making process. Our evaluation dataset consists of 4,700
physician entries across 11 imaging protocol classes spanning the entire head.
Our findings suggest that the GPT3.5 performance falls behind BERT and a
radiologist. However, GPT3.5 outperforms BERT in its ability to explain its
decision, detect relevant word indicators, and model calibration. Furthermore,
by analyzing the explanations of GPT3.5 for misclassifications, we reveal
systematic errors that need to be resolved to enhance its safety and
suitability for clinical use.
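The abstract reports that GPT3.5 outperforms BERT on model calibration. Calibration is commonly quantified with expected calibration error (ECE); the sketch below is a generic illustration of that metric, not the paper's exact protocol, and all names in it are our own.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence, then average the gap between mean
    confidence and empirical accuracy per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # weight = fraction of samples in this bin
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece
```

A lower ECE means the model's stated confidence tracks its actual accuracy more closely, which is the sense in which one model can be "better calibrated" than another.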
Related papers
- Conformal Lesion Segmentation for 3D Medical Images [82.92159832699583]
We propose a risk-constrained framework that calibrates data-driven thresholds via conformalization to ensure the test-time FNR remains below a target tolerance.
We validate the statistical soundness and predictive performance of CLS on six 3D-LS datasets across five backbone models, and conclude with actionable insights for deploying risk-aware segmentation in clinical practice.
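Conformal calibration of a segmentation threshold can be sketched in split-conformal style: pick, per calibration case, the highest threshold whose false-negative rate stays within tolerance, then take a conservative quantile across cases. This is a loose illustration under our own simplifying assumptions, not the CLS procedure from the paper.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_masks, fnr_target=0.10, delta=0.10):
    """For each calibration case, find the highest probability threshold
    keeping that case's voxel-level FNR <= fnr_target; then return a
    conservative low quantile of those thresholds across cases."""
    per_case_t = []
    for probs, mask in zip(cal_probs, cal_masks):
        lesion_scores = np.sort(probs[mask.astype(bool)])
        # at most fnr_target of lesion voxels may fall below the threshold
        k = int(np.floor(fnr_target * len(lesion_scores)))
        per_case_t.append(lesion_scores[k])
    per_case_t.sort()
    # conformal-style conservative index over calibration cases
    j = int(np.floor(delta * (len(per_case_t) + 1)))
    return per_case_t[max(j - 1, 0)]
```

The key idea is that the threshold is chosen from held-out calibration data rather than tuned on the test set, which is what gives the FNR guarantee its statistical footing.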
arXiv Detail & Related papers (2025-10-19T08:21:00Z) - Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology [2.626353375402704]
General multimodal AI systems such as large language models (LLMs) and vision language models (VLMs) are increasingly accessed by clinicians and patients alike.
We developed a benchmark of 50 expert-level "spot diagnosis" cases across multiple imaging modalities.
We evaluated the performance of frontier AI models against board-certified radiologists and radiology trainees.
arXiv Detail & Related papers (2025-09-29T22:31:20Z) - Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight [1.0471566053937098]
GPT-5 is a large language model that has been specifically marketed towards oncology use.
On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%).
In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69).
While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.
arXiv Detail & Related papers (2025-08-29T16:55:25Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.
We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - LLMEval-Med: A Real-world Clinical Benchmark for Medical LLMs with Physician Validation [58.25892575437433]
Evaluating large language models (LLMs) in medicine is crucial because medical applications require high accuracy with little room for error.
We present LLMEval-Med, a new benchmark covering five core medical areas, including 2,996 questions created from real-world electronic health records and expert-designed clinical scenarios.
arXiv Detail & Related papers (2025-06-04T15:43:14Z) - Relation Extraction Using Large Language Models: A Case Study on Acupuncture Point Locations [12.632106431145047]
Generative Pre-trained Transformers (GPT) present a significant opportunity for extracting relations related to acupoint locations.
This study compares the performance of GPT with a traditional deep learning model (LSTM) and Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT).
Fine-tuned GPT-3.5 consistently outperformed other models in F1 scores across all relation types.
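The comparison above hinges on per-relation-type F1 scores. As a generic illustration of how such scores are computed from paired gold and predicted labels (our own simplified sketch, not the study's evaluation code):

```python
from collections import Counter

def f1_per_label(gold, pred):
    """Per-label precision, recall, and F1 from paired gold/predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        scores[label] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores
```

Reporting F1 per relation type, rather than a single aggregate, is what lets a claim like "outperformed across all relation types" be checked label by label.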
arXiv Detail & Related papers (2024-04-08T11:33:00Z) - Reshaping Free-Text Radiology Notes Into Structured Reports With Generative Transformers [0.29530625605275984]
Structured reporting (SR) has been recommended by various medical societies.
We propose a pipeline to extract information from free-text reports.
Our work aims to leverage the potential of Natural Language Processing (NLP) and Transformer-based models.
arXiv Detail & Related papers (2024-03-27T18:38:39Z) - How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses [14.884877292068351]
This study introduces a novel evaluation framework named GPTRadScore.
It assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings.
By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding.
arXiv Detail & Related papers (2024-03-08T21:16:28Z) - Leveraging Professional Radiologists' Expertise to Enhance LLMs'
Evaluation for Radiology Reports [22.599250713630333]
Our proposed method synergizes the expertise of professional radiologists with Large Language Models (LLMs).
Our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI generated reports.
Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a 0.48 score, outperforming the METEOR metric by 0.19.
arXiv Detail & Related papers (2024-01-29T21:24:43Z) - Holistic Evaluation of GPT-4V for Biomedical Imaging [113.46226609088194]
GPT-4V represents a breakthrough in artificial general intelligence for computer vision.
We assess GPT-4V's performance across 16 medical imaging categories, including radiology, oncology, ophthalmology, pathology, and more.
Results show GPT-4V's proficiency in modality and anatomy recognition but difficulty with disease diagnosis and localization.
arXiv Detail & Related papers (2023-11-10T18:40:44Z) - A Systematic Evaluation of GPT-4V's Multimodal Capability for Medical
Image Analysis [87.25494411021066]
GPT-4V's multimodal capability for medical image analysis is evaluated.
It is found that GPT-4V excels in understanding medical images and generates high-quality radiology reports.
It is found that its performance for medical visual grounding needs to be substantially improved.
arXiv Detail & Related papers (2023-10-31T11:39:09Z) - Exploring the Boundaries of GPT-4 in Radiology [46.30976153809968]
GPT-4 has a sufficient level of radiology knowledge with only occasional errors in complex context.
For findings summarisation, GPT-4 outputs are found to be overall comparable with existing manually-written impressions.
arXiv Detail & Related papers (2023-10-23T05:13:03Z) - DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT
Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z) - Capabilities of GPT-4 on Medical Challenge Problems [23.399857819743158]
GPT-4 is a general-purpose model that is neither specialized for medical problems through training nor engineered to solve clinical tasks.
We present a comprehensive evaluation of GPT-4 on medical competency examinations and benchmark datasets.
arXiv Detail & Related papers (2023-03-20T16:18:38Z) - Evaluating Psychological Safety of Large Language Models [72.88260608425949]
We designed unbiased prompts to evaluate the psychological safety of large language models (LLMs).
We tested five different LLMs using two personality tests: the Short Dark Triad (SD-3) and the Big Five Inventory (BFI).
Despite being instruction fine-tuned with safety metrics to reduce toxicity, InstructGPT, GPT-3.5, and GPT-4 still showed dark personality patterns.
Fine-tuning Llama-2-chat-7B with responses from BFI using direct preference optimization could effectively reduce the psychological toxicity of the model.
arXiv Detail & Related papers (2022-12-20T18:45:07Z) - Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.