Do Physicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation
- URL: http://arxiv.org/abs/2311.09684v3
- Date: Fri, 5 Jul 2024 09:14:11 GMT
- Title: Do Physicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation
- Authors: Zonghai Yao, Ahmed Jaafar, Beining Wang, Zhichao Yang, Hong Yu
- Abstract summary: We introduce an Automatic Prompt Optimization framework to refine initial prompts and compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT3.5 and GPT4.
Results highlight GPT4 APO's superior performance in standardizing prompt quality across clinical note sections.
A human-in-the-loop approach shows that experts maintain content quality post-APO, with a preference for their own modifications.
- Score: 7.434565130637974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study examines the effect of prompt engineering on the performance of Large Language Models (LLMs) in clinical note generation. We introduce an Automatic Prompt Optimization (APO) framework to refine initial prompts and compare the outputs of medical experts, non-medical experts, and APO-enhanced GPT3.5 and GPT4. Results highlight GPT4 APO's superior performance in standardizing prompt quality across clinical note sections. A human-in-the-loop approach shows that experts maintain content quality post-APO, with a preference for their own modifications, suggesting the value of expert customization. We recommend a two-phase optimization process, leveraging APO-GPT4 for consistency and expert input for personalization.
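The recommended two-phase process lends itself to a simple loop: an optimizer model critiques and rewrites the prompt, then a clinician edits the result. Below is a minimal, hypothetical sketch of that flow in Python; `call_llm`, the critique template, and the pass-through expert step are illustrative stand-ins, not the paper's implementation.

```python
# Hypothetical sketch of the two-phase prompt optimization the abstract
# recommends: an APO loop refines the prompt automatically (phase 1),
# then a human expert personalizes the result (phase 2).

def call_llm(prompt: str) -> str:
    """Stand-in for a real LLM call (e.g. GPT-4 via an API client)."""
    return f"<model output for: {prompt[:40]}...>"

CRITIQUE_TEMPLATE = (
    "You are a prompt engineer. Critique this prompt for generating the "
    "'{section}' section of a clinical note, then rewrite it.\n\n"
    "PROMPT:\n{prompt}\n\nReturn only the improved prompt."
)

def apo_refine(initial_prompt: str, section: str, rounds: int = 3) -> str:
    """Phase 1: each round asks the optimizer model to critique and
    rewrite the current prompt."""
    prompt = initial_prompt
    for _ in range(rounds):
        prompt = call_llm(CRITIQUE_TEMPLATE.format(section=section, prompt=prompt))
    return prompt

def expert_customize(apo_prompt: str) -> str:
    """Phase 2: human-in-the-loop; in practice an expert edits this text."""
    return apo_prompt

if __name__ == "__main__":
    draft = "Summarize the dialogue into the Assessment and Plan."
    print(expert_customize(apo_refine(draft, section="Assessment and Plan")))
```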
Related papers
- Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.
We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
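Read loosely, the uncertainty-driven control above amounts to a gate: outputs whose estimated uncertainty crosses a threshold are routed to an expert rather than returned directly. A rough sketch under that reading; the token-entropy estimate and threshold are assumptions, not the paper's method.

```python
import math

# Illustrative sketch of uncertainty-gated expert review: low-confidence
# model outputs are flagged for an expert instead of being returned as-is.

def mean_token_entropy(token_probs: list[list[float]]) -> float:
    """Average Shannon entropy over per-token probability distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0.0)
        for dist in token_probs
    ]
    return sum(entropies) / len(entropies)

def answer_with_expert_gate(answer: str, token_probs, threshold: float = 1.0):
    """Return the answer directly when confident; otherwise mark it for
    expert review (standing in for expert-guided correction)."""
    if mean_token_entropy(token_probs) <= threshold:
        return answer, "auto"
    return answer, "needs_expert_review"

if __name__ == "__main__":
    confident = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
    print(answer_with_expert_gate("No acute findings.", confident))
```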
- A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment [46.776978552161395]
Small language models (SLMs) offer a cost-effective alternative to large language models such as GPT-4, but their limited capacity requires biomedical domain adaptation.
We propose a novel framework for adapting SLMs into high-performing clinical models.
arXiv Detail & Related papers (2025-05-15T21:40:21Z)
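The model-merging step above can be pictured as parameter-wise interpolation between a general checkpoint and a domain-adapted one. A toy, dependency-free sketch; the flat float dictionaries and the 0.5 mix ratio are illustrative assumptions, not the paper's recipe.

```python
# Toy illustration of model merging via linear parameter interpolation.
# Real checkpoints would be tensors; flat float dicts keep the sketch
# dependency-free.

def merge_checkpoints(general: dict[str, float],
                      clinical: dict[str, float],
                      alpha: float = 0.5) -> dict[str, float]:
    """Parameter-wise: merged = (1 - alpha) * general + alpha * clinical."""
    assert general.keys() == clinical.keys()
    return {
        name: (1.0 - alpha) * general[name] + alpha * clinical[name]
        for name in general
    }

if __name__ == "__main__":
    base = {"layer0.w": 0.20, "layer0.b": -0.10}
    tuned = {"layer0.w": 0.60, "layer0.b": 0.30}
    print(merge_checkpoints(base, tuned, alpha=0.5))
```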
- GPT-4 on Clinic Depression Assessment: An LLM-Based Pilot Study [0.6999740786886538]
We explore the use of GPT-4 for clinical depression assessment based on transcript analysis.
We examine the model's ability to classify patient interviews into binary categories: depressed and not depressed.
Results indicate that GPT-4 exhibits considerable variability in accuracy and F1-Score across configurations.
arXiv Detail & Related papers (2024-12-31T00:32:43Z)
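A transcript-level binary classifier of this kind reduces to a single constrained chat call. A minimal sketch using the OpenAI Python client; the prompt wording and label parsing are assumptions, not the study's exact configuration.

```python
# Sketch of transcript-level binary classification with GPT-4, in the
# spirit of the pilot study above.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_depression(transcript: str) -> str:
    """Ask GPT-4 to label an interview transcript as depressed / not depressed."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You label clinical interview transcripts. "
                        "Answer with exactly one word: depressed or not_depressed."},
            {"role": "user", "content": transcript},
        ],
    )
    label = response.choices[0].message.content.strip().lower()
    return label if label in {"depressed", "not_depressed"} else "uncertain"
```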
- Uncertainty-Penalized Direct Preference Optimization [52.387088396044206]
We develop a pessimistic framework for DPO by introducing preference uncertainty penalization schemes.
The penalization serves as a correction to the loss which attenuates the loss gradient for uncertain samples.
We show improved overall performance compared to vanilla DPO, as well as better completions on prompts from high-uncertainty chosen/rejected responses.
arXiv Detail & Related papers (2024-10-26T14:24:37Z)
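Schematically, the penalization can be read as shifting the standard DPO margin by an uncertainty term, which attenuates the gradient on unreliable pairs. The exact penalty in the paper may differ; this is an illustrative form:

```latex
% Standard DPO loss on a preference pair (x, y_w, y_l), with implicit
% reward margin Delta_theta:
\Delta_\theta(x, y_w, y_l)
  = \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
\qquad
\mathcal{L}_{\mathrm{DPO}} = -\log \sigma\big(\beta\, \Delta_\theta\big)

% Schematic uncertainty-penalized variant: a penalty proportional to the
% estimated preference uncertainty u(x, y_w, y_l) shrinks the margin,
% attenuating the loss gradient on uncertain samples:
\mathcal{L}_{\mathrm{UP\text{-}DPO}}
  = -\log \sigma\big(\beta\, \Delta_\theta - \lambda\, u(x, y_w, y_l)\big)
```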
- Fine Tuning Large Language Models for Medicine: The Role and Importance of Direct Preference Optimization [2.096816583842973]
Two of the most common methods of fine-tuning are Supervised Fine Tuning (SFT) and Direct Preference Optimization (DPO).
We compare the performance of SFT and DPO for five common natural language tasks in medicine.
We find that SFT alone is sufficient for Classification with text data, whereas DPO improves performance for the more complex tasks of Clinical Reasoning, Summarization and Clinical Triage.
arXiv Detail & Related papers (2024-09-19T13:03:24Z)
- RuleAlign: Making Large Language Models Better Physicians with Diagnostic Rule Alignment [54.91736546490813]
We introduce the RuleAlign framework, designed to align Large Language Models with specific diagnostic rules.
We develop a medical dialogue dataset comprising rule-based communications between patients and physicians.
Experimental results demonstrate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-08-22T17:44:40Z)
- High-Throughput Phenotyping of Clinical Text Using Large Language Models [0.0]
GPT-4 surpasses GPT-3.5-Turbo in identifying, categorizing, and normalizing signs.
GPT-4 results in high performance and generalizability across several phenotyping tasks.
arXiv Detail & Related papers (2024-08-02T12:00:00Z)
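A phenotyping pipeline of this kind is essentially an extraction prompt plus tolerant output parsing. A minimal sketch; the instruction wording and JSON schema are assumptions, not the paper's setup.

```python
# Illustrative prompt for high-throughput phenotyping: extract sign/symptom
# mentions from a note and normalize each to a standard vocabulary term.
import json

PHENOTYPE_PROMPT = """Extract every sign or symptom mentioned in the clinical
note below. For each, return a JSON object with:
  "mention": the exact text span,
  "category": one of ["sign", "symptom"],
  "normalized": a standard vocabulary term (e.g. an HPO label).
Return a JSON list only.

NOTE:
{note}"""

def parse_phenotypes(model_output: str) -> list[dict]:
    """Parse the model's JSON list, tolerating malformed output."""
    try:
        items = json.loads(model_output)
        return items if isinstance(items, list) else []
    except json.JSONDecodeError:
        return []

# Usage: send PHENOTYPE_PROMPT.format(note=note_text) to the model,
# then run parse_phenotypes on the completion.
```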
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level [50.897438358317686]
We show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.
arXiv Detail & Related papers (2024-06-17T17:55:38Z)
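One schematic way to write length regularization on top of DPO is to offset the preference margin by the length difference between the chosen and rejected responses, so a completion cannot win merely by being longer (with the margin Δ_θ as in the uncertainty-penalized sketch above). The paper's exact objective may differ:

```latex
% Schematic length-regularized DPO objective: the preference margin is
% offset by a verbosity penalty alpha * (|y_w| - |y_l|). Iterating the
% procedure (sample fresh pairs from the current policy, re-optimize)
% yields the iterative variant.
\mathcal{L}_{\mathrm{LR\text{-}DPO}}
  = -\log \sigma\Big(\beta\, \Delta_\theta(x, y_w, y_l)
      \;-\; \alpha \big(|y_w| - |y_l|\big)\Big)
```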
- FIPO: Free-form Instruction-oriented Prompt Optimization with Preference Dataset and Modular Fine-tuning Schema [36.65009632307124]
We propose Free-form Instruction-oriented Prompt Optimization (FIPO) to improve the task performance of large language models (LLMs).
FIPO uses a modular APO template that dynamically integrates the naive task instruction, optional instruction responses, and optional ground truth to produce finely optimized prompts.
We validate the FIPO framework across five public benchmarks and six testing models.
arXiv Detail & Related papers (2024-02-19T03:56:44Z)
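The modular-template idea can be sketched as conditional prompt assembly: the meta-prompt always carries the naive instruction and appends the optional response and ground truth only when available. The section wording below is an assumption, not FIPO's released template.

```python
# Sketch of a modular prompt-optimization template in the spirit of FIPO.
from typing import Optional

def build_meta_prompt(naive_instruction: str,
                      response: Optional[str] = None,
                      ground_truth: Optional[str] = None) -> str:
    """Assemble the meta-prompt from whichever components are available."""
    parts = [
        "Rewrite the task instruction below so a language model follows it "
        "more reliably.",
        f"### Naive instruction\n{naive_instruction}",
    ]
    if response is not None:
        parts.append(f"### Model response to the naive instruction\n{response}")
    if ground_truth is not None:
        parts.append(f"### Ground-truth answer\n{ground_truth}")
    parts.append("### Optimized instruction")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(build_meta_prompt("Summarize the visit note."))
```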
- Enhancing Medical Task Performance in GPT-4V: A Comprehensive Study on Prompt Engineering Strategies [28.98518677093905]
GPT-4V, OpenAI's latest large vision-language model, has piqued considerable interest for its potential in medical applications.
Recent studies and internal reviews highlight its underperformance in specialized medical tasks.
This paper explores the boundary of GPT-4V's capabilities in medicine, particularly in processing complex imaging data from endoscopies, CT scans, and MRIs.
arXiv Detail & Related papers (2023-12-07T15:05:59Z)
- Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [89.46836590149883]
We build on a prior study of GPT-4's capabilities on medical challenge benchmarks in the absence of special training.
We find that prompting innovation can unlock deeper specialist capabilities and show that GPT-4 easily tops prior leading results for medical benchmarks.
With Medprompt, GPT-4 achieves state-of-the-art results on all nine of the benchmark datasets in the MultiMedQA suite.
arXiv Detail & Related papers (2023-11-28T03:16:12Z)
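One published ingredient of Medprompt is choice-shuffling ensembling: the multiple-choice question is asked several times with the options permuted, and the selected option texts are majority-voted. A minimal sketch assuming a generic `ask_model` callable:

```python
# Choice-shuffling ensemble for multiple-choice medical QA, one component
# of Medprompt-style prompting. `ask_model` stands in for any chat model.
import random
from collections import Counter
from typing import Callable

def shuffled_vote(question: str, options: list[str],
                  ask_model: Callable[[str], str], k: int = 5) -> str:
    """Ask k times with shuffled options; majority-vote the option text."""
    votes: Counter = Counter()
    for _ in range(k):
        shuffled = random.sample(options, len(options))
        letters = "ABCDE"[: len(shuffled)]
        prompt = question + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in zip(letters, shuffled)
        ) + "\nAnswer with a single letter."
        answer = ask_model(prompt).strip().upper()[:1]
        if answer in letters:
            votes[shuffled[letters.index(answer)]] += 1
    return votes.most_common(1)[0][0] if votes else options[0]
```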
- ExpertPrompting: Instructing Large Language Models to be Distinguished Experts [93.58012324415762]
ExpertPrompting elicits the potential of large language models to answer as distinguished experts.
We produce a new set of instruction-following data using GPT-3.5, and train a competitive open-source chat assistant called ExpertLLaMA.
arXiv Detail & Related papers (2023-05-24T03:51:31Z)
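ExpertPrompting-style augmentation is a two-step prompt: first ask the model to describe the ideal expert for an instruction, then answer as that expert. The two templates below are paraphrases assumed for this sketch, not the paper's exact wording:

```python
# Sketch of expert-identity prompt augmentation in the ExpertPrompting
# style. `call_llm` stands in for any chat-model call.

IDENTITY_TEMPLATE = (
    "Describe, in one paragraph, the ideal expert who could best answer "
    "the following instruction:\n{instruction}"
)

ANSWER_TEMPLATE = (
    "{expert_identity}\n\nAs the expert described above, answer:\n{instruction}"
)

def expert_prompt(instruction: str, call_llm) -> str:
    """Generate an expert identity, then build the expert-conditioned prompt."""
    identity = call_llm(IDENTITY_TEMPLATE.format(instruction=instruction))
    return ANSWER_TEMPLATE.format(expert_identity=identity,
                                  instruction=instruction)
```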
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.