Considerations for health care institutions training large language
models on electronic health records
- URL: http://arxiv.org/abs/2309.12339v1
- Date: Thu, 24 Aug 2023 00:09:01 GMT
- Title: Considerations for health care institutions training large language
models on electronic health records
- Authors: Weipeng Zhou, Danielle Bitterman, Majid Afshar, Timothy A. Miller
- Abstract summary: Large language models (LLMs) like ChatGPT have excited scientists across fields.
In medicine, one source of excitement is the potential applications of LLMs trained on electronic health record (EHR) data.
But there are tough questions we must first answer if health care institutions are interested in having LLMs trained on their own data.
- Score: 7.048517095805301
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) like ChatGPT have excited scientists across
fields; in medicine, one source of excitement is the potential applications of
LLMs trained on electronic health record (EHR) data. But there are tough
questions we must first answer if health care institutions are interested in
having LLMs trained on their own data: should they train an LLM from scratch or
fine-tune it from an open-source model? For healthcare institutions with a
predefined budget, what are the biggest LLMs they can afford? In this study, we
take steps towards answering these questions with an analysis on dataset sizes,
model sizes, and costs for LLM training using EHR data. This analysis provides
a framework for thinking about these questions in terms of data scale, compute
scale, and training budgets.
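As a rough illustration of how data scale, compute scale, and budget interact (not an estimate from the paper itself), the sketch below applies the commonly used 6 × parameters × tokens FLOPs approximation together with a Chinchilla-style tokens-per-parameter rule of thumb; the GPU throughput and hourly price are hypothetical placeholders.

```python
# Back-of-the-envelope training-cost estimate for an LLM trained on EHR text.
# Assumptions (not from the paper): training FLOPs ~= 6 * params * tokens,
# a Chinchilla-style ~20 tokens-per-parameter budget, and placeholder
# GPU throughput and hourly price.

def training_cost(params: float, tokens: float,
                  gpu_flops_per_s: float = 1.5e14,  # sustained FLOP/s per GPU (placeholder)
                  usd_per_gpu_hour: float = 2.0):   # cloud price per GPU-hour (placeholder)
    flops = 6.0 * params * tokens                   # dense-transformer approximation
    gpu_hours = flops / gpu_flops_per_s / 3600.0
    return gpu_hours, gpu_hours * usd_per_gpu_hour

for n_params in (1e9, 7e9, 70e9):
    n_tokens = 20 * n_params                        # compute-optimal rule of thumb
    hours, usd = training_cost(n_params, n_tokens)
    print(f"{n_params/1e9:5.0f}B params, {n_tokens/1e9:6.0f}B tokens "
          f"-> ~{hours:,.0f} GPU-hours, ~${usd:,.0f}")
```

With these placeholder figures a compute-optimal 7B-parameter model works out to roughly ten thousand GPU-hours; substituting an institution's actual EHR data volume, hardware throughput, and pricing turns this into the kind of data-scale, compute-scale, and budget comparison the paper analyzes.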
Related papers
- Position: The Most Expensive Part of an LLM should be its Training Data [38.3722794045587]
Training a Large Language Model (LLM) is an increasingly expensive endeavor due to growing computational, hardware, energy, and engineering demands.
Yet, an often-overlooked (and seldom paid) expense is the human labor behind these models' training data.
This position paper aims to assign a monetary value to this labor and argues that the most expensive part of producing an LLM should be the compensation provided to training data producers.
arXiv Detail & Related papers (2025-04-16T18:56:14Z) - Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes.
Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
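A minimal sketch of that surprisal idea (not the authors' actual probing pipeline): rank candidate passages by their mean negative log-likelihood under a reference model and probe the highest-surprisal ones first. The model name and example passages below are placeholders.

```python
# Rank candidate passages by surprisal (mean negative log-likelihood) under a
# small reference model; high-surprisal passages become probe material.
# Model name and example passages are placeholders, not the authors' setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def surprisal(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # passing labels=input_ids makes the model return mean token NLL as .loss
        return lm(ids, labels=ids).loss.item()

passages = [
    "The meeting is scheduled for next Tuesday afternoon.",
    "Patient X's discharge summary lists an unusual drug-gene interaction.",
]
for p in sorted(passages, key=surprisal, reverse=True):
    print(f"{surprisal(p):6.2f}  {p}")  # probe the highest-surprisal passages first
```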
arXiv Detail & Related papers (2025-03-15T10:19:15Z) - Question Answering on Patient Medical Records with Private Fine-Tuned LLMs [1.8524621910043437]
Large Language Models (LLMs) enable semantic question answering (QA) over medical data.
However, ensuring privacy and compliance requires edge and private deployments of LLMs.
We evaluate privately hosted, fine-tuned LLMs against benchmark models such as GPT-4 and GPT-4o.
arXiv Detail & Related papers (2025-01-23T14:13:56Z) - Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources [0.0]
We present a medical adaptation based on recent 7B models, which enables operation with low computational resources.
We find that fine-tuning an English-centric base model on a Japanese medical dataset improves scores in both languages.
arXiv Detail & Related papers (2024-09-18T08:07:37Z) - A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations [5.265452667976959]
This survey systematically summarizes how to train medical LLMs based on open-source general LLMs.
It covers (a) how to acquire a training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, and (c) existing challenges and promising research directions.
arXiv Detail & Related papers (2024-06-14T02:42:20Z) - Retrieval Augmented Thought Process for Private Data Handling in Healthcare [53.89406286212502]
We introduce the Retrieval-Augmented Thought Process (RATP), which formulates the thought generation of Large Language Models (LLMs) as a multi-step decision process.
On a private dataset of electronic medical records, RATP achieves 35% additional accuracy compared to in-context retrieval-augmented generation for the question-answering task.
arXiv Detail & Related papers (2024-02-12T17:17:50Z) - Large Language Models: A Survey [69.72787936480394]
Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks.
LLMs acquire their general-purpose language understanding and generation abilities by training billions of parameters on massive amounts of text data.
arXiv Detail & Related papers (2024-02-09T05:37:09Z) - ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases by prompting ChiMed-GPT to complete attitude scales regarding discrimination against patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z) - A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z) - LLM-based Medical Assistant Personalization with Short- and Long-Term Memory Coordination [20.269899169364397]
Large Language Models (LLMs) have exhibited remarkable proficiency in comprehending and generating natural language.
We propose a novel computational bionic memory mechanism, equipped with a parameter-efficient fine-tuning (PEFT) schema, to personalize medical assistants.
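The bionic memory mechanism itself is specific to that paper, but the PEFT ingredient it mentions can be illustrated with a generic LoRA setup using the Hugging Face peft library; the base model and hyperparameters below are placeholders.

```python
# Generic LoRA-style parameter-efficient fine-tuning with the Hugging Face
# peft library. This shows only the PEFT ingredient, not the paper's
# bionic-memory mechanism; base model and hyperparameters are placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder base model
config = LoraConfig(
    r=8,                         # rank of the low-rank adapter matrices
    lora_alpha=16,
    target_modules=["c_attn"],   # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
# `model` can now be fine-tuned per user with a standard Trainer loop.
```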
arXiv Detail & Related papers (2023-09-21T00:34:33Z) - Balanced and Explainable Social Media Analysis for Public Health with
Large Language Models [13.977401672173533]
Current techniques for public health analysis involve popular models such as BERT and large language models (LLMs).
Challenges such as data imbalance can be addressed with sophisticated data augmentation methods for social media datasets.
In this paper, a novel ALEX framework is proposed for social media analysis on public health.
arXiv Detail & Related papers (2023-09-12T04:15:34Z) - MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z) - Aligning Large Language Models with Human: A Survey [53.6014921995006]
Large Language Models (LLMs) trained on extensive textual corpora have emerged as leading solutions for a broad array of Natural Language Processing (NLP) tasks.
Despite their notable performance, these models are prone to limitations such as misunderstanding human instructions, generating potentially biased content, or producing factually incorrect information.
This survey presents a comprehensive overview of alignment technologies for LLMs.
arXiv Detail & Related papers (2023-07-24T17:44:58Z)