Continuous Training and Fine-tuning for Domain-Specific Language Models in Medical Question Answering
- URL: http://arxiv.org/abs/2311.00204v1
- Date: Wed, 1 Nov 2023 00:18:00 GMT
- Title: Continuous Training and Fine-tuning for Domain-Specific Language Models in Medical Question Answering
- Authors: Zhen Guo, Yining Hua
- Abstract summary: Large language models exhibit promising general capabilities but often lack specialized knowledge for domain-specific tasks.
This work demonstrates a method using continuous training and instruction fine-tuning to rapidly adapt Llama 2 base models to the Chinese medical domain.
- Score: 4.254954312483959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models exhibit promising general capabilities but often lack
specialized knowledge for domain-specific tasks. Developing domain experts from
a base model enables a range of applications without prohibitive training
costs. This work demonstrates a method using continuous training and
instruction fine-tuning to rapidly adapt Llama 2 base models to the Chinese
medical domain. We first conduct continuous training on 1B tokens from Chinese
medical references to teach relevant vocabulary and knowledge. The models are
then fine-tuned on 54K examples sourced from the Chinese National Medical
Licensing Examination. Experiments on Chinese medical data confirm the
effectiveness of this approach, producing a model comparable to GPT-3.5-turbo
while using far fewer computational resources. The resulting domain-specific
model could be useful for various Chinese medical applications. More broadly,
this provides a template for domain-specific training of large language models
in areas where pre-trained models lack the required expertise, such as law,
science, and engineering.
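As a rough illustration of the two-stage recipe the abstract describes (continued pretraining on domain text, then instruction fine-tuning), the sketch below uses Hugging Face Transformers. The corpus paths, prompt template, model size, and hyperparameters are assumptions for illustration, not details from the paper.
```python
# Hypothetical sketch of the two-stage adaptation recipe, using Hugging Face
# Transformers. File names, prompt template, and hyperparameters are assumed.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE = "meta-llama/Llama-2-7b-hf"  # Llama 2 base model (size assumed)
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(BASE)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # causal-LM labels

def tokenize_text(batch):
    # Plain next-token objective over raw domain text.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

# Stage 1: continued pretraining on Chinese medical reference text (~1B tokens).
corpus = load_dataset("json", data_files="medical_references.jsonl")["train"]  # assumed file
corpus = corpus.map(tokenize_text, batched=True, remove_columns=corpus.column_names)
Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-med-cpt", per_device_train_batch_size=4,
                           gradient_accumulation_steps=32, learning_rate=2e-5,
                           num_train_epochs=1, bf16=True),
    train_dataset=corpus,
    data_collator=collator,
).train()

def tokenize_qa(batch):
    # Assumed instruction template; a full implementation would also mask the
    # prompt tokens from the loss, which is omitted here for brevity.
    texts = [f"Question: {q}\nAnswer: {a}{tokenizer.eos_token}"
             for q, a in zip(batch["question"], batch["answer"])]
    return tokenizer(texts, truncation=True, max_length=2048)

# Stage 2: instruction fine-tuning on ~54K exam-style QA pairs.
exam = load_dataset("json", data_files="cnmle_qa.jsonl")["train"]  # assumed file
exam = exam.map(tokenize_qa, batched=True, remove_columns=exam.column_names)
Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-med-sft", per_device_train_batch_size=4,
                           gradient_accumulation_steps=16, learning_rate=1e-5,
                           num_train_epochs=3, bf16=True),
    train_dataset=exam,
    data_collator=collator,
).train()
```
In this reading, the two stages differ mainly in the data and learning-rate schedule; the training objective is the same causal-language-modeling loss throughout.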
Related papers
- DSG-KD: Knowledge Distillation from Domain-Specific to General Language Models [8.328673243329794]
This study investigates emergency/non-emergency classification tasks based on electronic medical record (EMR) data obtained from pediatric emergency departments (PEDs) in Korea.
Existing domain-specific pre-trained language models underperform general language models in handling the characteristics of this non-English free-text data.
We propose a domain knowledge transfer methodology that leverages knowledge distillation to infuse general language models with domain-specific knowledge via fine-tuning (a generic distillation-loss sketch, for illustration, appears after this list).
arXiv Detail & Related papers (2024-09-23T10:59:02Z)
- Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
- Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings [10.39989311209284]
We have conducted a comprehensive survey of language models in the medical field.
We evaluated a subset of these for medical text classification and conditional text generation.
The results reveal remarkable performance across the evaluated tasks, underscoring the potential of certain models to contain medical knowledge.
arXiv Detail & Related papers (2024-06-24T12:52:02Z)
- Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z)
- Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training can match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z)
- KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model [37.69464822182714]
Most biomedical pretrained language models are monolingual and cannot handle the growing cross-lingual requirements.
We propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach.
arXiv Detail & Related papers (2023-11-20T07:02:35Z)
- HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [61.41790586411816]
HuatuoGPT-II has shown state-of-the-art performance in the Chinese medical domain on a number of benchmarks.
It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine.
arXiv Detail & Related papers (2023-11-16T10:56:24Z)
- MedChatZH: a Better Medical Adviser Learns from Better Instructions [11.08819869122466]
We introduce MedChatZH, a dialogue model designed specifically for traditional Chinese medical QA.
Our model is pre-trained on Chinese traditional medical books and fine-tuned with a carefully curated medical instruction dataset.
It outperforms several solid baselines on a real-world medical dialogue dataset.
arXiv Detail & Related papers (2023-09-03T08:08:15Z)
- Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing existing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
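The DSG-KD entry above names knowledge distillation as the transfer mechanism but does not give its exact loss. The following is a generic, hypothetical Hinton-style distillation loss in PyTorch, where the teacher logits would come from the domain-specific model and the student is the general model being fine-tuned; the temperature, mixing weight, and binary label scheme are assumptions.
```python
# Generic knowledge-distillation loss (soft teacher targets + hard labels).
# This is a hedged sketch, not the DSG-KD paper's actual formulation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft term: KL divergence between temperature-scaled teacher and student
    # distributions (scaled by T^2 so gradients keep a comparable magnitude).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard term: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy demo with random logits for a binary emergency / non-emergency task.
student_logits = torch.randn(8, 2, requires_grad=True)  # from the general (student) model
teacher_logits = torch.randn(8, 2)                       # from the domain-specific teacher
labels = torch.randint(0, 2, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients flow only into the student logits
```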