Continuous Training and Fine-tuning for Domain-Specific Language Models
in Medical Question Answering
- URL: http://arxiv.org/abs/2311.00204v1
- Date: Wed, 1 Nov 2023 00:18:00 GMT
- Title: Continuous Training and Fine-tuning for Domain-Specific Language Models
in Medical Question Answering
- Authors: Zhen Guo, Yining Hua
- Abstract summary: Large language models exhibit promising general capabilities but often lack specialized knowledge for domain-specific tasks.
This work demonstrates a method using continuous training and instruction fine-tuning to rapidly adapt Llama 2 base models to the Chinese medical domain.
- Score: 4.254954312483959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models exhibit promising general capabilities but often lack
specialized knowledge for domain-specific tasks. Developing domain experts from
a base model enables a range of applications without prohibitive training
costs. This work demonstrates a method using continuous training and
instruction fine-tuning to rapidly adapt Llama 2 base models to the Chinese
medical domain. We first conduct continuous training on 1B tokens from Chinese
medical references to teach relevant vocabulary and knowledge. The models are
then fine-tuned on 54K examples sourced from the Chinese National Medical
Licensing Examination. Experiments on Chinese medical data confirm the
effectiveness of this approach, producing a model comparable to GPT-3.5-turbo
while using way less computational resource. The resulting domain-specific
model could be useful for various Chinese medical applications. More broadly,
this provides a template for domain-specific training of large language models
in areas where pre-trained models lack the required expertise, such as law,
science, and engineering.
Related papers
- Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings [10.39989311209284]
We provide a comprehensive survey of language models in the medical domain.
Our subset encompasses 53 models, ranging from 110 million to 13 billion parameters.
Our findings reveal remarkable performance across various tasks and datasets.
arXiv Detail & Related papers (2024-06-24T12:52:02Z) - Medical Vision-Language Pre-Training for Brain Abnormalities [96.1408455065347]
We show how to automatically collect medical image-text aligned data for pretraining from public resources such as PubMed.
In particular, we present a pipeline that streamlines the pre-training process by initially collecting a large brain image-text dataset.
We also investigate the unique challenge of mapping subfigures to subcaptions in the medical domain.
arXiv Detail & Related papers (2024-04-27T05:03:42Z) - Comprehensive Study on German Language Models for Clinical and Biomedical Text Understanding [16.220303664681172]
We pre-trained several German medical language models on 2.4B tokens derived from translated public English medical data and 3B tokens of German clinical data.
The resulting models were evaluated on various German downstream tasks, including named entity recognition (NER), multi-label classification, and extractive question answering.
We conclude that continuous pre-training has demonstrated the ability to match or even exceed the performance of clinical models trained from scratch.
arXiv Detail & Related papers (2024-04-08T17:24:04Z) - KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained
Language Model [37.69464822182714]
Most biomedical pretrained language models are monolingual and cannot handle the growing cross-lingual requirements.
We propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach.
arXiv Detail & Related papers (2023-11-20T07:02:35Z) - HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs [62.73042700847977]
HuatuoGPT-II has shown state-of-the-art performance in Chinese medicine domain on a number of benchmarks.
It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine.
arXiv Detail & Related papers (2023-11-16T10:56:24Z) - MedChatZH: a Better Medical Adviser Learns from Better Instructions [11.08819869122466]
We introduce MedChatZH, a dialogue model designed specifically for traditional Chinese medical QA.
Our model is pre-trained on Chinese traditional medical books and fine-tuned with a carefully curated medical instruction dataset.
It outperforms several solid baselines on a real-world medical dialogue dataset.
arXiv Detail & Related papers (2023-09-03T08:08:15Z) - Towards Best Practices for Training Multilingual Dense Retrieval Models [54.91016739123398]
We focus on the task of monolingual retrieval in a variety of typologically diverse languages using one such design.
Our study is organized as a "best practices" guide for training multilingual dense retrieval models.
arXiv Detail & Related papers (2022-04-05T17:12:53Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% computational cost of pre-training BERT_BASE and GPT_BASE by reusing the models of almost their half sizes.
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Domain-Specific Language Model Pretraining for Biomedical Natural
Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.