Building Chinese Biomedical Language Models via Multi-Level Text
Discrimination
- URL: http://arxiv.org/abs/2110.07244v1
- Date: Thu, 14 Oct 2021 10:43:28 GMT
- Title: Building Chinese Biomedical Language Models via Multi-Level Text
Discrimination
- Authors: Quan Wang and Songtai Dai and Benfeng Xu and Yajuan Lyu and Yong Zhu
and Hua Wu and Haifeng Wang
- Abstract summary: We introduce eHealth, a biomedical PLM in Chinese built with a new pre-training framework.
This new framework trains eHealth as a discriminator through both token-level and sequence-level discrimination.
eHealth can learn language semantics at both the token and sequence levels.
- Score: 24.992542216072152
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Pre-trained language models (PLMs), such as BERT and GPT, have revolutionized
the field of NLP, not only in the general domain but also in the biomedical
domain. Most prior efforts in building biomedical PLMs have resorted simply to
domain adaptation and focused mainly on English. In this work we introduce
eHealth, a biomedical PLM in Chinese built with a new pre-training framework.
This new framework trains eHealth as a discriminator through both token-level
and sequence-level discrimination. The former is to detect input tokens
corrupted by a generator and select their original signals from plausible
candidates, while the latter is to further distinguish corruptions of the same
original sequence from those of other sequences. As such, eHealth can learn language
semantics at both the token and sequence levels. Extensive experiments on 11
Chinese biomedical language understanding tasks of various forms verify the
effectiveness and superiority of our approach. The pre-trained model is
available to the public at
https://github.com/PaddlePaddle/Research/tree/master/KG/eHealth, and the
code will also be released later.
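To make the two-level objective above concrete, the following is a minimal PyTorch sketch of how a token-level detection-and-correction loss might be combined with an in-batch sequence-level contrastive loss. The tensor shapes, function names, and the exact contrastive formulation are illustrative assumptions, not the authors' released implementation (which is built on PaddlePaddle), and the paper's actual losses may differ in detail.
```python
# Minimal sketch of a multi-level discrimination objective in the spirit of the
# abstract above. All names, shapes, and loss forms are illustrative assumptions.
import torch
import torch.nn.functional as F

def token_level_loss(detect_logits, is_replaced, candidate_logits, candidate_labels):
    """Token-level discrimination: (1) ELECTRA-style replaced-token detection,
    (2) selecting the original token from a small set of plausible candidates."""
    # detect_logits: (batch, seq_len) -- per-token "was this token replaced?" scores
    # is_replaced:   (batch, seq_len) -- 0/1 targets produced by the generator corruption
    detection = F.binary_cross_entropy_with_logits(detect_logits, is_replaced.float())
    # candidate_logits: (num_corrupted, num_candidates) -- scores over candidate originals
    # candidate_labels: (num_corrupted,) -- index of the true original among the candidates
    correction = F.cross_entropy(candidate_logits, candidate_labels)
    return detection + correction

def sequence_level_loss(cls_a, cls_b, temperature=0.05):
    """Sequence-level discrimination: two corruptions of the same original sequence
    should pair up against corruptions of other sequences in the batch
    (an in-batch contrastive loss; the paper's formulation may differ)."""
    a = F.normalize(cls_a, dim=-1)      # (batch, hidden) from corruption view A
    b = F.normalize(cls_b, dim=-1)      # (batch, hidden) from corruption view B
    logits = a @ b.t() / temperature    # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))   # diagonal pairs share the same original sequence
    return F.cross_entropy(logits, targets)

# Toy usage with random tensors, just to show the expected shapes.
batch, seq_len, hidden, num_cand, num_corrupt = 4, 16, 32, 5, 10
loss = (
    token_level_loss(
        torch.randn(batch, seq_len),
        torch.randint(0, 2, (batch, seq_len)),
        torch.randn(num_corrupt, num_cand),
        torch.randint(0, num_cand, (num_corrupt,)),
    )
    + sequence_level_loss(torch.randn(batch, hidden), torch.randn(batch, hidden))
)
print(loss.item())
```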
Related papers
- KBioXLM: A Knowledge-anchored Biomedical Multilingual Pretrained Language Model [37.69464822182714]
Most biomedical pretrained language models are monolingual and cannot handle the growing cross-lingual requirements.
We propose a model called KBioXLM, which transforms the multilingual pretrained model XLM-R into the biomedical domain using a knowledge-anchored approach.
arXiv Detail & Related papers (2023-11-20T07:02:35Z)
- Language Generation from Brain Recordings [68.97414452707103]
We propose a generative language BCI that utilizes the capacity of a large language model and a semantic brain decoder.
The proposed model can generate coherent language sequences aligned with the semantic content of visual or auditory language stimuli.
Our findings demonstrate the potential and feasibility of employing BCIs in direct language generation.
arXiv Detail & Related papers (2023-11-16T13:37:21Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large-scale biomedical literature.
We achieve F1 scores of 44.98%, 38.42% and 40.76% on the BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
- Investigating Massive Multilingual Pre-Trained Machine Translation Models for Clinical Domain via Transfer Learning [11.571189144910521]
Massively multilingual pre-trained language models (MMPLMs) have been developed in recent years, demonstrating strong capabilities and prior knowledge acquired for downstream tasks.
This work investigates whether MMPLMs can be applied to clinical-domain machine translation (MT) for entirely unseen languages via transfer learning.
arXiv Detail & Related papers (2022-10-12T10:19:44Z)
- BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model [1.1764594853212893]
In this work, we introduce the generative language model BioBART that adapts BART to the biomedical domain.
We collate various biomedical language generation tasks including dialogue, summarization, entity linking, and named entity recognition.
BioBART, pretrained on PubMed abstracts, achieves improved performance over BART and sets strong baselines on several tasks.
arXiv Detail & Related papers (2022-04-08T08:07:42Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the experiments show that state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Conceptualized Representation Learning for Chinese Biomedical Text Mining [14.77516568767045]
We investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora.
Chinese biomedical text is more difficult to process due to its complex structure and the wide variety of phrase combinations.
arXiv Detail & Related papers (2020-08-25T04:41:35Z)
- Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing [73.37262264915739]
We show that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains.
Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks.
arXiv Detail & Related papers (2020-07-31T00:04:15Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE (a brief tokenizer sketch follows this list).
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
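As a side note on the tokenization comparison in the "Byte Pair Encoding is Suboptimal for Language Model Pretraining" entry above, the sketch below trains both a unigram LM tokenizer and a BPE tokenizer with the SentencePiece library and prints their segmentations of a sample sentence. The corpus file name, vocabulary size, and sample sentence are placeholder assumptions, not settings from that paper.
```python
# Illustrative comparison of unigram LM vs. BPE tokenization using SentencePiece,
# which implements both schemes. Requires a plain-text corpus file to exist at the
# placeholder path below (one sentence per line).
import sentencepiece as spm

for model_type in ("unigram", "bpe"):
    spm.SentencePieceTrainer.train(
        input="biomedical_corpus.txt",   # placeholder corpus path
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,                 # placeholder vocabulary size
        model_type=model_type,           # 'unigram' = unigram LM, 'bpe' = byte-pair encoding
    )
    sp = spm.SentencePieceProcessor()
    sp.load(f"tok_{model_type}.model")
    print(model_type, sp.encode_as_pieces("Pretrained language models for biomedicine."))
```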
This list is automatically generated from the titles and abstracts of the papers in this site.