Conceptualized Representation Learning for Chinese Biomedical Text
Mining
- URL: http://arxiv.org/abs/2008.10813v1
- Date: Tue, 25 Aug 2020 04:41:35 GMT
- Title: Conceptualized Representation Learning for Chinese Biomedical Text
Mining
- Authors: Ningyu Zhang, Qianghuai Jia, Kangping Yin, Liang Dong, Feng Gao,
Nengwei Hua
- Abstract summary: We investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora.
For Chinese biomedical text, adaptation is even more difficult due to its complex structure and the variety of phrase combinations.
- Score: 14.77516568767045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Biomedical text mining is becoming increasingly important as the number of
biomedical documents and web data rapidly grows. Recently, word representation
models such as BERT have gained popularity among researchers. However, it is
difficult to estimate their performance on datasets containing biomedical texts
as the word distributions of general and biomedical corpora are quite
different. Moreover, the medical domain has long-tail concepts and
terminologies that are difficult to learn via language models. For Chinese
biomedical text, adaptation is even more difficult due to its complex structure
and the variety of phrase combinations. In this paper, we investigate how the
recently introduced pre-trained language model BERT can be adapted for Chinese
biomedical corpora and propose a novel conceptualized representation learning
approach. We also release a new Chinese Biomedical Language Understanding
Evaluation benchmark (\textbf{ChineseBLUE}). We examine the effectiveness of
Chinese pre-trained models: BERT, BERT-wwm, RoBERTa, and our approach.
Experimental results on the benchmark show that our approach brings
significant gains. We release the pre-trained model on GitHub:
https://github.com/alibaba-research/ChineseBLUE.
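At a high level, the conceptualized approach masks whole medical entities and phrases rather than individual characters during pre-training, so the model has to predict complete concepts. The sketch below only illustrates that masking idea and is not the released implementation; the entity lexicon, character-level tokenization, and masking rate are assumptions.

```python
import random

def whole_entity_mask(chars, entity_lexicon, mask_prob=0.15, mask_token="[MASK]"):
    """Mask whole biomedical entities as single units instead of masking
    individual characters; entity_lexicon is an assumed set of known terms."""
    masked = list(chars)
    i = 0
    while i < len(chars):
        # Greedily match the longest lexicon entry starting at position i.
        span_len = 0
        for length in range(min(8, len(chars) - i), 0, -1):
            if "".join(chars[i:i + length]) in entity_lexicon:
                span_len = length
                break
        if span_len and random.random() < mask_prob:
            for j in range(i, i + span_len):
                masked[j] = mask_token  # mask the entire entity span
            i += span_len
        else:
            i += 1
    return masked

# Example: the disease name "高血压" (hypertension) is masked as one unit.
print(whole_entity_mask(list("患者有高血压病史"), {"高血压"}, mask_prob=1.0))
```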
Related papers
- Diversifying Knowledge Enhancement of Biomedical Language Models using
Adapter Modules and Knowledge Graphs [54.223394825528665]
We develop an approach that uses lightweight adapter modules to inject structured biomedical knowledge into pre-trained language models.
We use two large KGs, the biomedical knowledge system UMLS and the novel biochemical OntoChem, with two prominent biomedical PLMs, PubMedBERT and BioLinkBERT.
We show that our methodology leads to performance improvements in several instances while keeping computing requirements low.
arXiv Detail & Related papers (2023-12-21T14:26:57Z)
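For context, the lightweight adapters referenced above are typically small bottleneck layers inserted into a frozen pre-trained model, so only a few parameters are trained. A minimal PyTorch sketch of such a bottleneck adapter follows; the hidden size, bottleneck width, and placement are assumptions rather than the paper's exact architecture.

```python
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable adapter added after a frozen transformer sublayer;
    only the down/up projections are updated, keeping compute low."""
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        # Residual connection preserves the frozen PLM's representation.
        return hidden_states + self.up(self.act(self.down(hidden_states)))
```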
- BioLORD-2023: Semantic Textual Representations Fusing LLM and Clinical Knowledge Graph Insights [15.952942443163474]
We propose a new state-of-the-art approach for obtaining high-fidelity representations of biomedical concepts and sentences.
We demonstrate consistent and substantial performance improvements over the previous state of the art.
Besides our new state-of-the-art biomedical model for English, we also distill and release a multilingual model compatible with 50+ languages.
arXiv Detail & Related papers (2023-11-27T18:46:17Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions about biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining [140.61707108174247]
We propose BioGPT, a domain-specific generative Transformer language model pre-trained on large scale biomedical literature.
We achieve F1 scores of 44.98%, 38.42% and 40.76% on the BC5CDR, KD-DTI and DDI end-to-end relation extraction tasks, respectively, and 78.2% accuracy on PubMedQA.
arXiv Detail & Related papers (2022-10-19T07:17:39Z)
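As a usage note, the released BioGPT checkpoint can be loaded through the Hugging Face transformers library; the snippet below is an illustrative sketch (not from the paper) and assumes the public "microsoft/biogpt" checkpoint name and default generation settings.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed public checkpoint name on the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")
model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

inputs = tokenizer("COVID-19 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```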
- RuBioRoBERTa: a pre-trained biomedical language model for Russian language biomedical text mining [117.56261821197741]
We present several BERT-based models for Russian language biomedical text mining.
The models are pre-trained on a corpus of freely available texts in the Russian biomedical domain.
arXiv Detail & Related papers (2022-04-08T09:18:59Z)
- Building Chinese Biomedical Language Models via Multi-Level Text Discrimination [24.992542216072152]
We introduce eHealth, a biomedical PLM in Chinese built with a new pre-training framework.
This new framework trains eHealth as a discriminator through both token-level and sequence-level discrimination.
eHealth can thus learn language semantics at both the token and sequence levels.
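The token-level discrimination mentioned above is essentially replaced-token detection: the model predicts, for each position, whether the token was corrupted. The sketch below shows only that core loss; eHealth's full framework (token correction plus a sequence-level objective) differs, so treat this as an assumption-laden illustration.

```python
import torch.nn.functional as F

def token_discrimination_loss(disc_logits, corrupted_ids, original_ids):
    """Label each position 1 if its token was replaced and 0 otherwise, then
    score the per-token discriminator predictions with binary cross-entropy."""
    labels = (corrupted_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(disc_logits.squeeze(-1), labels)
```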
arXiv Detail & Related papers (2021-10-14T10:43:28Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the results show that state-of-the-art neural models still perform far below the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- A Multilingual Neural Machine Translation Model for Biomedical Data [84.17747489525794]
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain.
The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English.
It is trained with large amounts of generic and biomedical data, using domain tags.
arXiv Detail & Related papers (2020-08-06T21:26:43Z)
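"Domain tags" in this setting are usually just control tokens prepended to the source sentence so a single model can be steered toward generic or biomedical translation at inference time. The tag format below is an assumption for illustration, not the paper's exact scheme.

```python
def tag_source(sentence: str, src_lang: str, domain: str) -> str:
    """Prepend language and domain control tokens to the source sentence."""
    return f"<{src_lang}2en> <{domain}> {sentence}"

# e.g. a German biomedical sentence routed into English with a biomedical tag
print(tag_source("Der Patient hat Bluthochdruck.", "de", "bio"))
```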
- Pre-training technique to localize medical BERT and enhance biomedical BERT [0.0]
It is difficult to train domain-specific BERT models that perform well when few publicly available, high-quality, large-scale databases exist for the domain.
We propose a single intervention: simultaneous pre-training after up-sampling, combined with an amplified vocabulary.
Our Japanese medical BERT outperformed conventional baselines and the other BERT models on a medical document classification task.
arXiv Detail & Related papers (2020-05-14T18:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.