Enriching Biomedical Knowledge for Low-resource Language Through
Translation
- URL: http://arxiv.org/abs/2210.05598v1
- Date: Tue, 11 Oct 2022 16:35:10 GMT
- Title: Enriching Biomedical Knowledge for Low-resource Language Through
Translation
- Authors: Long Phan, Tai Dang, Hieu Tran, Vy Phan, Lam D. Chau, and Trieu H.
Trinh
- Abstract summary: We make use of a state-of-the-art translation model in English-Vietnamese to translate and produce both pretrained as well as supervised data in the biomedical domains.
Thanks to such large-scale translation, we introduce ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20 million translated abstracts from the high-quality public PubMed corpus.
- Score: 1.6347851388527643
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Biomedical data and benchmarks are highly valuable yet very limited in
low-resource languages other than English, such as Vietnamese. In this paper, we
make use of a state-of-the-art translation model in English-Vietnamese to
translate and produce both pretrained as well as supervised data in the
biomedical domains. Thanks to such large-scale translation, we introduce
ViPubmedT5, a pretrained Encoder-Decoder Transformer model trained on 20
million translated abstracts from the high-quality public PubMed corpus.
ViPubmedT5 demonstrates state-of-the-art results on two different biomedical
benchmarks in summarization and acronym disambiguation. Further, we release
ViMedNLI - a new NLP task in Vietnamese translated from MedNLI using the
recently public En-vi translation model and carefully refined by human experts,
with evaluations of existing methods against ViPubmedT5.
Related papers
- Medical mT5: An Open-Source Multilingual Text-to-Text LLM for The Medical Domain [19.58987478434808]
We present Medical mT5, the first open-source text-to-text multilingual model for the medical domain.
A comprehensive evaluation shows that Medical mT5 outperforms both encoders and similarly sized text-to-text models for the Spanish, French, and Italian benchmarks.
arXiv Detail & Related papers (2024-04-11T10:01:32Z)
- Improving Vietnamese-English Medical Machine Translation [14.172448099399407]
MedEV is a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs.
We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset.
Experimental results show that the best performance is achieved by fine-tuning "vinai-translate" for each translation direction.
arXiv Detail & Related papers (2024-03-28T06:07:15Z)
- BiMediX: Bilingual Medical Mixture of Experts LLM [94.85518237963535]
We introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic.
Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details.
We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations.
arXiv Detail & Related papers (2024-02-20T18:59:26Z)
- Importance-Aware Data Augmentation for Document-Level Neural Machine Translation [51.74178767827934]
Document-level neural machine translation (DocNMT) aims to generate translations that are both coherent and cohesive.
Due to its longer input length and limited availability of training data, DocNMT often faces the challenge of data sparsity.
We propose a novel Importance-Aware Data Augmentation (IADA) algorithm for DocNMT that augments the training data based on token importance information estimated by the norm of hidden states and training gradients.
arXiv Detail & Related papers (2024-01-27T09:27:47Z)
- LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [85.19963303642427]
We propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images.
The model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics.
This enables us to train a Large Language and Vision Assistant for BioMedicine in less than 15 hours (with eight A100s).
arXiv Detail & Related papers (2023-06-01T16:50:07Z)
- MTet: Multi-domain Translation for English and Vietnamese [10.126442202316825]
MTet is the largest publicly available parallel corpus for English-Vietnamese translation.
We release the first pretrained model EnViT5 for English and Vietnamese languages.
arXiv Detail & Related papers (2022-10-11T16:55:21Z)
- CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark [51.38557174322772]
We present the first Chinese Biomedical Language Understanding Evaluation benchmark.
It is a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification.
We report empirical results for 11 current pre-trained Chinese models; the results show that state-of-the-art neural models still perform far worse than the human ceiling.
arXiv Detail & Related papers (2021-06-15T12:25:30Z)
- Conceptualized Representation Learning for Chinese Biomedical Text Mining [14.77516568767045]
We investigate how the recently introduced pre-trained language model BERT can be adapted for Chinese biomedical corpora.
Adaptation is more difficult for Chinese biomedical text due to its complex structure and the variety of phrase combinations.
arXiv Detail & Related papers (2020-08-25T04:41:35Z)
- A Multilingual Neural Machine Translation Model for Biomedical Data [84.17747489525794]
We release a multilingual neural machine translation model, which can be used to translate text in the biomedical domain.
The model can translate from 5 languages (French, German, Italian, Korean and Spanish) into English.
It is trained with large amounts of generic and biomedical data, using domain tags.
arXiv Detail & Related papers (2020-08-06T21:26:43Z)
- Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation.
We ask whether machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
arXiv Detail & Related papers (2020-04-06T12:05:02Z)
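Of the papers listed above, the IADA entry describes the most concrete mechanism: scoring token importance by the norm of hidden states and masking accordingly. The following is a rough, simplified sketch of that hidden-state-norm half of the idea only; the function names, the `keep_ratio` threshold, and the masking scheme are invented for illustration and are not the paper's actual algorithm (which also uses training gradients).

```python
import numpy as np

def token_importance(hidden_states: np.ndarray) -> np.ndarray:
    """Score each token by the L2 norm of its hidden-state vector,
    one proxy for token importance."""
    return np.linalg.norm(hidden_states, axis=-1)

def augment_by_importance(tokens, hidden_states, mask_token="<mask>", keep_ratio=0.7):
    """Mask the least-important tokens, forcing a model to rely on
    wider document context to reconstruct them."""
    scores = token_importance(hidden_states)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = set(np.argsort(scores)[-k:])  # indices of the k highest-scoring tokens
    return [t if i in keep else mask_token for i, t in enumerate(tokens)]

# Toy example: 5 tokens with random 4-dimensional "hidden states".
rng = np.random.default_rng(0)
tokens = ["the", "patient", "was", "given", "aspirin"]
hidden = rng.normal(size=(5, 4))
augmented = augment_by_importance(tokens, hidden, keep_ratio=0.6)
```

With `keep_ratio=0.6` over five tokens, three tokens are kept and two are masked; in practice the augmented sequence would be paired with the original as additional training data.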
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.