COVID-19 Named Entity Recognition for Vietnamese
- URL: http://arxiv.org/abs/2104.03879v1
- Date: Thu, 8 Apr 2021 16:35:34 GMT
- Title: COVID-19 Named Entity Recognition for Vietnamese
- Authors: Thinh Hung Truong, Mai Hoang Dao, Dat Quoc Nguyen
- Abstract summary: We present the first manually-annotated COVID-19 domain-specific dataset for Vietnamese.
Our dataset is annotated for the named entity recognition task with newly-defined entity types.
Our dataset also contains the largest number of entities compared to existing Vietnamese NER datasets.
- Score: 6.17059264011429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The current COVID-19 pandemic has led to the creation of many corpora that
facilitate NLP research and downstream applications to help fight the pandemic.
However, most of these corpora are exclusively for English. As the pandemic is
a global problem, it is worth creating COVID-19 related datasets for languages
other than English. In this paper, we present the first manually-annotated
COVID-19 domain-specific dataset for Vietnamese. Particularly, our dataset is
annotated for the named entity recognition (NER) task with newly-defined entity
types that can be used in other future epidemics. Our dataset also contains the
largest number of entities compared to existing Vietnamese NER datasets. We
empirically conduct experiments using strong baselines on our dataset, and find
that: automatic Vietnamese word segmentation helps improve the NER results and
the highest performances are obtained by fine-tuning pre-trained language
models where the monolingual model PhoBERT for Vietnamese (Nguyen and Nguyen,
2020) produces higher results than the multilingual model XLM-R (Conneau et
al., 2020). We publicly release our dataset at:
https://github.com/VinAIResearch/PhoNER_COVID19
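As a minimal illustration (not taken from the paper or the released dataset), the BIO annotation scheme commonly used for NER datasets like this one can be processed as follows; the tag names and the word-segmented Vietnamese tokens below are illustrative assumptions, not the dataset's actual schema:

```python
# Minimal sketch: counting entity spans in a BIO-tagged, word-segmented
# NER corpus. Tag names here are illustrative, not the dataset's schema.

def count_entities(tagged_sentence):
    """Count contiguous entity spans in a list of (token, BIO-tag) pairs."""
    counts = {}
    for _, tag in tagged_sentence:
        if tag.startswith("B-"):  # each B- tag opens exactly one new entity
            etype = tag[2:]
            counts[etype] = counts.get(etype, 0) + 1
    return counts

# Hypothetical example; underscores join syllables into words, as produced
# by the automatic Vietnamese word segmentation the paper reports helps NER.
sentence = [
    ("Bệnh_nhân", "O"),
    ("BN91", "B-PATIENT_ID"),
    ("có", "O"),
    ("triệu_chứng", "O"),
    ("sốt", "B-SYMPTOM"),
    ("cao", "I-SYMPTOM"),
]
print(count_entities(sentence))  # {'PATIENT_ID': 1, 'SYMPTOM': 1}
```

Counting only `B-` tags gives the number of entities per type regardless of span length, which is how dataset statistics such as "largest number of entities" are typically computed.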
Related papers
- Medical Spoken Named Entity Recognition [18.348129901298652]
We present VietMed-NER - the first spoken NER dataset in the medical domain.
We present baseline results using various state-of-the-art pre-trained models.
By simply translating the transcripts, the dataset can be extended beyond Vietnamese to other languages as well.
arXiv Detail & Related papers (2024-06-19T08:39:09Z) - MSciNLI: A Diverse Benchmark for Scientific Natural Language Inference [65.37685198688538]
This paper presents MSciNLI, a dataset containing 132,320 sentence pairs extracted from five new scientific domains.
We establish strong baselines on MSciNLI by fine-tuning Pre-trained Language Models (PLMs) and prompting Large Language Models (LLMs)
We show that domain shift degrades the performance of scientific NLI models which demonstrates the diverse characteristics of different domains in our dataset.
arXiv Detail & Related papers (2024-04-11T18:12:12Z) - Improving Vietnamese-English Medical Machine Translation [14.172448099399407]
MedEV is a high-quality Vietnamese-English parallel dataset constructed specifically for the medical domain, comprising approximately 360K sentence pairs.
We conduct extensive experiments comparing Google Translate, ChatGPT (gpt-3.5-turbo), state-of-the-art Vietnamese-English neural machine translation models and pre-trained bilingual/multilingual sequence-to-sequence models on our new MedEV dataset.
Experimental results show that the best performance is achieved by fine-tuning "vinai-translate" for each translation direction.
arXiv Detail & Related papers (2024-03-28T06:07:15Z) - Improving Domain-Specific Retrieval by NLI Fine-Tuning [64.79760042717822]
This article investigates the fine-tuning potential of natural language inference (NLI) data to improve information retrieval and ranking.
We employ both monolingual and multilingual sentence encoders fine-tuned by a supervised method utilizing contrastive loss and NLI data.
Our results show that NLI fine-tuning improves model performance on both tasks and in both languages, for mono- and multilingual models alike.
arXiv Detail & Related papers (2023-08-06T12:40:58Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - MasakhaNER 2.0: Africa-centric Transfer Learning for Named Entity Recognition [55.95128479289923]
African languages are spoken by over a billion people, but are underrepresented in NLP research and development.
We create the largest human-annotated NER dataset for 20 African languages.
We show that choosing the best transfer language improves zero-shot F1 scores by an average of 14 points.
arXiv Detail & Related papers (2022-10-22T08:53:14Z) - Sentence Extraction-Based Machine Reading Comprehension for Vietnamese [0.2446672595462589]
We introduce the UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in Vietnamese language.
The dataset comprises 23,074 question-answer pairs based on 5,109 passages from 174 Vietnamese Wikipedia articles.
Our experiments show that the best machine model is XLM-R_Large, which achieves an exact match (EM) score of 85.97% and an F1-score of 88.77% on our dataset.
arXiv Detail & Related papers (2021-05-19T10:22:27Z) - A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese [11.782566169354725]
We present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese.
We find that automatic Vietnamese word segmentation improves the parsing results of both baselines.
PhoBERT for Vietnamese helps produce higher performances than the recent best multilingual language model XLM-R.
arXiv Detail & Related papers (2020-10-05T09:54:51Z) - A Vietnamese Dataset for Evaluating Machine Reading Comprehension [2.7528170226206443]
We present UIT-ViQuAD, a new dataset for Vietnamese, a low-resource language, to evaluate machine reading comprehension models.
This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
We conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD.
arXiv Detail & Related papers (2020-09-30T15:06:56Z) - TICO-19: the Translation Initiative for Covid-19 [112.5601530395345]
The Translation Initiative for COvid-19 (TICO-19) has made test and development data available to AI and MT researchers in 35 different languages.
The same data is translated into all of the languages represented, meaning that testing or development can be done for any pairing of languages in the set.
arXiv Detail & Related papers (2020-07-03T16:26:17Z) - CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization [53.67205506042232]
CO-Search is a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature.
To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations.
We evaluate our system on the data of the TREC-COVID information retrieval challenge.
arXiv Detail & Related papers (2020-06-17T01:32:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.