PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech
tagging, named entity recognition and dependency parsing
- URL: http://arxiv.org/abs/2101.01476v2
- Date: Thu, 8 Apr 2021 17:31:16 GMT
- Title: PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech
tagging, named entity recognition and dependency parsing
- Authors: Linh The Nguyen, Dat Quoc Nguyen
- Abstract summary: We present the first multi-task learning model -- named PhoNLP -- for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing.
Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results.
- Score: 8.558842542068778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present the first multi-task learning model -- named PhoNLP -- for joint
Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and
dependency parsing. Experiments on Vietnamese benchmark datasets show that
PhoNLP produces state-of-the-art results, outperforming a single-task learning
approach that fine-tunes the pre-trained Vietnamese language model PhoBERT
(Nguyen and Nguyen, 2020) for each task independently. We publicly release
PhoNLP as an open-source toolkit under the Apache License 2.0. Although we
develop PhoNLP for Vietnamese, our PhoNLP training and evaluation command
scripts can in fact work directly for other languages that have a pre-trained
BERT-based language model and gold-annotated corpora available for the three
tasks of POS tagging, NER and dependency parsing. We hope that PhoNLP can serve
as a strong baseline and a useful toolkit for future NLP research and
applications, not only for Vietnamese but also for other languages. PhoNLP is
available at: https://github.com/VinAIResearch/PhoNLP
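
For reference, here is a minimal usage sketch of the released Python toolkit, following the API documented in the PhoNLP repository (phonlp.download, phonlp.load, annotate, print_out); exact function names and arguments may differ across toolkit versions:

    # pip install phonlp
    import phonlp

    # Download the pre-trained PhoNLP model into a local folder
    # (the save_dir path here is illustrative).
    phonlp.download(save_dir='./pretrained_phonlp')

    # Load the downloaded model.
    model = phonlp.load(save_dir='./pretrained_phonlp')

    # Jointly annotate a word-segmented Vietnamese sentence with POS tags,
    # NER labels and dependency arcs, then print the annotation.
    model.print_out(model.annotate(text="Tôi đang làm_việc tại VinAI ."))

Note that the toolkit expects word-segmented input, consistent with PhoBERT's word-level pre-training.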
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
To date, there has been no publicly available NLI corpus for Romanian.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora and report superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing [1.1765925931670576]
We present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT.
Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks.
arXiv Detail & Related papers (2023-10-17T11:34:50Z)
- Binding Language Models in Symbolic Languages [146.3027328556881]
Binder is a training-free neural-symbolic framework that maps the task input to a program.
In the parsing stage, Codex is able to identify the part of the task input that cannot be answered by the original programming language.
In the execution stage, Codex can perform versatile functionalities given proper prompts in the API calls.
arXiv Detail & Related papers (2022-10-06T12:55:17Z)
- COVID-19 Named Entity Recognition for Vietnamese [6.17059264011429]
We present the first manually-annotated COVID-19 domain-specific dataset for Vietnamese.
Our dataset is annotated for the named entity recognition task with newly-defined entity types.
Our dataset also contains the largest number of entities among existing Vietnamese NER datasets.
arXiv Detail & Related papers (2021-04-08T16:35:34Z)
- CPM: A Large-scale Generative Chinese Pre-trained Language Model [76.65305358932393]
We release the Chinese Pre-trained Language Model (CPM) with generative pre-training on large-scale Chinese training data.
CPM achieves strong performance on many NLP tasks in the settings of few-shot (even zero-shot) learning.
arXiv Detail & Related papers (2020-12-01T11:32:56Z)
- A Pilot Study of Text-to-SQL Semantic Parsing for Vietnamese [11.782566169354725]
We present the first public large-scale Text-to-SQL semantic parsing dataset for Vietnamese.
We find that automatic Vietnamese word segmentation improves the parsing results of both baselines.
PhoBERT for Vietnamese helps produce higher performance than the recent best multilingual language model XLM-R.
arXiv Detail & Related papers (2020-10-05T09:54:51Z)
- N-LTP: An Open-source Neural Language Technology Platform for Chinese [68.58732970171747]
N-LTP is an open-source neural language technology platform supporting six fundamental Chinese NLP tasks.
N-LTP adopts a multi-task framework with a shared pre-trained model, which has the advantage of capturing shared knowledge across relevant Chinese tasks.
arXiv Detail & Related papers (2020-09-24T11:45:39Z)
- FewJoint: A Few-shot Learning Benchmark for Joint Language Understanding [55.38905499274026]
Few-shot learning is one of the key future steps in machine learning.
FewJoint is a novel Few-Shot Learning benchmark for NLP.
arXiv Detail & Related papers (2020-09-17T08:17:12Z)
- PhoBERT: Pre-trained language models for Vietnamese [11.685916685552982]
We present PhoBERT, the first public large-scale monolingual language models pre-trained for Vietnamese.
Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R.
We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP.
arXiv Detail & Related papers (2020-03-02T10:21:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.