PhoBERT: Pre-trained language models for Vietnamese
- URL: http://arxiv.org/abs/2003.00744v3
- Date: Mon, 5 Oct 2020 09:53:19 GMT
- Title: PhoBERT: Pre-trained language models for Vietnamese
- Authors: Dat Quoc Nguyen and Anh Tuan Nguyen
- Abstract summary: We present PhoBERT, the first public large-scale monolingual language models pre-trained for Vietnamese.
Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R.
We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP.
- Score: 11.685916685552982
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present PhoBERT with two versions, PhoBERT-base and PhoBERT-large, the
first public large-scale monolingual language models pre-trained for
Vietnamese. Experimental results show that PhoBERT consistently outperforms the
recent best pre-trained multilingual model XLM-R (Conneau et al., 2020) and
improves the state-of-the-art in multiple Vietnamese-specific NLP tasks
including Part-of-speech tagging, Dependency parsing, Named-entity recognition
and Natural language inference. We release PhoBERT to facilitate future
research and downstream applications for Vietnamese NLP. Our PhoBERT models are
available at https://github.com/VinAIResearch/PhoBERT
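For reference, below is a minimal usage sketch with the Hugging Face transformers library. It assumes the released checkpoints are published on the Hugging Face Hub under the identifier vinai/phobert-base and that the input text has already been word-segmented (e.g. with a Vietnamese word segmenter), since PhoBERT operates on word-level rather than syllable-level input; treat it as an illustrative sketch, not the authors' reference code.

import torch
from transformers import AutoModel, AutoTokenizer

# Load the base model and its tokenizer (assumed Hub identifier: vinai/phobert-base).
phobert = AutoModel.from_pretrained("vinai/phobert-base")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")

# Input is assumed to be word-segmented already; multi-syllable words joined with "_".
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    outputs = phobert(input_ids)

# Contextual embeddings for each (sub)word token.
print(outputs.last_hidden_state.shape)  # e.g. (1, sequence_length, 768) for the base model

A task-specific head can then be fine-tuned on top of these contextual embeddings for the downstream tasks listed in the abstract (POS tagging, dependency parsing, NER, NLI).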
Related papers
- VLUE: A New Benchmark and Multi-task Knowledge Transfer Learning for Vietnamese Natural Language Understanding [1.813644606477824]
We introduce the first Vietnamese Language Understanding Evaluation (VLUE) benchmark.
The VLUE benchmark encompasses five datasets covering different NLU tasks, including text classification, span extraction, and natural language understanding.
We present CafeBERT, a new state-of-the-art pre-trained model that achieves superior results across all tasks in the VLUE benchmark.
arXiv Detail & Related papers (2024-03-23T16:26:49Z)
- ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing [1.1765925931670576]
We present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT.
Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks.
arXiv Detail & Related papers (2023-10-17T11:34:50Z)
- Learning Cross-lingual Visual Speech Representations [108.68531445641769]
Cross-lingual self-supervised visual representation learning has been a growing research topic in the last few years.
We use the recently proposed Raw Audio-Visual Speech Encoders (RAVEn) framework to pre-train an audio-visual model with unlabelled data.
Our experiments show that multilingual models trained with more data outperform monolingual ones, but when the amount of data is kept fixed, monolingual models tend to reach better performance.
arXiv Detail & Related papers (2023-03-14T17:05:08Z)
- ViDeBERTa: A powerful pre-trained language model for Vietnamese [10.000783498978604]
This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese.
Three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large - are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts.
We fine-tune and evaluate our model on three important downstream natural language tasks: Part-of-speech tagging, Named-entity recognition, and Question answering.
arXiv Detail & Related papers (2023-01-25T07:26:54Z)
- LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z)
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval [56.49878599920353]
This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
arXiv Detail & Related papers (2022-11-02T14:54:45Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- From FreEM to D'AlemBERT: a Large Corpus and a Language Model for Early Modern French [57.886210204774834]
We present our efforts to develop NLP tools for Early Modern French (historical French from the 16th to the 18th centuries).
We present the FreEM_max corpus of Early Modern French and D'AlemBERT, a RoBERTa-based language model trained on FreEM_max.
arXiv Detail & Related papers (2022-02-18T22:17:22Z)
- BARTpho: Pre-trained Sequence-to-Sequence Models for Vietnamese [5.955739135932037]
We present BARTpho, the first public large-scale monolingual sequence-to-sequence models pre-trained for Vietnamese.
Our BARTpho uses the "large" architecture and pre-training scheme of the sequence-to-sequence denoising model BART.
Experiments on a downstream task of Vietnamese text summarization show that our BARTpho outperforms the strong baseline mBART.
arXiv Detail & Related papers (2021-09-20T17:14:22Z)
- PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing [8.558842542068778]
We present the first multi-task learning model -- named PhoNLP -- for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing.
Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results.
arXiv Detail & Related papers (2021-01-05T12:13:09Z)
- Revisiting Pre-Trained Models for Chinese Natural Language Processing [73.65780892128389]
We revisit Chinese pre-trained language models to examine their effectiveness in a non-English language.
We also propose a model called MacBERT, which improves upon RoBERTa in several ways.
arXiv Detail & Related papers (2020-04-29T02:08:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.