ViDeBERTa: A powerful pre-trained language model for Vietnamese
- URL: http://arxiv.org/abs/2301.10439v1
- Date: Wed, 25 Jan 2023 07:26:54 GMT
- Title: ViDeBERTa: A powerful pre-trained language model for Vietnamese
- Authors: Cong Dao Tran, Nhut Huy Pham, Anh Nguyen, Truong Son Hy, Tu Vu
- Abstract summary: This paper presents ViDeBERTa, a new pre-trained monolingual language model for Vietnamese.
Three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and ViDeBERTa_large - are pre-trained on a large-scale corpus of high-quality and diverse Vietnamese texts.
We fine-tune and evaluate our model on three important downstream natural language tasks: Part-of-speech tagging, Named-entity recognition, and Question answering.
- Score: 10.000783498978604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents ViDeBERTa, a new pre-trained monolingual language model
for Vietnamese, with three versions - ViDeBERTa_xsmall, ViDeBERTa_base, and
ViDeBERTa_large, which are pre-trained on a large-scale corpus of high-quality
and diverse Vietnamese texts using the DeBERTa architecture. Although many
successful pre-trained language models based on Transformer have been widely
proposed for the English language, there are still few pre-trained models for
Vietnamese, a low-resource language, that achieve strong results on downstream
tasks, especially Question answering. We fine-tune and evaluate our model on
three important downstream natural language tasks: Part-of-speech tagging,
Named-entity recognition, and Question answering. The empirical results
demonstrate that ViDeBERTa with far fewer parameters surpasses the previous
state-of-the-art models on multiple Vietnamese-specific natural language
understanding tasks. Notably, ViDeBERTa_base, with 86M parameters (only about
23% of the 370M parameters of PhoBERT_large), still performs on par with or
better than the previous state-of-the-art model. Our ViDeBERTa models
are available at: https://github.com/HySonLab/ViDeBERTa.
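For readers who want to try the released checkpoints, the sketch below shows one plausible way to load ViDeBERTa for one of the evaluated downstream tasks (POS tagging or NER) with the Hugging Face Transformers library. This is a minimal sketch, not the authors' training code: the checkpoint identifier and the label count are assumptions made for illustration; consult https://github.com/HySonLab/ViDeBERTa for the actual model names and fine-tuning recipes.
```python
# Minimal sketch: loading a ViDeBERTa checkpoint for token classification
# (e.g., POS tagging or NER) with Hugging Face Transformers.
# Assumption: the checkpoint id "Fsoft-AIC/videberta-base" and the label
# count below are illustrative; check the authors' repository for the
# released names.
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "Fsoft-AIC/videberta-base"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=9,  # set to the tag-set size of your POS/NER dataset
)

# Run a forward pass on a single Vietnamese sentence.
text = "Hà Nội là thủ đô của Việt Nam."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
pred_tag_ids = outputs.logits.argmax(dim=-1)  # one predicted tag id per sub-token
print(pred_tag_ids)
```
Fine-tuning would then proceed as for any Transformers token-classification model, for example with the Trainer API on a labeled Vietnamese dataset.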
Related papers
- Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation (2023-11-27) [82.5217996570387]
  We adapt a pre-trained language model for auto-regressive text-to-image generation.
  We find that pre-trained language models offer limited help.
- ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing (2023-10-17) [1.1765925931670576]
  We present the first monolingual pre-trained language model for Vietnamese social media texts, ViSoBERT.
  Our experiments demonstrate that ViSoBERT, with far fewer parameters, surpasses the previous state-of-the-art models on multiple Vietnamese social media tasks.
- M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval (2022-11-02) [56.49878599920353]
  This work investigates the use of large-scale, English-only pre-trained models (CLIP and HuBERT) for multilingual image-speech retrieval.
  For non-English image-speech retrieval, we outperform the current state of the art by a wide margin, both when training separate models for each language and with a single model that processes speech in all three languages.
- Language Model Pre-Training with Sparse Latent Typing (2022-10-23) [66.75786739499604]
  We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
  Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
- Bidirectional Language Models Are Also Few-shot Learners (2022-09-29) [54.37445173284831]
  We present SAP (Sequential Autoregressive Prompting), a technique that enables the prompting of bidirectional models.
  We show SAP is effective on question answering and summarization.
  For the first time, our results demonstrate that prompt-based learning is an emergent property of a broader class of language models.
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese (2022-05-21) [4.4681678689625715]
  We analyse the effect of pre-training with monolingual data for a low-resource language.
  We present a newly created corpus for Maltese and determine the effect that pre-training data size and domain have on downstream performance.
  We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
- Training Language Models with Natural Language Feedback (2022-04-29) [51.36137482891037]
  We learn from language feedback on model outputs using a three-step learning algorithm.
  In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
  Using only 100 samples of human-written feedback, our learning algorithm fine-tunes a GPT-3 model to roughly human-level summarization.
- From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection (2021-02-24) [8.602181445598776]
  We propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection.
  Our experiments show that the proposed pipeline significantly boosts performance, achieving a new state of the art on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
- WikiBERT models: deep transfer learning for many languages (2020-06-02) [1.3455090151301572]
  We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
  We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
- Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning (2020-04-08) [70.81910984985683]
  We propose an effective way to fine-tune multiple downstream generation tasks simultaneously using a single, large pre-trained model.
  Experiments on five diverse language generation tasks show that by using only an additional 2-3% of parameters per task, our model can maintain or even improve on the performance of fine-tuning the whole model.
- PhoBERT: Pre-trained language models for Vietnamese (2020-03-02) [11.685916685552982]
  We present PhoBERT, the first public large-scale monolingual language models pre-trained for Vietnamese.
  Experimental results show that PhoBERT consistently outperforms the recent best pre-trained multilingual model XLM-R.
  We release PhoBERT to facilitate future research and downstream applications for Vietnamese NLP.
This list is automatically generated from the titles and abstracts of the papers on this site.