LaoPLM: Pre-trained Language Models for Lao
- URL: http://arxiv.org/abs/2110.05896v3
- Date: Thu, 14 Oct 2021 09:01:34 GMT
- Title: LaoPLM: Pre-trained Language Models for Lao
- Authors: Nankai Lin, Yingwen Fu, Chuwei Chen, Ziyu Yang, and Shengyi Jiang
- Abstract summary: Pre-trained language models (PLMs) can capture different levels of concepts in context and hence generate universal language representations.
Although PLMs have been widely used in most NLP applications, they are under-represented in Lao NLP research.
We construct a text classification dataset to alleviate the resource-scarce situation of the Lao language.
We present the first transformer-based PLMs for Lao in four versions: BERT-small, BERT-base, ELECTRA-small, and ELECTRA-base, and evaluate them on two downstream tasks: part-of-speech tagging and text classification.
- Score: 3.2146309563776416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Trained on large corpora, pre-trained language models (PLMs) can capture
different levels of concepts in context and hence generate universal language
representations. They can benefit multiple downstream natural language
processing (NLP) tasks. Although PLMs have been widely used in most NLP
applications, especially for high-resource languages such as English, they are
under-represented in Lao NLP research. Previous work on Lao has been hampered
by the lack of annotated datasets and the sparsity of language resources. In
this work, we construct a text classification dataset to alleviate the
resource-scarce situation of the Lao language. We additionally present the first
transformer-based PLMs for Lao in four versions: BERT-small, BERT-base,
ELECTRA-small, and ELECTRA-base, and evaluate them on two downstream tasks:
part-of-speech tagging and text classification. Experiments demonstrate the
effectiveness of our Lao models. We will release our models and datasets to the
community, hoping to facilitate the future development of Lao NLP applications.
Related papers
- Language Portability Strategies for Open-domain Dialogue with Pre-trained Language Models from High to Low Resource Languages [1.7436854281619139]
We propose a study of linguistic portability strategies of large pre-trained language models (PLMs) used for open-domain dialogue systems.
In particular, the target low-resource language (L_T) will be simulated with French, as it lacks task-specific resources.
arXiv Detail & Related papers (2024-07-01T14:20:54Z) - CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models [59.91221728187576]
This paper introduces the CMU Linguistic Annotation Backend, an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models.
CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages.
arXiv Detail & Related papers (2024-04-03T02:21:46Z) - Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z) - MoSECroT: Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer [50.40191599304911]
We introduce MoSECroT (Model Stitching with Static Word Embeddings for Crosslingual Zero-shot Transfer).
In this paper, we present the first framework that leverages relative representations to construct a common space for the embeddings of a source language PLM and the static word embeddings of a target language.
We show that although our proposed framework is competitive with weak baselines when addressing MoSECroT, it fails to achieve competitive results compared with some strong baselines.
arXiv Detail & Related papers (2024-01-09T21:09:07Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - Annotated Dataset Creation through General Purpose Language Models for non-English Medical NLP [0.5482532589225552]
In our work, we suggest leveraging pretrained language models for training data acquisition.
We create a custom dataset which we use to train a medical NER model for German texts, GPTNERMED.
arXiv Detail & Related papers (2022-08-30T18:42:55Z) - Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a MultiLingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into multilingual.
arXiv Detail & Related papers (2022-05-29T08:53:22Z) - Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank leads to performance close to that obtained with the same architecture pre-trained on large multilingual and monolingual models.
arXiv Detail & Related papers (2021-10-26T14:59:16Z) - BanglaBERT: Combating Embedding Barrier for Low-Resource Language Understanding [1.7000879291900044]
We build a Bangla natural language understanding model pre-trained on 18.6 GB data we crawled from top Bangla sites on the internet.
Our model outperforms multilingual baselines and previous state-of-the-art results by 1-6%.
We identify a major shortcoming of multilingual models that hurts performance for low-resource languages that don't share writing scripts with any high-resource one.
arXiv Detail & Related papers (2021-01-01T09:28:45Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)