TiBERT: Tibetan Pre-trained Language Model
- URL: http://arxiv.org/abs/2205.07303v1
- Date: Sun, 15 May 2022 14:45:08 GMT
- Title: TiBERT: Tibetan Pre-trained Language Model
- Authors: Yuan Sun, Sisi Liu, Junjie Deng, Xiaobing Zhao
- Abstract summary: This paper collects large-scale training data from Tibetan websites and constructs a vocabulary that covers 99.95% of the words in the corpus using SentencePiece.
We apply TiBERT to the downstream tasks of text classification and question generation, and compare it with classic models and multilingual pre-trained models.
- Score: 2.9554549423413303
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The pre-trained language model is trained on large-scale unlabeled text and
can achieve state-of-the-art results in many downstream tasks. However,
current pre-trained language models mainly cover Chinese and English. For a
low-resource language such as Tibetan, there is a lack of a monolingual
pre-trained model. To promote the development of Tibetan natural language
processing, this paper collects large-scale training data from Tibetan
websites and constructs a vocabulary that covers 99.95% of the words in the
corpus using SentencePiece. We then train the Tibetan monolingual pre-trained
language model named TiBERT on this data and vocabulary. Finally, we apply
TiBERT to the downstream tasks of text classification and question generation
and compare it with classic models and multilingual pre-trained models. The
experimental results show that TiBERT achieves the best performance. Our
model is published at http://tibert.cmli-nlp.com/
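The vocabulary-construction step described in the abstract can be reproduced in outline with the SentencePiece library. The following is a minimal sketch, assuming the crawled text sits in a local file named tibetan_corpus.txt; the file name, vocabulary size, and model type are placeholders rather than the paper's actual settings.

```python
# Minimal sketch of the vocabulary-construction step, not the paper's exact
# configuration. File name, vocab size and model type are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",      # crawled Tibetan text, one sentence per line
    model_prefix="tibert_sp",        # writes tibert_sp.model and tibert_sp.vocab
    vocab_size=32000,                # placeholder; tuned until the vocabulary
                                     # covers ~99.95% of the words in the corpus
    character_coverage=1.0,          # keep every character of the Tibetan script
    model_type="unigram",            # placeholder model type
)

sp = spm.SentencePieceProcessor(model_file="tibert_sp.model")
print(sp.encode("བོད་ཡིག", out_type=str))   # tokenize a Tibetan string into subword pieces
```

A vocabulary trained this way, together with the crawled corpus, would then feed a standard BERT-style masked-language-model pre-training run.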
Related papers
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Pre-training Data Quality and Quantity for a Low-Resource Language: New Corpus and BERT Models for Maltese [4.4681678689625715]
We analyse the effect of pre-training with monolingual data for a low-resource language.
We present a newly created corpus for Maltese, and determine the effect that the pre-training data size and domain have on the downstream performance.
We compare two models on the new corpus: a monolingual BERT model trained from scratch (BERTu), and a further pre-trained multilingual BERT (mBERTu).
arXiv Detail & Related papers (2022-05-21T06:44:59Z)
- Can Character-based Language Models Improve Downstream Task Performance in Low-Resource and Noisy Language Scenarios? [0.0]
We focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi.
We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank achieves performance close to that of the same architecture pre-trained on large multilingual and monolingual corpora.
arXiv Detail & Related papers (2021-10-26T14:59:16Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
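For intuition about how a smaller model's parameters can seed a larger one, the toy sketch below performs Net2Net-style function-preserving width expansion of a single linear layer. It is not bert2BERT's actual FPI/AKI procedure for transformer blocks, and the helper names widen_linear and fix_next_linear are hypothetical.

```python
import torch

def widen_linear(layer: torch.nn.Linear, new_out: int):
    """Widen a layer's output dimension by duplicating randomly chosen units.
    Returns the widened layer and the duplication index map."""
    old_out, in_dim = layer.weight.shape
    idx = torch.cat([torch.arange(old_out),
                     torch.randint(0, old_out, (new_out - old_out,))])
    wide = torch.nn.Linear(in_dim, new_out)
    with torch.no_grad():
        wide.weight.copy_(layer.weight[idx])
        wide.bias.copy_(layer.bias[idx])
    return wide, idx

def fix_next_linear(next_layer: torch.nn.Linear, idx: torch.Tensor):
    """Rescale the following layer's incoming weights so the widened network
    computes exactly the same function as the original one."""
    counts = torch.bincount(idx).float()
    fixed = torch.nn.Linear(len(idx), next_layer.out_features)
    with torch.no_grad():
        fixed.weight.copy_(next_layer.weight[:, idx] / counts[idx])
        fixed.bias.copy_(next_layer.bias)
    return fixed

# Quick check: the widened two-layer network matches the original output.
small1, small2 = torch.nn.Linear(8, 16), torch.nn.Linear(16, 4)
big1, idx = widen_linear(small1, 32)
big2 = fix_next_linear(small2, idx)
x = torch.randn(3, 8)
assert torch.allclose(small2(torch.relu(small1(x))),
                      big2(torch.relu(big1(x))), atol=1e-5)
```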
- Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models [62.41139712595334]
We propose a novel pre-training paradigm for Chinese -- Lattice-BERT.
We construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers.
We show that our model can bring an average increase of 1.5% under the 12-layer setting.
arXiv Detail & Related papers (2021-04-15T02:36:49Z)
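For intuition only, the following sketch builds a character-word lattice for a sentence with a toy lexicon. It is a guess at the general idea behind Lattice-BERT, not its actual lattice construction, positional encoding, or masking scheme; the lexicon and the helper name build_lattice are made up for illustration.

```python
def build_lattice(sentence: str, lexicon: set[str], max_word_len: int = 4):
    """Return lattice units as (start, end, text) spans covering the sentence.

    Every single character is a unit; any lexicon word found in the sentence
    is added as an additional, overlapping unit."""
    units = [(i, i + 1, ch) for i, ch in enumerate(sentence)]   # character nodes
    for i in range(len(sentence)):
        for j in range(i + 2, min(i + max_word_len, len(sentence)) + 1):
            if sentence[i:j] in lexicon:
                units.append((i, j, sentence[i:j]))              # word nodes
    return sorted(units)

lexicon = {"北京", "北京大学", "大学"}          # toy lexicon (assumption)
print(build_lattice("北京大学", lexicon))
# [(0, 1, '北'), (0, 2, '北京'), (0, 4, '北京大学'), (1, 2, '京'),
#  (2, 3, '大'), (2, 4, '大学'), (3, 4, '学')]
```

All of these character and word units would then be fed jointly into the transformer, rather than committing to a single segmentation.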
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
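The baseline mechanics of extending a pretrained multilingual model's vocabulary with tokens from an unseen script can be shown with the Hugging Face transformers API. This generic sketch does not reproduce the paper's data-efficient embedding-initialisation methods; the token list is a placeholder.

```python
# Generic vocabulary-extension sketch, not the paper's method.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

new_script_tokens = ["ཀ", "ཁ", "ག"]          # placeholder tokens from the target script
num_added = tokenizer.add_tokens(new_script_tokens)

# Grow the embedding matrix; the new rows are randomly initialised here and
# would then be trained on target-language text (e.g. with continued MLM).
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```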
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Multilingual Translation with Extensible Multilingual Pretraining and Finetuning [77.33262578776291]
Previous work has demonstrated that machine translation systems can be created by finetuning on bitext.
We show that multilingual translation models can be created through multilingual finetuning.
We demonstrate that pretrained models can be extended to incorporate additional languages without loss of performance.
arXiv Detail & Related papers (2020-08-02T05:36:55Z)
- WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes a monolingual BERT for the Persian language (ParsBERT).
It shows its state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores in all datasets, including existing ones as well as composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- Give your Text Representation Models some Love: the Case for Basque [24.76979832867631]
Word embeddings and pre-trained language models make it possible to build rich representations of text.
Many small companies and research groups tend to use models that have been pre-trained and made available by third parties.
This is suboptimal as, for many languages, the models have been trained on smaller (or lower quality) corpora.
We show that a number of monolingual models trained with larger Basque corpora produce much better results than publicly available versions in downstream NLP tasks.
arXiv Detail & Related papers (2020-03-31T18:01:56Z)