gaBERT -- an Irish Language Model
- URL: http://arxiv.org/abs/2107.12930v2
- Date: Wed, 28 Jul 2021 08:20:27 GMT
- Title: gaBERT -- an Irish Language Model
- Authors: James Barry, Joachim Wagner, Lauren Cassidy, Alan Cowap, Teresa Lynn,
Abigail Walsh, Mícheál J. Ó Meachair, Jennifer Foster
- Abstract summary: gaBERT is a monolingual BERT model for the Irish language.
We show how different filtering criteria, vocabulary size and the choice of subword tokenisation model affect downstream performance.
- Score: 7.834915319072005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The BERT family of neural language models has become highly popular due to
their ability to provide sequences of text with rich context-sensitive token
encodings which are able to generalise well to many Natural Language Processing
tasks. Over 120 monolingual BERT models covering over 50 languages have been
released, as well as a multilingual model trained on 104 languages. We
introduce gaBERT, a monolingual BERT model for the Irish language. We compare
our gaBERT model to multilingual BERT and show that gaBERT provides better
representations for a downstream parsing task. We also show how different
filtering criteria, vocabulary size and the choice of subword tokenisation
model affect downstream performance. We release gaBERT and related code to the
community.
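The abstract notes that vocabulary size and the choice of subword tokenisation model affect downstream performance. A minimal byte-pair-encoding (BPE) sketch illustrates one facet of this: a smaller merge budget (a rough proxy for a smaller subword vocabulary) splits the same word into more pieces. The toy corpus of Irish words and the merge counts are illustrative only, not the actual gaBERT configuration.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learn BPE merges from a list of words (toy implementation)."""
    # Each word is a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[a, b] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the chosen merge to every word in the vocabulary.
        new_vocab = Counter()
        for word, freq in vocab.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Segment a word by applying the learned merges in learning order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = ["teanga", "teanga", "teangacha", "gaeilge"]  # toy Irish words
few = learn_bpe(corpus, 3)    # small "vocabulary": few merges
many = learn_bpe(corpus, 40)  # large "vocabulary": many merges
```

With fewer merges, `segment("teanga", few)` yields more pieces than `segment("teanga", many)`. A real pipeline would use a library tokeniser trained on a large corpus, but the effect of vocabulary size on segmentation granularity is the same.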
Related papers
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z) - FBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest available English offensive language identification corpus, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
arXiv Detail & Related papers (2021-09-10T19:19:26Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - Towards Fully Bilingual Deep Language Modeling [1.3455090151301572]
We consider whether it is possible to pre-train a bilingual model for two remotely related languages without compromising performance at either language.
We create a Finnish-English bilingual BERT model and evaluate its performance on datasets used to evaluate the corresponding monolingual models.
Our bilingual model performs on par with Google's original English BERT on GLUE and nearly matches the performance of monolingual Finnish BERT on a range of Finnish NLP tasks.
arXiv Detail & Related papers (2020-10-22T12:22:50Z) - Looking for Clues of Language in Multilingual BERT to Improve
Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z) - It's not Greek to mBERT: Inducing Word-Level Translations from
Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z) - Evaluating Multilingual BERT for Estonian [0.8057006406834467]
We evaluate four multilingual models -- multilingual BERT, multilingual distilled BERT, XLM and XLM-RoBERTa -- on several NLP tasks.
Our results show that multilingual BERT models can generalise well on different Estonian NLP tasks.
arXiv Detail & Related papers (2020-10-01T14:48:31Z) - CoSDA-ML: Multi-Lingual Code-Switching Data Augmentation for Zero-Shot
Cross-Lingual NLP [68.2650714613869]
We propose a data augmentation framework to generate multi-lingual code-switching data to fine-tune mBERT.
Compared with the existing work, our method does not rely on bilingual sentences for training, and requires only one training process for multiple target languages.
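The augmentation idea described above, substituting words with translations drawn from a bilingual lexicon to create code-switched training text, can be sketched in a few lines. The tiny English-to-Irish lexicon below is an illustrative stand-in; CoSDA-ML's actual lexicons and substitution policy may differ.

```python
import random

def code_switch(tokens, lexicon, p=0.5, seed=0):
    """Replace each token with a lexicon translation with probability p."""
    rng = random.Random(seed)  # seeded for reproducibility
    out = []
    for tok in tokens:
        translations = lexicon.get(tok.lower())
        if translations and rng.random() < p:
            out.append(rng.choice(translations))
        else:
            out.append(tok)  # no translation available, or not sampled
    return out

# Illustrative English -> Irish lexicon (not from the paper).
lexicon = {"language": ["teanga"], "irish": ["Gaeilge"], "model": ["samhail"]}
augmented = code_switch(["the", "irish", "language", "model"], lexicon, p=1.0)
```

With `p=1.0` every token with a lexicon entry is swapped, so the example sentence becomes partially Irish while keeping its length and word order; at lower `p` the output mixes both languages, which is the code-switching signal used for fine-tuning.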
arXiv Detail & Related papers (2020-06-11T13:15:59Z) - WikiBERT models: deep transfer learning for many languages [1.3455090151301572]
We introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data.
We assess the merits of these models using the state-of-the-art UDify on Universal Dependencies data.
arXiv Detail & Related papers (2020-06-02T11:57:53Z) - What the [MASK]? Making Sense of Language-Specific BERT Models [39.54532211263058]
This paper presents the current state of the art in language-specific BERT models.
Our aim is to provide an overview of the commonalities and differences between language-specific BERT models and mBERT.
arXiv Detail & Related papers (2020-03-05T20:42:51Z) - RobBERT: a Dutch RoBERTa-based Language Model [9.797319790710711]
We use RoBERTa to train a Dutch language model called RobBERT.
We measure its performance on various tasks as well as the importance of the fine-tuning dataset size.
RobBERT improves state-of-the-art results for various tasks, and especially significantly outperforms other models when dealing with smaller datasets.
arXiv Detail & Related papers (2020-01-17T13:25:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.