Operationalizing a National Digital Library: The Case for a Norwegian
Transformer Model
- URL: http://arxiv.org/abs/2104.09617v1
- Date: Mon, 19 Apr 2021 20:36:24 GMT
- Title: Operationalizing a National Digital Library: The Case for a Norwegian
Transformer Model
- Authors: Per E Kummervold, Javier de la Rosa, Freddy Wetjen, Svein Arne
Brygfjeld
- Abstract summary: We show the process of building a large-scale training set from digital and digitized collections at a national library.
The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we show the process of building a large-scale training set from
digital and digitized collections at a national library. The resulting
Bidirectional Encoder Representations from Transformers (BERT)-based language
model for Norwegian outperforms multilingual BERT (mBERT) models in several
token and sequence classification tasks for both Norwegian Bokmål and
Norwegian Nynorsk. Our model also improves the mBERT performance for other
languages present in the corpus such as English, Swedish, and Danish. For
languages not included in the corpus, the weights degrade moderately while
keeping strong multilingual properties. Therefore, we show that building
high-quality models within a memory institution using somewhat noisy optical
character recognition (OCR) content is feasible, and we hope to pave the way
for other memory institutions to follow.
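As a rough illustration of how such a model would be consumed downstream, the sketch below loads a Norwegian BERT checkpoint with Hugging Face transformers and attaches a token classification head. The model identifier NbAiLab/nb-bert-base and the label count are assumptions for the example, not details taken from the paper.

```python
# Minimal sketch (not from the paper): a Norwegian BERT checkpoint used for
# token classification with Hugging Face transformers. The model id
# "NbAiLab/nb-bert-base" and num_labels=9 are assumptions; the classification
# head is randomly initialized and still needs fine-tuning on labelled data.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "NbAiLab/nb-bert-base"  # assumed Hub identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=9)
model.eval()

sentence = "Nasjonalbiblioteket ligger i Mo i Rana og i Oslo."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1).squeeze(0)

for token, label_id in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]),
                           pred_ids):
    print(f"{token}\t{label_id.item()}")
```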
Related papers
- Accidental Learners: Spoken Language Identification in Multilingual
Self-Supervised Models [11.439430077017635]
We find that pre-trained speech models optimally encode language discriminatory information in lower layers.
We demonstrate that the embeddings obtained from these layers are robust for classifying unseen languages.
We open-source the model through the NVIDIA NeMo toolkit.
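A minimal sketch of the general idea, not the authors' NeMo recipe: take hidden states from a lower transformer layer of a pretrained speech model, mean-pool them, and fit a light classifier for language ID. The wav2vec 2.0 checkpoint and the layer index are stand-ins.

```python
# Sketch of the idea only: lower-layer embeddings from a pretrained speech
# model as features for language identification. Checkpoint and layer index
# are assumptions, not the paper's setup.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

LOW_LAYER = 3  # a lower layer, following the paper's observation

def embed(waveform, sampling_rate=16_000):
    """Mean-pool one lower layer's hidden states into a fixed-size vector."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LOW_LAYER].mean(dim=1).squeeze(0).numpy()

# Given train_waveforms (1-D float arrays) and train_labels (language codes):
# clf = LogisticRegression(max_iter=1000).fit(
#     [embed(w) for w in train_waveforms], train_labels)
# predicted = clf.predict([embed(test_waveform)])
```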
arXiv Detail & Related papers (2022-11-09T18:53:59Z)
- A Warm Start and a Clean Crawled Corpus -- A Recipe for Good Language
Models [0.0]
We train several language models for Icelandic, including IceBERT, that achieve state-of-the-art performance in a variety of downstream tasks.
We introduce a new corpus of Icelandic text, the Icelandic Common Crawl Corpus (IC3), a collection of high-quality texts found online by targeting the Icelandic top-level domain (TLD).
We show that a properly cleaned crawled corpus is sufficient to achieve state-of-the-art results in NLP applications for low to medium resource languages.
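A toy sketch of the corpus-targeting step, under assumed heuristics rather than the IC3 pipeline itself: keep only documents whose host falls under the .is top-level domain and apply crude length and character-ratio filters.

```python
# Rough sketch (not the IC3 pipeline): filter crawled records by the
# Icelandic TLD and apply simple cleaning heuristics. Thresholds are
# illustrative assumptions.
from urllib.parse import urlparse

def keep_record(url: str, text: str, min_chars: int = 200) -> bool:
    host = urlparse(url).hostname or ""
    if not host.endswith(".is"):       # target the Icelandic TLD
        return False
    if len(text) < min_chars:          # drop near-empty pages
        return False
    # crude boilerplate heuristic: require a reasonable share of letters
    alpha_ratio = sum(ch.isalpha() for ch in text) / len(text)
    return alpha_ratio > 0.6

records = [
    ("https://example.is/frettir", "Löng íslensk grein um bókasöfn. " * 20),
    ("https://example.com/news", "An English page about libraries. " * 20),
]
corpus = [text for url, text in records if keep_record(url, text)]
print(len(corpus), "record(s) kept")
```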
arXiv Detail & Related papers (2022-01-14T18:45:31Z)
- Breaking Down Multilingual Machine Translation [74.24795388967907]
We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs).
Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
arXiv Detail & Related papers (2021-10-15T14:57:12Z)
- fBERT: A Neural Transformer for Identifying Offensive Content [67.12838911384024]
fBERT is a BERT model retrained on SOLID, the largest available English offensive language identification corpus, with over 1.4 million offensive instances.
We evaluate fBERT's performance on identifying offensive content on multiple English datasets and we test several thresholds for selecting instances from SOLID.
The fBERT model will be made freely available to the community.
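A hedged sketch of the threshold-selection step, with placeholder field names (the actual SOLID release may label its columns differently): keep only instances whose semi-supervised offensiveness score clears a chosen cut-off in either direction, then fine-tune a BERT classifier on the result.

```python
# Sketch only: select confidently-labelled SOLID-style instances by a score
# threshold before fine-tuning. Field names "text" and "average" and the
# 0.8 cut-off are assumptions for illustration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

THRESHOLD = 0.8  # one of several thresholds one might evaluate

def select_instances(rows, threshold=THRESHOLD):
    """rows: iterable of dicts with a raw text and a confidence score in [0, 1]."""
    return [(r["text"], int(r["average"] >= 0.5)) for r in rows
            if r["average"] >= threshold or r["average"] <= 1 - threshold]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # offensive vs. not offensive

# The selected (text, label) pairs would then be tokenized and used for
# standard fine-tuning, e.g. with the transformers Trainer API.
```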
arXiv Detail & Related papers (2021-09-10T19:19:26Z)
- Cross-lingual Transferring of Pre-trained Contextualized Language Models [73.97131976850424]
We propose TreLM, a novel framework for transferring pre-trained language models (PrLMs) across languages.
To handle differences in symbol order and sequence length between languages, we propose an intermediate "TRILayer" structure.
We show the proposed framework significantly outperforms language models trained from scratch with limited data in both performance and efficiency.
arXiv Detail & Related papers (2021-07-27T06:51:13Z)
- Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z)
- UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to their limited capacity and large differences in pretraining data, there is a profound performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
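As a generic illustration of one common adaptation step, and explicitly not the paper's own method: extend the multilingual tokenizer with tokens from the new script and resize the embedding matrix before continued pretraining on target-language text.

```python
# Generic vocabulary-extension sketch (not the paper's method): add
# new-script tokens to mBERT's tokenizer and grow the embedding matrix.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical new-script tokens mined from target-language text.
new_tokens = ["ትግርኛ", "ኣብ", "እዩ"]
num_added = tokenizer.add_tokens(new_tokens)

# New embedding rows are randomly initialized; continued MLM pretraining on
# target-language data is what actually makes them useful.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocab size is now {len(tokenizer)}")
```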
arXiv Detail & Related papers (2020-12-31T11:37:28Z)
- Learning Contextualised Cross-lingual Word Embeddings and Alignments for
Extremely Low-Resource Languages Using Parallel Corpora [63.5286019659504]
We propose a new approach for learning contextualised cross-lingual word embeddings based on a small parallel corpus.
Our method obtains word embeddings via an LSTM encoder-decoder model that simultaneously translates and reconstructs an input sentence.
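A toy PyTorch sketch of the architecture as summarized above: a shared LSTM encoder whose final state seeds two teacher-forced decoders, one translating and one reconstructing the input, while the encoder outputs serve as contextualised word embeddings. Sizes and the single-layer setup are illustrative, not the paper's configuration.

```python
# Illustrative sketch: shared encoder, two decoders (translate + reconstruct).
import torch
import torch.nn as nn

class TranslateAndReconstruct(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.trans_dec = nn.LSTM(emb, hidden, batch_first=True)
        self.recon_dec = nn.LSTM(emb, hidden, batch_first=True)
        self.trans_out = nn.Linear(hidden, tgt_vocab)
        self.recon_out = nn.Linear(hidden, src_vocab)

    def forward(self, src_ids, tgt_shifted, src_shifted):
        # enc_out holds one contextualised vector per source token.
        enc_out, state = self.encoder(self.src_emb(src_ids))
        trans_h, _ = self.trans_dec(self.tgt_emb(tgt_shifted), state)
        recon_h, _ = self.recon_dec(self.src_emb(src_shifted), state)
        return enc_out, self.trans_out(trans_h), self.recon_out(recon_h)

model = TranslateAndReconstruct(src_vocab=8000, tgt_vocab=8000)
src = torch.randint(0, 8000, (2, 7))   # toy batch: 2 sentences, 7 tokens
# Placeholder shifted inputs; real training would use shifted target/source.
embeddings, trans_logits, recon_logits = model(src, src, src)
print(embeddings.shape, trans_logits.shape, recon_logits.shape)
```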
arXiv Detail & Related papers (2020-10-27T22:24:01Z)
- Playing with Words at the National Library of Sweden -- Making a Swedish
BERT [0.0]
This paper introduces the Swedish BERT ("KB-BERT") developed by KBLab for data-driven research at the National Library of Sweden (KB).
Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish.
arXiv Detail & Related papers (2020-07-03T12:53:39Z)
- ParsBERT: Transformer-based Model for Persian Language Understanding [0.7646713951724012]
This paper proposes ParsBERT, a monolingual BERT for the Persian language.
It achieves state-of-the-art performance compared to other architectures and multilingual models.
ParsBERT obtains higher scores on all datasets, including pre-existing ones as well as newly composed ones.
arXiv Detail & Related papers (2020-05-26T05:05:32Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
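A small usage sketch, assuming the mbart-large-cc25 checkpoint on the Hugging Face hub: the pretrained denoising model loads as a standard sequence-to-sequence model, and translation quality only appears after fine-tuning on parallel data.

```python
# Usage sketch with an assumed checkpoint name; the raw pretrained model
# mostly denoises/reconstructs text and is normally fine-tuned before use
# as a real translation system.
from transformers import MBartForConditionalGeneration, MBartTokenizer

tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25",
                                           src_lang="en_XX", tgt_lang="ro_RO")
model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

batch = tokenizer("The library digitizes its collections.", return_tensors="pt")
generated = model.generate(**batch,
                           decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"],
                           max_length=40)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```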
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.