Large-Scale Contextualised Language Modelling for Norwegian
- URL: http://arxiv.org/abs/2104.06546v1
- Date: Tue, 13 Apr 2021 23:18:04 GMT
- Title: Large-Scale Contextualised Language Modelling for Norwegian
- Authors: Andrey Kutuzov, Jeremy Barnes, Erik Velldal, Lilja Øvrelid, Stephan Oepen
- Abstract summary: This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
- Score: 7.5722195869569
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present the ongoing NorLM initiative to support the creation and use of
very large contextualised language models for Norwegian (and in principle other
Nordic languages), including a ready-to-use software environment, as well as an
experience report for data preparation and training. This paper introduces the
first large-scale monolingual language models for Norwegian, based on both the
ELMo and BERT frameworks. In addition to detailing the training process, we
present contrastive benchmark results on a suite of NLP tasks for Norwegian.
For additional background and access to the data, models, and software, please
see http://norlm.nlpl.eu
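As a minimal usage sketch (assuming the Norwegian BERT model is released through the Hugging Face hub; the identifier "ltg/norbert" below is an illustrative assumption, and the actual model names are documented at http://norlm.nlpl.eu), contextualised representations could be obtained with the standard transformers API:

```python
# Minimal sketch: contextualised representations from a Norwegian BERT
# model via the Hugging Face `transformers` library.
# NOTE: the identifier "ltg/norbert" is an assumption for illustration;
# consult http://norlm.nlpl.eu for the released model names.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "ltg/norbert"  # assumed identifier, not confirmed by the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

sentence = "Nasjonalbiblioteket ligger i Oslo."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextualised vector per word piece: (batch, tokens, hidden_size)
print(outputs.last_hidden_state.shape)
```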
Related papers
- A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z)
- NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive recent breakthroughs in NLP.
To fill this gap, we compiled existing Norwegian datasets and pre-trained four Norwegian open language models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z)
- FinGPT: Large Generative Models for a Small Language [48.46240937758779]
We create large language models (LLMs) for Finnish, a language spoken by less than 0.1% of the world population.
We train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT.
We continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI.
arXiv Detail & Related papers (2023-11-03T08:05:04Z) - Cross-Lingual NER for Financial Transaction Data in Low-Resource
Languages [70.25418443146435]
We propose an efficient modeling framework for cross-lingual named entity recognition in semi-structured text data.
We employ two independent datasets of SMSs in English and Arabic, each carrying semi-structured banking transaction information.
With access to only 30 labeled samples, our model can generalize the recognition of merchants, amounts, and other fields from English to Arabic.
arXiv Detail & Related papers (2023-07-16T00:45:42Z)
- NoCoLA: The Norwegian Corpus of Linguistic Acceptability [2.538209532048867]
We present two new Norwegian datasets for evaluating language models.
NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences.
NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner (a minimal scoring sketch is given after this list).
arXiv Detail & Related papers (2023-06-13T14:11:19Z)
- NorBench -- A Benchmark for Norwegian Language Models [7.395163289937936]
We present NorBench: a suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics.
We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based).
We compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.
arXiv Detail & Related papers (2023-05-06T00:20:24Z)
- Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese [54.00582760714034]
Cross-lingual NLP transfer can be improved by exploiting data and models of high-resource languages.
We release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS) and new language models trained on all Scandinavian languages.
arXiv Detail & Related papers (2023-04-18T08:42:38Z)
- The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling [5.687459576800633]
We curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages.
This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.
arXiv Detail & Related papers (2023-03-30T06:42:22Z)
- Generalizing Multimodal Pre-training into Multilingual via Language Acquisition [54.69707237195554]
English-based Vision-Language Pre-training has achieved great success in various downstream tasks.
Some efforts have been taken to generalize this success to non-English languages through Multilingual Vision-Language Pre-training.
We propose a Multilingual Acquisition (MLA) framework that can easily generalize a monolingual Vision-Language Pre-training model into a multilingual one.
arXiv Detail & Related papers (2022-05-29T08:53:22Z)
- Operationalizing a National Digital Library: The Case for a Norwegian Transformer Model [0.0]
We show the process of building a large-scale training set from digital and digitized collections at a national library.
The resulting Bidirectional Encoder Representations from Transformers (BERT)-based language model for Norwegian outperforms multilingual BERT (mBERT) models in several token and sequence classification tasks.
arXiv Detail & Related papers (2021-04-19T20:36:24Z)
- Reusing a Pretrained Language Model on Languages with Limited Corpora for Unsupervised NMT [129.99918589405675]
We present an effective approach that reuses an LM that is pretrained only on the high-resource language.
The monolingual LM is fine-tuned on both languages and is then used to initialize a UNMT model.
Our approach, RE-LM, outperforms a competitive cross-lingual pretraining model (XLM) in English-Macedonian (En-Mk) and English-Albanian (En-Sq).
arXiv Detail & Related papers (2020-09-16T11:37:10Z)
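The zero-shot grammatical-judgement setup referenced in the NoCoLA entry above can be illustrated with a short sketch: score both sentences of a minimal pair by masked-LM pseudo-log-likelihood and check whether the acceptable variant scores higher. The model identifier below is an assumed placeholder, and this scoring recipe is a common approximation rather than the paper's exact protocol.

```python
# Minimal sketch of zero-shot acceptability scoring in the spirit of
# NoCoLA_zero: rank a minimal pair by masked-LM pseudo-log-likelihood.
# NOTE: the model name is an assumed placeholder, and this scoring recipe
# is a common approximation, not necessarily the paper's exact protocol.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_NAME = "ltg/norbert"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()


def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when masked one at a time."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total


acceptable = "Hun leser en bok."      # "She reads a book."
unacceptable = "Hun leser en bøker."  # number-agreement error
print(pseudo_log_likelihood(acceptable) > pseudo_log_likelihood(unacceptable))
```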
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.