NorDial: A Preliminary Corpus of Written Norwegian Dialect Use
- URL: http://arxiv.org/abs/2104.04989v1
- Date: Sun, 11 Apr 2021 10:56:53 GMT
- Title: NorDial: A Preliminary Corpus of Written Norwegian Dialect Use
- Authors: Jeremy Barnes, Petter Mæhlum, and Samia Touileb
- Abstract summary: We collect a small corpus of tweets and manually annotate them as Bokmål, Nynorsk, any dialect, or a mix.
We perform preliminary experiments with state-of-the-art models, as well as an analysis of the data to expand this corpus in the future.
- Score: 4.211128681972148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Norway has a large amount of dialectal variation, as well as a general
tolerance to its use in the public sphere. There are, however, few available
resources to study this variation and its change over time and in more informal
areas, e.g., on social media. In this paper, we propose a first step to creating
a corpus of dialectal variation of written Norwegian. We collect a small corpus
of tweets and manually annotate them as Bokmål, Nynorsk, any dialect, or a
mix. We further perform preliminary experiments with state-of-the-art models,
as well as an analysis of the data to expand this corpus in the future.
Finally, we make the annotations and models available for future work.
Related papers
- Reddit is all you need: Authorship profiling for Romanian [49.1574468325115]
Authorship profiling is the process of identifying an author's characteristics based on their writings.
In this paper, we introduce a corpus of short texts in the Romanian language, annotated with certain author characteristic keywords.
arXiv Detail & Related papers (2024-10-13T16:27:31Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorùbá is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - A Novel Cartography-Based Curriculum Learning Method Applied on RoNLI: The First Romanian Natural Language Inference Corpus [71.77214818319054]
Natural language inference is a proxy for natural language understanding.
There is no publicly available NLI corpus for the Romanian language.
We introduce the first Romanian NLI corpus (RoNLI) comprising 58K training sentence pairs.
arXiv Detail & Related papers (2024-05-20T08:41:15Z) - Boosting Norwegian Automatic Speech Recognition [0.0]
We present several baselines for automatic speech recognition (ASR) models for the two official written languages in Norway: Bokmål and Nynorsk.
We compare the performance of models of varying sizes and pre-training approaches on multiple Norwegian speech datasets.
We improve the state of the art on the Norwegian Parliamentary Speech Corpus (NPSC) from a word error rate (WER) of 17.10% to 7.60%, with models achieving 5.81% for Bokmål and 11.54% for Nynorsk.
arXiv Detail & Related papers (2023-07-04T12:05:15Z) - NoCoLA: The Norwegian Corpus of Linguistic Acceptability [2.538209532048867]
We present two new Norwegian datasets for evaluating language models.
NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences.
NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner.
arXiv Detail & Related papers (2023-06-13T14:11:19Z) - Annotating Norwegian Language Varieties on Twitter for Part-of-Speech [14.031720101413557]
We present a novel Norwegian Twitter dataset annotated with POS-tags.
We show that models trained on Universal Dependency (UD) data perform worse when evaluated against this dataset.
We also see that performance on dialectal tweets is comparable to the written standards for some models.
arXiv Detail & Related papers (2022-10-12T12:53:30Z) - NorDiaChange: Diachronic Semantic Change Dataset for Norwegian [63.65426535861836]
NorDiaChange is the first diachronic semantic change dataset for Norwegian.
It covers about 80 Norwegian nouns manually annotated with graded semantic change over time.
arXiv Detail & Related papers (2022-01-13T18:27:33Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks.
In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z) - Learning language variations in news corpora through differential embeddings [0.0]
We show that a model with a central word representation and a slice-dependent contribution can learn word embeddings from different corpora simultaneously.
We show that it can capture both temporal dynamics in the yearly slices of each corpus, and language variations between US and UK English in a curated multi-source corpus.
arXiv Detail & Related papers (2020-11-13T14:50:08Z) - Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary, typically selected before training and permanently fixed later, affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)