Related papers: NoCoLA: The Norwegian Corpus of Linguistic Acceptability

NoCoLA: The Norwegian Corpus of Linguistic Acceptability

URL: http://arxiv.org/abs/2306.07790v1
Date: Tue, 13 Jun 2023 14:11:19 GMT
Title: NoCoLA: The Norwegian Corpus of Linguistic Acceptability
Authors: Matias Jentoft and David Samuel
Abstract summary: We present two new Norwegian datasets for evaluating language models. NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner.
Score: 2.538209532048867
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While there has been a surge of large language models for Norwegian in recent years, we lack any tool to evaluate their understanding of grammaticality. We present two new Norwegian datasets for this task. NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences. On the other hand, NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner, i.e. without any further training. In this paper, we describe both datasets in detail, show how to use them for different flavors of language models, and conduct a comparative study of the existing Norwegian language models.

Related papers

Small Languages, Big Models: A Study of Continual Training on Languages of Norway [11.548845014405984]
Training large language models requires vast amounts of data. We present a novel three-stage continual training approach that substantially improves the downstream performance. We release a new generative language model for Norwegian Bokmral, Nynorsk, and Northern S'ami with 11.4 billion parameters: NorMistral-11B.
arXiv Detail & Related papers (2024-12-09T13:34:23Z)
Small Language Models Like Small Vocabularies: Probing the Linguistic Abilities of Grapheme- and Phoneme-Based Baby Llamas [7.585433383340306]
We show that small models based on the Llama architecture can achieve strong linguistic performance on standard syntactic and novel lexical/phonetic benchmarks. Our findings suggest a promising direction for creating more linguistically plausible language models that are better suited for computational studies of language acquisition and processing.
arXiv Detail & Related papers (2024-10-02T12:36:08Z)
Learning Phonotactics from Linguistic Informants [54.086544221761486]
Our model iteratively selects or synthesizes a data-point according to one of a range of information-theoretic policies. We find that the information-theoretic policies that our model uses to select items to query the informant achieve sample efficiency comparable to, or greater than, fully supervised approaches.
arXiv Detail & Related papers (2024-05-08T00:18:56Z)
NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only 5 million population, is under-representative within the most impressive breakthroughs in NLP tasks. To fill this gap, we compiled the existing Norwegian dataset and pre-trained 4 Norwegian Open Language Models. We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z)
CompoundPiece: Evaluating and Improving Decompounding Performance of Language Models [77.45934004406283]
We systematically study decompounding, the task of splitting compound words into their constituents. We introduce a dataset of 255k compound and non-compound words across 56 diverse languages obtained from Wiktionary. We introduce a novel methodology to train dedicated models for decompounding.
arXiv Detail & Related papers (2023-05-23T16:32:27Z)
Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
Annotating Norwegian Language Varieties on Twitter for Part-of-Speech [14.031720101413557]
We present a novel Norwegian Twitter dataset annotated with POS-tags. We show that models trained on Universal Dependency (UD) data perform worse when evaluated against this dataset. We also see that performance on dialectal tweets is comparable to the written standards for some models.
arXiv Detail & Related papers (2022-10-12T12:53:30Z)
Understanding by Understanding Not: Modeling Negation in Language Models [81.21351681735973]
Negation is a core construction in natural language. We propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences. We reduce the mean top1 error rate to 4% on the negated LAMA dataset.
arXiv Detail & Related papers (2021-05-07T21:58:35Z)
Large-Scale Contextualised Language Modelling for Norwegian [7.5722195869569]
This paper introduces the first large-scale monolingual language models for Norwegian, based on both the ELMo and BERT frameworks. In addition to detailing the training process, we present contrastive benchmark results on a suite of NLP tasks for Norwegian.
arXiv Detail & Related papers (2021-04-13T23:18:04Z)
SLM: Learning a Discourse Language Representation with Sentence Unshuffling [53.42814722621715]
We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation. We show that this feature of our model improves the performance of the original BERT by large margins.
arXiv Detail & Related papers (2020-10-30T13:33:41Z)
Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict. This work shows a comparison of a neural model and character language models with varying amounts on target language data. Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)

This list is automatically generated from the titles and abstracts of the papers in this site.