NorDiaChange: Diachronic Semantic Change Dataset for Norwegian
- URL: http://arxiv.org/abs/2201.05123v1
- Date: Thu, 13 Jan 2022 18:27:33 GMT
- Title: NorDiaChange: Diachronic Semantic Change Dataset for Norwegian
- Authors: Andrey Kutuzov, Samia Touileb, Petter Mæhlum, Tita Ranveig Enstad, Alexandra Wittemann
- Abstract summary: NorDiaChange is the first diachronic semantic change dataset for Norwegian.
It covers about 80 Norwegian nouns manually annotated with graded semantic change over time.
- Score: 63.65426535861836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe NorDiaChange: the first diachronic semantic change dataset for
Norwegian. NorDiaChange comprises two novel subsets, covering about 80
Norwegian nouns manually annotated with graded semantic change over time. Both
datasets follow the same annotation procedure and can be used interchangeably
as train and test splits for each other. NorDiaChange covers the time periods
related to pre- and post-war events, oil and gas discovery in Norway, and
technological developments. The annotation was done using the DURel framework
and two large historical Norwegian corpora. NorDiaChange is published in full
under a permissive license, complete with raw annotation data and inferred
diachronic word usage graphs (DWUGs).
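To give a sense of how DWUG-style annotations are typically consumed downstream, here is a minimal sketch (not the authors' released pipeline): pairwise usage judgments become a weighted usage graph, the graph is clustered into senses, and the sense distributions of the two time periods are compared with Jensen-Shannon distance as a graded change score. The edge format, judgment threshold, and clustering algorithm below are illustrative assumptions, not the published data format.

```python
# Minimal sketch: graded change from DWUG-style pairwise judgments.
# Assumptions (not taken from the paper): edges arrive as
# (usage_id_1, usage_id_2, median_judgment) with judgments on a 1-4 scale,
# and each usage id is tagged with its time period ("old" or "new").
from collections import Counter

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from scipy.spatial.distance import jensenshannon

def graded_change(edges, periods):
    """edges: iterable of (u, v, judgment); periods: dict usage_id -> period."""
    graph = nx.Graph()
    for u, v, judgment in edges:
        if judgment >= 2.5:                     # keep only "related" pairs as edges
            graph.add_edge(u, v, weight=judgment)
    graph.add_nodes_from(periods)               # isolated usages still count

    clusters = list(greedy_modularity_communities(graph, weight="weight"))

    # Sense distribution per period: how many usages fall into each cluster.
    counts = {"old": Counter(), "new": Counter()}
    for label, cluster in enumerate(clusters):
        for usage in cluster:
            counts[periods[usage]][label] += 1

    old = [counts["old"][l] for l in range(len(clusters))]
    new = [counts["new"][l] for l in range(len(clusters))]
    return jensenshannon(old, new)              # 0 = stable meaning, higher = more change

# Toy example: the two periods use the word in disjoint senses.
edges = [("o1", "o2", 4.0), ("n1", "n2", 4.0), ("o1", "n1", 1.0)]
periods = {"o1": "old", "o2": "old", "n1": "new", "n2": "new"}
print(round(graded_change(edges, periods), 3))  # -> 0.833, maximal divergence
```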
Related papers
- NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian [4.062031248854444]
Norwegian, spoken by only about 5 million people, is under-represented in the most impressive breakthroughs in NLP.
To fill this gap, we compiled existing Norwegian datasets and pre-trained 4 Norwegian Open Language Models.
We find that the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian context.
arXiv Detail & Related papers (2023-12-03T08:09:45Z)
- NoCoLA: The Norwegian Corpus of Linguistic Acceptability [2.538209532048867]
We present two new Norwegian datasets for evaluating language models.
NoCoLA_class is a supervised binary classification task where the goal is to discriminate between acceptable and non-acceptable sentences.
NoCoLA_zero is a purely diagnostic task for evaluating the grammatical judgement of a language model in a completely zero-shot manner; a minimal sketch of this kind of zero-shot scoring appears after this list.
arXiv Detail & Related papers (2023-06-13T14:11:19Z)
- Aligning the Norwegian UD Treebank with Entity and Coreference Information [0.0]
This paper presents a merged collection of entity- and coreference-annotated data grounded in the Universal Dependencies (UD) treebanks for the two written forms of Norwegian: Bokmål and Nynorsk.
The aligned and converted corpora are the Norwegian Named Entities (NorNE) and the Norwegian Anaphora Resolution Corpus (NARC).
arXiv Detail & Related papers (2023-05-22T22:44:53Z)
- Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization [80.94424037751243]
In zero-shot multilingual extractive text summarization, a model is typically trained on an English dataset and then applied to summarization datasets in other languages.
We propose NLS (Neural Label Search for Summarization), which jointly learns hierarchical weights for different sets of labels together with our summarization model.
We conduct multilingual zero-shot summarization experiments on MLSUM and WikiLingua datasets, and we achieve state-of-the-art results using both human and automatic evaluations.
arXiv Detail & Related papers (2022-04-28T14:02:16Z)
- Compositional Temporal Grounding with Structured Variational Cross-Graph Correspondence Learning [92.07643510310766]
Temporal grounding in videos aims to localize one target video segment that semantically corresponds to a given query sentence.
We introduce a new Compositional Temporal Grounding task and construct two new dataset splits.
We empirically find that existing state-of-the-art methods fail to generalize to queries with novel combinations of seen words.
We propose a variational cross-graph reasoning framework that explicitly decomposes video and language into multiple structured hierarchies.
arXiv Detail & Related papers (2022-03-24T12:55:23Z)
- Three-part diachronic semantic change dataset for Russian [4.7566046630595755]
We present a manually annotated lexical semantic change dataset for Russian: RuShiftEval.
Its novelty is ensured by a single set of target words annotated for their diachronic semantic shifts across three time periods.
arXiv Detail & Related papers (2021-06-15T17:12:25Z)
- NorDial: A Preliminary Corpus of Written Norwegian Dialect Use [4.211128681972148]
We collect a small corpus of tweets and manually annotate them as Bokmål, Nynorsk, any dialect, or a mix.
We perform preliminary experiments with state-of-the-art models, as well as an analysis of the data to expand this corpus in the future.
arXiv Detail & Related papers (2021-04-11T10:56:53Z)
- Local Additivity Based Data Augmentation for Semi-supervised NER [59.90773003737093]
Named Entity Recognition (NER) is one of the first stages in deep language understanding.
Current NER models heavily rely on human-annotated data.
We propose a Local Additivity based Data Augmentation (LADA) method for semi-supervised NER.
arXiv Detail & Related papers (2020-10-04T20:46:26Z)
- DART: Open-Domain Structured Data Record to Text Generation [91.23798751437835]
We present DART, an open-domain structured DAta Record to Text generation dataset with over 82k instances (DARTs).
We propose a procedure of extracting semantic triples from tables that encode their structures by exploiting the semantic dependencies among table headers and the table title; a toy flattening of a table row into triples is sketched after this list.
Our dataset construction framework effectively merged heterogeneous sources from open-domain semantic parsing and dialogue-act-based meaning representation tasks.
arXiv Detail & Related papers (2020-07-06T16:35:30Z)
- e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations [87.71914254873857]
We present a data collection effort to correct the class with the highest error rate in SNLI-VE.
We also introduce e-SNLI-VE, which appends human-written natural language explanations to SNLI-VE.
We train models that learn from these explanations at training time, and output such explanations at testing time.
arXiv Detail & Related papers (2020-04-07T23:12:51Z)
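As referenced in the NoCoLA entry above, a minimal sketch of zero-shot acceptability evaluation is to score both sentences of an acceptable/unacceptable pair with a causal language model and check whether the acceptable variant receives the higher log-probability. The checkpoint name below is a placeholder and the scoring recipe is an illustrative assumption, not NoCoLA's official protocol.

```python
# Minimal sketch: zero-shot grammatical judgement via causal-LM log-probability.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-norwegian-causal-lm"  # placeholder, substitute a real checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability of a sentence under the causal LM."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean negative log-likelihood over the predicted positions,
    # so multiply by the number of prediction steps to get a total log-probability.
    return -out.loss.item() * (enc["input_ids"].size(1) - 1)

def prefers_acceptable(acceptable: str, unacceptable: str) -> bool:
    """True if the model assigns higher probability to the acceptable sentence."""
    return sentence_logprob(acceptable) > sentence_logprob(unacceptable)
```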
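The DART entry above describes building triples from tables. The snippet below is only a naive flattening for intuition; DART's actual construction relies on annotated semantic dependencies among column headers, which this sketch ignores. The assumption that the first column names the row's subject is illustrative, and the table content is a made-up toy example.

```python
# Naive sketch: flatten one table row into (subject, predicate, object) triples.
from typing import List, Tuple

def row_to_triples(title: str, headers: List[str], row: List[str]) -> List[Tuple[str, str, str]]:
    subject = row[0]                               # assume column 0 names the entity
    triples = [(title, "has entry", subject)]      # tie the row to the table title
    for header, cell in zip(headers[1:], row[1:]):
        triples.append((subject, header, cell))
    return triples

headers = ["Player", "Team", "Goals"]
row = ["Jane Doe", "Example FC", "12"]
print(row_to_triples("Top scorers (toy example)", headers, row))
# [('Top scorers (toy example)', 'has entry', 'Jane Doe'),
#  ('Jane Doe', 'Team', 'Example FC'),
#  ('Jane Doe', 'Goals', '12')]
```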
This list is automatically generated from the titles and abstracts of the papers on this site.