Aligning the Norwegian UD Treebank with Entity and Coreference Information
- URL: http://arxiv.org/abs/2305.13527v2
- Date: Thu, 25 May 2023 22:36:36 GMT
- Authors: Tollef Emil Jørgensen and Andre Kåsen
- Abstract summary: This paper presents a merged collection of entity and coreference annotated data grounded in the Universal Dependencies (UD) treebanks for the two written forms of Norwegian: Bokmål and Nynorsk.
The aligned and converted corpora are the Norwegian Named Entities (NorNE) and Norwegian Anaphora Resolution Corpus (NARC).
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper presents a merged collection of entity and coreference annotated
data grounded in the Universal Dependencies (UD) treebanks for the two written
forms of Norwegian: Bokmål and Nynorsk. The aligned and converted corpora
are the Norwegian Named Entities (NorNE) and Norwegian Anaphora Resolution
Corpus (NARC). While NorNE is aligned with an older version of the treebank,
NARC is misaligned and requires extensive transformation from the original
annotations to the UD structure and CoNLL-U format. We here demonstrate the
conversion and alignment processes, along with an analysis of discovered issues
and errors in the data - some of which include data split overlaps in the
original treebank. These procedures and the developed system may prove helpful
for future corpus alignment and coreference annotation endeavors. The merged
corpora comprise the first Norwegian UD treebank enriched with named entities
and coreference information.
Related papers
- Structured Dialogue Discourse Parsing [79.37200787463917]
Dialogue discourse parsing aims to uncover the internal structure of a multi-participant conversation.
We propose a principled method that improves upon previous work from two perspectives: encoding and decoding.
Experiments show that our method achieves a new state of the art, surpassing the previous model by 2.3 on STAC and 1.5 on Molweni.
arXiv Detail & Related papers (2023-06-26T22:51:01Z)
- Enriching the NArabizi Treebank: A Multifaceted Approach to Supporting an Under-Resourced Language [0.0]
NArabizi is a Romanized form of North African Arabic used mostly on social media.
We introduce an enriched version of NArabizi Treebank with three main contributions.
arXiv Detail & Related papers (2023-06-26T17:27:31Z)
- Constructing Code-mixed Universal Dependency Forest for Unbiased Cross-lingual Relation Extraction [92.84968716013783]
Cross-lingual relation extraction (XRE) aggressively leverages the language-consistent structural features from the universal dependency (UD) resource.
We investigate an unbiased UD-based XRE transfer by constructing a type of code-mixed UD forest.
With such forest features, the gaps of UD-based XRE between the training and predicting phases can be effectively closed.
arXiv Detail & Related papers (2023-05-20T18:24:06Z)
- NorBench -- A Benchmark for Norwegian Language Models [7.395163289937936]
We present NorBench: a suite of NLP tasks and probes for evaluating Norwegian language models (LMs) on standardized data splits and evaluation metrics.
We also introduce a range of new Norwegian language models (both encoder and encoder-decoder based).
We compare and analyze their performance, along with other existing LMs, across the different benchmark tests of NorBench.
arXiv Detail & Related papers (2023-05-06T00:20:24Z)
- Nested Named Entity Recognition as Holistic Structure Parsing [92.8397338250383]
This work models the full nested NEs in a sentence as a holistic structure and proposes a holistic structure parsing algorithm to recover all NEs at once.
Experiments show that our model yields promising results on widely used benchmarks, approaching or even matching the state of the art.
arXiv Detail & Related papers (2022-04-17T12:48:20Z)
- Incorporating Constituent Syntax for Coreference Resolution [50.71868417008133]
We propose a graph-based method to incorporate constituent syntactic structures.
We also explore utilising higher-order neighbourhood information to encode rich structures in constituent trees.
Experiments on the English and Chinese portions of the OntoNotes 5.0 benchmark show that our proposed model either beats a strong baseline or achieves new state-of-the-art performance.
arXiv Detail & Related papers (2022-02-22T07:40:42Z)
- NorDiaChange: Diachronic Semantic Change Dataset for Norwegian [63.65426535861836]
NorDiaChange is the first diachronic semantic change dataset for Norwegian.
It covers about 80 Norwegian nouns manually annotated with graded semantic change over time.
arXiv Detail & Related papers (2022-01-13T18:27:33Z)
- Named Entity Recognition and Linking Augmented with Large-Scale Structured Data [3.211619859724085]
We describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021.
The tasks focused on the analysis of Named Entities in multilingual Web documents in Slavic languages with rich inflection.
Our solution takes advantage of large collections of both unstructured and structured documents.
arXiv Detail & Related papers (2021-04-27T20:10:18Z)
- AMALGUM -- A Free, Balanced, Multilayer English Web Corpus [14.073494095236027]
We present a genre-balanced English web corpus totaling 4M tokens.
By tapping open online data sources, the corpus is meant to offer a more sizable alternative to smaller, manually created annotated data sets.
arXiv Detail & Related papers (2020-06-18T17:05:45Z)
- e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations [87.71914254873857]
We present a data collection effort to correct the class with the highest error rate in SNLI-VE.
We also introduce e-SNLI-VE, which appends human-written natural language explanations to SNLI-VE.
We train models that learn from these explanations at training time, and output such explanations at testing time.
arXiv Detail & Related papers (2020-04-07T23:12:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.