Out-of-Domain Evaluation of Finnish Dependency Parsing
- URL: http://arxiv.org/abs/2204.10621v1
- Date: Fri, 22 Apr 2022 10:34:19 GMT
- Title: Out-of-Domain Evaluation of Finnish Dependency Parsing
- Authors: Jenna Kanerva and Filip Ginter
- Abstract summary: In many real-world applications, the data to which the model is applied may differ substantially in its characteristics from the training data.
In this paper, we focus on Finnish out-of-domain parsing by introducing a novel UD Finnish-OOD out-of-domain treebank.
We present extensive out-of-domain evaluation utilizing the available section-level information from three different UD treebanks.
- Score: 0.8957681069740162
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The prevailing practice in academia is to evaluate model performance
on in-domain evaluation data, typically set aside from the training corpus.
However, in many real-world applications the data to which the model is applied
may differ substantially in its characteristics from the training data. In
this paper, we focus on Finnish out-of-domain parsing by introducing a novel UD
Finnish-OOD out-of-domain treebank including five very distinct data sources
(web documents, clinical, online discussions, tweets, and poetry), and a total
of 19,382 syntactic words in 2,122 sentences released under the Universal
Dependencies framework. Together with the new treebank, we present extensive
out-of-domain parsing evaluation utilizing the available section-level
information from three different Finnish UD treebanks (TDT, PUD, OOD). Compared
to the previously existing treebanks, the new Finnish-OOD is shown to include
sections more challenging for the general parser, creating an interesting
evaluation setting and yielding valuable information for those applying the
parser outside of its training domain.
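The evaluation described above rests on comparing predicted dependency trees against gold-standard trees in the Universal Dependencies CoNLL-U format. As a minimal sketch of how such a comparison is typically scored, the following computes labelled attachment score (LAS); the file names are hypothetical, the parsing itself is assumed to have been done elsewhere, and this is not the official CoNLL evaluation script.

```python
def read_conllu(path):
    """Yield one sentence at a time as a list of (head, deprel) pairs."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line ends a sentence
                if sent:
                    yield sent
                    sent = []
                continue
            if line.startswith("#"):          # comment / metadata line
                continue
            cols = line.split("\t")
            # skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1)
            if "-" in cols[0] or "." in cols[0]:
                continue
            sent.append((cols[6], cols[7]))   # HEAD and DEPREL columns
    if sent:
        yield sent

def las(gold_path, pred_path):
    """Labelled attachment score: fraction of words whose predicted
    head AND dependency label both match the gold annotation."""
    correct = total = 0
    for g_sent, p_sent in zip(read_conllu(gold_path), read_conllu(pred_path)):
        for g, p in zip(g_sent, p_sent):
            total += 1
            if g == p:
                correct += 1
    return correct / total
```

Running the same comparison per treebank section (TDT, PUD, OOD) gives the kind of section-level breakdown the paper reports.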
Related papers
- Thai Universal Dependency Treebank [0.0]
We introduce the Thai Universal Dependency Treebank (TUD), the largest Thai treebank to date, consisting of 3,627 trees annotated in accordance with the Universal Dependencies (UD) framework.
We then benchmark dependency parsing models that incorporate pretrained encoders and train them on Thai-PUD and our TUD.
The results show that most of our models can outperform other models reported in previous papers and provide insight into the optimal choices of components in Thai dependency parsers.
arXiv Detail & Related papers (2024-05-13T09:48:13Z)
- Natural Language Processing for Dialects of a Language: A Survey [56.93337350526933]
State-of-the-art natural language processing (NLP) models are trained on massive training corpora, and report a superlative performance on evaluation datasets.
This survey delves into an important attribute of these datasets: the dialect of a language.
Motivated by the performance degradation of NLP models on dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets and approaches.
arXiv Detail & Related papers (2024-01-11T03:04:38Z)
- Disco-Bench: A Discourse-Aware Evaluation Benchmark for Language Modelling [70.23876429382969]
We propose a benchmark that can evaluate intra-sentence discourse properties across a diverse set of NLP tasks.
Disco-Bench consists of 9 document-level testsets in the literature domain, which contain rich discourse phenomena.
For linguistic analysis, we also design a diagnostic test suite that can examine whether the target models learn discourse knowledge.
arXiv Detail & Related papers (2023-07-16T15:18:25Z)
- BLUEX: A benchmark based on Brazilian Leading Universities Entrance eXams [4.9069311006119865]
We introduce BLUEX, a dataset of entrance exams from the two leading universities in Brazil: UNICAMP and USP.
The dataset includes annotated metadata for evaluating the performance of NLP models on a variety of subjects.
We establish a benchmark through experiments with state-of-the-art LMs, demonstrating its potential for advancing the state-of-the-art in natural language understanding and reasoning in Portuguese.
arXiv Detail & Related papers (2023-07-11T16:25:09Z)
- Multi-Dimensional Evaluation of Text Summarization with In-Context Learning [79.02280189976562]
In this paper, we study the efficacy of large language models as multi-dimensional evaluators using in-context learning.
Our experiments show that in-context learning-based evaluators are competitive with learned evaluation frameworks for the task of text summarization.
We then analyze the effects of factors such as the selection and number of in-context examples on performance.
arXiv Detail & Related papers (2023-06-01T23:27:49Z)
- Why Can't Discourse Parsing Generalize? A Thorough Investigation of the Impact of Data Diversity [10.609715843964263]
We show that state-of-the-art architectures trained on the standard English newswire benchmark do not generalize well.
We quantify the impact of genre diversity in training data for achieving generalization to unseen text types.
To our knowledge, this study is the first to fully evaluate cross-corpus RST parsing generalizability on complete trees.
arXiv Detail & Related papers (2023-02-13T16:11:58Z)
- Towards a Unified Multi-Dimensional Evaluator for Text Generation [101.47008809623202]
We propose UniEval, a unified multi-dimensional evaluator for Natural Language Generation (NLG).
We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions.
Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics.
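The Boolean-QA re-framing described above can be sketched roughly as follows. The question templates and the scoring interface here are illustrative assumptions, not the actual UniEval prompts or API: the idea is simply that one model answers a different yes/no question per evaluation dimension.

```python
# Illustrative question templates, one per evaluation dimension
# (hypothetical wording, not the actual UniEval prompts).
TEMPLATES = {
    "coherence": "Is this a coherent summary of the document? "
                 "Summary: {out} Document: {src}",
    "fluency":   "Is this a fluent piece of text? Text: {out}",
    "relevance": "Is this summary relevant to the document? "
                 "Summary: {out} Document: {src}",
}

def evaluate(src, out, yes_probability):
    """Score each dimension as the probability the underlying Boolean-QA
    model answers "yes"; the model is passed in as a callable so this
    sketch stays model-agnostic."""
    return {dim: yes_probability(t.format(src=src, out=out))
            for dim, t in TEMPLATES.items()}
```

A single evaluator thus covers multiple dimensions just by varying the question it is asked.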
arXiv Detail & Related papers (2022-10-13T17:17:03Z)
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
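To make the idea of transformation-based robustness analysis concrete, here is a minimal sketch: a toy "universal" text transformation and a measurement of the accuracy drop it causes. Both functions are illustrative assumptions and not part of the actual TextFlint API.

```python
import random

def swap_adjacent_words(text, rng):
    """A toy universal text transformation: swap one random pair of
    adjacent words, loosely in the spirit of the perturbations
    described above (not the actual TextFlint API)."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def robustness_drop(model, dataset, transform):
    """Accuracy on original inputs minus accuracy on transformed
    inputs; a large drop signals a robustness weakness."""
    orig = sum(model(x) == y for x, y in dataset) / len(dataset)
    pert = sum(model(transform(x)) == y for x, y in dataset) / len(dataset)
    return orig - pert
```

A real toolkit layers many such transformations (task-specific edits, adversarial attacks, subpopulation slices) and aggregates the drops into a report.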
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
- Hierarchical Bi-Directional Self-Attention Networks for Paper Review Rating Recommendation [81.55533657694016]
We propose a Hierarchical bi-directional self-attention Network framework (HabNet) for paper review rating prediction and recommendation.
Specifically, we leverage the hierarchical structure of the paper reviews with three levels of encoders: a sentence encoder (level one), an intra-review encoder (level two), and an inter-review encoder (level three).
We are able to identify useful predictors to make the final acceptance decision, as well as to help discover the inconsistency between numerical review ratings and text sentiment conveyed by reviewers.
arXiv Detail & Related papers (2020-11-02T08:07:50Z)
- Likelihood Ratios and Generative Classifiers for Unsupervised Out-of-Domain Detection In Task Oriented Dialog [24.653367921046442]
We focus on OOD detection for natural language sentence inputs to task-based dialog systems.
We release a dataset of 4K OOD examples for the publicly available dataset from Schuster et al.
arXiv Detail & Related papers (2019-12-30T03:31:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.