The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and
Results
- URL: http://arxiv.org/abs/2110.04392v1
- Date: Fri, 8 Oct 2021 21:57:08 GMT
- Title: The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and
Results
- Authors: Marina Fomicheva, Piyawat Lertvittayakumjorn, Wei Zhao, Steffen Eger,
Yang Gao
- Abstract summary: Given a source-translation pair, this task requires systems not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality.
We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce the Eval4NLP-2021 shared task on
explainable quality estimation. Given a source-translation pair, this shared
task requires participants not only to provide a sentence-level score
indicating the overall quality of the translation, but also to explain this
score by identifying the words that negatively impact translation quality. We
present the data, annotation
guidelines and evaluation setup of the shared task, describe the six
participating systems, and analyze the results. To the best of our knowledge,
this is the first shared task on explainable NLP evaluation metrics. Datasets
and results are available at https://github.com/eval4nlp/SharedTask2021.
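
To make the task format concrete, here is a minimal sketch with toy data; the official file formats and evaluation scripts are defined in the repository linked above, so the metrics below (Pearson correlation for sentence scores, AUC for word-level explanations) are illustrative choices rather than the official setup.

```python
# Minimal sketch of the task's input/output shape and an illustrative
# evaluation on toy data; the official metrics and formats are defined in
# the shared-task repository, so treat this only as an approximation.
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# One source-translation pair with toy gold annotations.
source = "Das Haus ist klein ."
translation = "The house is big ."
gold_word_errors = [0, 0, 0, 1, 0]  # 1 = word hurts translation quality ("big")

# A system outputs a sentence-level score and one relevance score per
# target word that explains the sentence-level score.
pred_word_scores = [0.10, 0.05, 0.20, 0.90, 0.10]
print("word-level AUC:", roc_auc_score(gold_word_errors, pred_word_scores))

# Sentence-level scores are compared against gold quality scores over the
# whole test set, e.g. with Pearson correlation.
gold_sentence_scores = [0.40, 0.90, 0.70, 0.20]
pred_sentence_scores = [0.35, 0.80, 0.75, 0.30]
print("sentence-level Pearson r:",
      pearsonr(gold_sentence_scores, pred_sentence_scores)[0])
```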
Related papers
- Narrative Action Evaluation with Prompt-Guided Multimodal Interaction (arXiv: 2024-04-22)
Narrative action evaluation (NAE) aims to generate professional commentary that evaluates the execution of an action.
NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor.
We propose a prompt-guided multimodal interaction framework to facilitate the interaction between different modalities of information.
- Exploring Prompting Large Language Models as Explainable Metrics (arXiv: 2023-11-20)
We propose a zero-shot, prompt-based strategy for explainable evaluation of the summarization task using Large Language Models (LLMs).
Our experiments demonstrate the promising potential of LLMs as evaluation metrics in Natural Language Processing (NLP).
Our best prompt achieved a Kendall correlation of 0.477 with human evaluations on the text summarization test data.
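
As a side note on the reported number, Kendall's tau simply measures rank agreement between automatic metric scores and human judgements; a minimal sketch with made-up values (not this paper's data):

```python
# Minimal sketch: Kendall's tau between automatic metric scores and human
# judgements (toy values, not the paper's data).
from scipy.stats import kendalltau

human_ratings = [4.0, 2.5, 3.0, 5.0, 1.0]       # e.g. human summary-quality ratings
metric_scores = [0.71, 0.55, 0.42, 0.80, 0.30]  # e.g. LLM-prompted quality scores

tau, p_value = kendalltau(human_ratings, metric_scores)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```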
- Unify word-level and span-level tasks: NJUNLP's Participation for the WMT2023 Quality Estimation Shared Task (arXiv: 2023-09-23)
We present the NJUNLP team's submission to the WMT 2023 Quality Estimation (QE) shared task.
Our team submitted predictions for the English-German language pair on both sub-tasks.
Our models achieved the best results in English-German for both word-level and fine-grained error span detection sub-tasks.
- SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts (arXiv: 2023-09-21)
We describe SemEval-2022 Task 7, a shared task on rating the plausibility of clarifications in instructional texts.
The dataset for this task consists of manually clarified how-to guides for which we generated alternative clarifications and collected human plausibility judgements.
The task of participating systems was to automatically determine the plausibility of a clarification in the respective context.
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering (arXiv: 2023-07-13)
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose the instruction-style question about the quality of the generated text into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
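
A rough sketch of this decompose-and-recompose idea follows. The prompt wording, the hypothetical `answer_yes_probability` callback, and the mean aggregation are illustrative stand-ins, not DecompEval's exact formulation.

```python
# Rough sketch of decomposed, instruction-style evaluation. Assumptions:
# the prompt wording and the mean aggregation are illustrative only, and
# `answer_yes_probability` is a hypothetical stand-in for a PLM that
# returns the probability of answering "yes" to a question.
from typing import Callable, List


def decomposed_eval(context: str,
                    generated_text: str,
                    answer_yes_probability: Callable[[str], float]) -> float:
    # One subquestion per sentence of the generated text.
    sentences = [s.strip() for s in generated_text.split(".") if s.strip()]
    subquestions: List[str] = [
        f"Context: {context}\nSentence: {sent}.\n"
        "Question: Is this sentence a good continuation given the context? "
        "Answer yes or no."
        for sent in sentences
    ]
    # Recompose per-sentence answers into one evaluation result (here: mean).
    answers = [answer_yes_probability(q) for q in subquestions]
    return sum(answers) / len(answers) if answers else 0.0


# Toy usage with a dummy stand-in that always answers "yes" with p=0.5.
score = decomposed_eval(
    context="A dialogue about the weather.",
    generated_text="It is sunny today. Banana keyboard river.",
    answer_yes_probability=lambda question: 0.5,
)
print(f"decomposed quality score: {score:.2f}")
```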
- SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation (arXiv: 2023-05-22)
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries (arXiv: 2020-10-23)
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
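
To illustrate the kind of token alignment being analysed, here is a minimal unigram-overlap scorer in the spirit of ROUGE-1 (a simplified stand-in, not the official implementation): high lexical overlap can coexist with a reversed fact, which is why such scores are hard to read as information overlap.

```python
# Minimal unigram-overlap scorer in the spirit of ROUGE-1 (simplified; the
# real implementation adds stemming and n-gram variants). It shows how high
# token overlap can coexist with a reversed fact.
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# ~0.93 F1 even though the candidate negates the reference's key claim.
print(unigram_f1("the drug was not approved by the agency",
                 "the drug was approved by the agency"))
```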
- Unsupervised Reference-Free Summary Quality Evaluation via Contrastive Learning (arXiv: 2020-10-05)
We propose to evaluate summary quality without reference summaries via unsupervised contrastive learning.
Specifically, we design a new BERT-based metric that covers both linguistic quality and semantic informativeness.
Experiments on Newsroom and CNN/Daily Mail demonstrate that our new evaluation method outperforms other metrics even without reference summaries.
This list is automatically generated from the titles and abstracts of the papers on this site.