SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit
and Underspecified Phrases in Instructional Texts
- URL: http://arxiv.org/abs/2309.12102v1
- Date: Thu, 21 Sep 2023 14:19:04 GMT
- Authors: Michael Roth, Talita Anthonio, Anna Sauer
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We describe SemEval-2022 Task 7, a shared task on rating the plausibility of
clarifications in instructional texts. The dataset for this task consists of
manually clarified how-to guides for which we generated alternative
clarifications and collected human plausibility judgements. The task of
participating systems was to automatically determine the plausibility of a
clarification in the respective context. In total, 21 participants took part in
this task, with the best system achieving an accuracy of 68.9%. This report
summarizes the results and findings from 8 teams and their system descriptions.
Finally, we show in an additional evaluation that predictions by the top
participating team make it possible to identify contexts with multiple
plausible clarifications with an accuracy of 75.2%.
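To make the task concrete: systems receive an instructional context containing a gap and a candidate clarification, and must judge how plausible the filled-in text is. Below is a minimal sketch of one naive way to score candidates, ranking fillers by language-model likelihood; the gpt2 checkpoint, the `____` gap marker, and the scoring heuristic are illustrative assumptions, not the organizers' baseline.

```python
# A minimal sketch, NOT the official baseline: rank candidate fillers
# for a gap by language-model likelihood. The gpt2 checkpoint, the
# "____" gap marker and the scoring heuristic are all assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def plausibility_score(context_with_gap: str, filler: str) -> float:
    """Average log-likelihood of the instruction with the gap filled."""
    text = context_with_gap.replace("____", filler)
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids yields the mean token cross-entropy.
        loss = model(ids, labels=ids).loss
    return -loss.item()  # higher = more plausible under the LM

context = "Cut the dough into strips. Place them on a ____ before baking."
for filler in ["baking sheet", "television", "greased tray"]:
    print(filler, round(plausibility_score(context, filler), 3))
```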
Related papers
- SemEval-2024 Shared Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes [48.83290963506378]
This paper presents the results of SHROOM, a shared task focused on detecting hallucinations.
We observe a number of key trends in how participants tackled the task.
While a majority of the teams did outperform our proposed baseline system, the performance of the top-scoring systems is still consistent with random chance on the more challenging items.
arXiv Detail & Related papers (2024-03-12T15:06:22Z)
- Little Giants: Exploring the Potential of Small LLMs as Evaluation Metrics in Summarization in the Eval4NLP 2023 Shared Task [53.163534619649866]
This paper focuses on assessing the effectiveness of prompt-based techniques to empower Large Language Models to handle the task of quality estimation.
We conducted systematic experiments with various prompting techniques, including standard prompting, prompts informed by annotator instructions, and innovative chain-of-thought prompting.
Our work reveals that combining these approaches using a "small", open source model (orca_mini_v3_7B) yields competitive results.
arXiv Detail & Related papers (2023-11-01T17:44:35Z)
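As a rough illustration of the prompt-based quality estimation described in the Little Giants entry above, the sketch below contrasts a standard prompt with a chain-of-thought prompt and parses a score from a model reply; the exact wording, the 1-5 scale, and the parsing heuristic are assumptions, not the paper's templates.

```python
# Illustrative only: two prompt styles for LLM-based summary quality
# estimation, loosely in the spirit of the Eval4NLP 2023 setup.
import re

def standard_prompt(source: str, summary: str) -> str:
    return (
        "Rate the quality of the summary on a scale from 1 (poor) to 5 "
        f"(excellent).\n\nSource:\n{source}\n\nSummary:\n{summary}\n\nScore:"
    )

def cot_prompt(source: str, summary: str) -> str:
    return (
        "Read the source and the summary. First reason step by step about "
        "coverage, faithfulness and fluency, then give a final score from "
        f"1 to 5.\n\nSource:\n{source}\n\nSummary:\n{summary}\n\nReasoning:"
    )

def parse_score(generation: str):
    """Pull the last 1-5 number out of the model's reply, if any."""
    hits = re.findall(r"[1-5](?:\.\d+)?", generation)
    return float(hits[-1]) if hits else None

print(parse_score("Coverage is fine but one claim is wrong. Final score: 4"))
```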
- BLP-2023 Task 2: Sentiment Analysis [7.725694295666573]
We present an overview of the BLP Sentiment Shared Task, organized as part of the inaugural BLP 2023 workshop.
The task is defined as the detection of sentiment in a given piece of social media text.
This paper provides a detailed account of the task setup, including dataset development and evaluation setup.
arXiv Detail & Related papers (2023-10-24T21:00:41Z)
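The interface of the BLP sentiment task above is plain text in, polarity label out. A generic sketch with an off-the-shelf classifier follows; the multilingual checkpoint is a placeholder assumption (BLP-2023 itself targets Bangla social media text), not the shared task's baseline.

```python
# Generic sketch of the task interface, not the shared-task baseline:
# social-media text in, sentiment label out.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-xlm-roberta-base-sentiment",  # placeholder
)
print(classifier("The new update is fantastic!"))
# e.g. [{'label': 'positive', 'score': 0.98}]
```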
- DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z)
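A schematic rendering of the DecompEval recipe summarized above: split the generated text into sentences, turn each into an instruction-style subquestion, and recompose the PLM's yes-probabilities into one score. The templates and the averaging step are illustrative assumptions rather than the paper's exact formulation.

```python
# Schematic sketch of the DecompEval idea: one instruction-style
# question is decomposed into per-sentence subquestions whose yes/no
# answers are recomposed into a score. Templates are assumptions.
from typing import Callable

def decomp_eval(context: str, generated: str,
                answer_yes_prob: Callable[[str], float]) -> float:
    sentences = [s.strip() for s in generated.split(".") if s.strip()]
    subquestions = [
        f'Context: {context}\nIs the sentence "{s}" of good quality? '
        "Answer yes or no."
        for s in sentences
    ]
    # Recompose: average the PLM's probability of answering "yes".
    probs = [answer_yes_prob(q) for q in subquestions]
    return sum(probs) / len(probs)

# `answer_yes_prob` stands in for a masked/seq2seq PLM scoring "yes".
score = decomp_eval("How to bake bread", "Mix flour. Add yodeling.",
                    answer_yes_prob=lambda q: 0.5)
print(score)
```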
- X-PuDu at SemEval-2022 Task 7: A Replaced Token Detection Task Pre-trained Model with Pattern-aware Ensembling for Identifying Plausible Clarifications [13.945286351253717]
This paper describes our winning system for SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases in Instructional Texts.
A replaced-token-detection pre-trained model is used with slightly different task-specific heads for SubTask-A (multi-class classification) and SubTask-B (ranking).
Our system achieves a 68.90% accuracy score and a 0.8070 Spearman's rank correlation score, surpassing the second-place team by large margins of 2.7 and 2.2 percentage points on SubTask-A and SubTask-B, respectively.
arXiv Detail & Related papers (2022-11-27T05:46:46Z)
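The X-PuDu entry above describes one encoder shared by two task-specific heads. A minimal sketch of that layout follows; the DeBERTa-v3 checkpoint (one well-known replaced-token-detection pre-trained model), the 3-class head, and [CLS]-style pooling are assumptions, not the team's exact configuration.

```python
# Sketch of a "one encoder, two task-specific heads" setup; the
# checkpoint and head sizes are assumptions, not the team's config.
import torch.nn as nn
from transformers import AutoModel

class ClarificationModel(nn.Module):
    def __init__(self, name: str = "microsoft/deberta-v3-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.cls_head = nn.Linear(hidden, 3)   # SubTask-A: 3 plausibility classes
        self.rank_head = nn.Linear(hidden, 1)  # SubTask-B: scalar rating

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids, attention_mask=attention_mask)
        pooled = h.last_hidden_state[:, 0]      # [CLS]-style pooling
        return self.cls_head(pooled), self.rank_head(pooled).squeeze(-1)
```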
- SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding [12.843166994677286]
This paper presents the shared task on Multilingual Idiomaticity Detection and Sentence Embedding.
It consists of two subtasks: (a) a binary classification task aimed at identifying whether a sentence contains an idiomatic expression, and (b) a task based on semantic text similarity which requires the model to adequately represent potentially idiomatic expressions in context.
The task attracted close to 100 registered participants organised into twenty-five teams, who made over 650 and 150 submissions in the practice and evaluation phases, respectively.
arXiv Detail & Related papers (2022-04-21T12:20:52Z)
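For subtask (b) above, a model must embed potentially idiomatic sentences so that similarity reflects meaning rather than surface overlap. A toy probe with a generic sentence-embedding model, chosen as a placeholder rather than the task's baseline:

```python
# Toy probe for the sentence-embedding subtask: a representation that
# handles idioms should score (a, b) higher than (a, c). The checkpoint
# is a generic placeholder, not the task baseline.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
a = model.encode("He kicked the bucket last year.", convert_to_tensor=True)
b = model.encode("He died last year.", convert_to_tensor=True)
c = model.encode("He kicked the ball last year.", convert_to_tensor=True)
print(util.cos_sim(a, b).item(), util.cos_sim(a, c).item())
```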
- Nowruz at SemEval-2022 Task 7: Tackling Cloze Tests with Transformers and Ordinal Regression [1.9078991171384017]
This paper outlines the system with which team Nowruz participated in SemEval-2022 Task 7: Identifying Plausible Clarifications of Implicit and Underspecified Phrases.
arXiv Detail & Related papers (2022-04-01T16:36:10Z)
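The ordinal-regression component named in the Nowruz entry above can be sketched with a textbook cumulative-threshold head, where a rating is decoded as one plus the number of thresholds exceeded; this is a generic formulation, not the team's exact model.

```python
# Generic ordinal-regression head (cumulative-threshold encoding),
# a textbook sketch rather than the team's exact model.
import torch
import torch.nn as nn

class OrdinalHead(nn.Module):
    """Predicts K ordered levels via K-1 cumulative sigmoid outputs."""
    def __init__(self, hidden: int, num_levels: int = 5):
        super().__init__()
        self.linear = nn.Linear(hidden, num_levels - 1)

    def forward(self, pooled):                     # (batch, hidden)
        return torch.sigmoid(self.linear(pooled))  # P(y > k) for each k

    @staticmethod
    def decode(cum_probs, threshold: float = 0.5):
        # Rating = 1 + number of thresholds passed.
        return 1 + (cum_probs > threshold).sum(dim=-1)

head = OrdinalHead(hidden=8)
print(OrdinalHead.decode(head(torch.randn(2, 8))))
```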
- The Eval4NLP Shared Task on Explainable Quality Estimation: Overview and Results [20.15825350326367]
Given a source-translation pair, this task requires systems not only to provide a sentence-level score indicating the overall quality of the translation, but also to explain this score by identifying the words that negatively impact translation quality.
We present the data, annotation guidelines and evaluation setup of the shared task, describe the six participating systems, and analyze the results.
arXiv Detail & Related papers (2021-10-08T21:57:08Z)
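One common recipe for the explanation half of the task described above is leave-one-out attribution: score the full translation, then flag words whose removal improves the score. The sketch below uses a stand-in quality function; participating systems used trained QE models.

```python
# Leave-one-out word attribution for explainable QE. The `quality`
# callable is a stand-in for a trained sentence-level QE model.
from typing import Callable

def word_importance(translation: str,
                    quality: Callable[[str], float]):
    words = translation.split()
    base = quality(translation)
    scores = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        # Positive delta: the sentence scores better without this word,
        # so the word likely hurts translation quality.
        scores.append((w, quality(reduced) - base))
    return scores

toy = lambda s: -len(s) / 100          # placeholder quality function
print(word_importance("the the cat sat", toy))
```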
- CAiRE in DialDoc21: Data Augmentation for Information-Seeking Dialogue System [55.43871578056878]
In the DialDoc21 competition, our system achieved an F1 score of 74.95 and an Exact Match score of 60.74 in subtask 1, and a SacreBLEU score of 37.72 in subtask 2.
arXiv Detail & Related papers (2021-06-07T11:40:55Z)
- Reciprocal Feature Learning via Explicit and Implicit Tasks in Scene Text Recognition [60.36540008537054]
In this work, we exploit the implicit task of character counting within traditional text recognition, at no additional annotation cost.
We design a two-branch reciprocal feature learning framework to adequately utilize the features from both tasks.
Experiments on 7 benchmarks show the advantages of the proposed method in both text recognition and the newly built character counting task.
arXiv Detail & Related papers (2021-05-13T12:27:35Z)
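A schematic of the two-branch layout the reciprocal-feature entry above describes: a shared backbone feeding a per-position recognition head and an auxiliary character-counting head. The backbone, feature sizes, and pooling are assumptions, not the paper's architecture.

```python
# Schematic two-branch layout: shared backbone, recognition branch,
# and auxiliary character-counting branch. Sizes are assumptions.
import torch
import torch.nn as nn

class TwoBranchRecognizer(nn.Module):
    def __init__(self, num_classes: int = 37, max_len: int = 25):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, max_len)),
        )
        self.recog_head = nn.Linear(64, num_classes)  # per-position chars
        self.count_head = nn.Linear(64 * max_len, 1)  # how many characters

    def forward(self, images):                 # (B, 3, H, W)
        f = self.backbone(images)              # (B, 64, 1, max_len)
        seq = f.squeeze(2).transpose(1, 2)     # (B, max_len, 64)
        chars = self.recog_head(seq)           # (B, max_len, num_classes)
        count = self.count_head(f.flatten(1))  # (B, 1)
        return chars, count

model = TwoBranchRecognizer()
chars, count = model(torch.randn(2, 3, 32, 128))
print(chars.shape, count.shape)
```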
- Generating Fact Checking Explanations [52.879658637466605]
A crucial piece of the puzzle that is still missing is how to automate the most elaborate part of the process: generating justifications for the verdicts on claims.
This paper provides the first study of how these explanations can be generated automatically based on available claim context.
Our results indicate that optimising both objectives at the same time, rather than training them separately, improves the performance of a fact checking system.
arXiv Detail & Related papers (2020-04-13T05:23:25Z)
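The finding above amounts to multi-task training: a fact-checking system optimizes a veracity-classification loss and an explanation-generation loss jointly instead of in isolation. A generic weighted-sum sketch, not the paper's exact objective:

```python
# Generic joint objective for fact checking: classification loss over
# veracity labels plus token-level loss for explanation generation.
# The weighting and shapes are illustrative assumptions.
import torch

def joint_loss(veracity_logits, veracity_labels,
               explanation_logits, explanation_tokens,
               weight: float = 0.5):
    ce = torch.nn.functional.cross_entropy
    loss_veracity = ce(veracity_logits, veracity_labels)
    loss_explain = ce(
        explanation_logits.flatten(0, 1),   # (batch*seq, vocab)
        explanation_tokens.flatten(),       # (batch*seq,)
    )
    # Optimising both objectives at once is what the paper reports helps.
    return loss_veracity + weight * loss_explain

logits = torch.randn(4, 3)                  # 3 veracity classes
labels = torch.randint(0, 3, (4,))
exp_logits = torch.randn(4, 10, 100)        # toy vocabulary of 100
exp_tokens = torch.randint(0, 100, (4, 10))
print(joint_loss(logits, labels, exp_logits, exp_tokens))
```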