LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task
- URL: http://arxiv.org/abs/2510.08460v1
- Date: Thu, 09 Oct 2025 17:04:28 GMT
- Title: LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task
- Authors: Elisa Leonardelli, Silvia Casola, Siyao Peng, Giulia Rizzi, Valerio Basile, Elisabetta Fersini, Diego Frassinelli, Hyewon Jang, Maja Pavlovic, Barbara Plank, Massimo Poesio
- Abstract summary: The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference.
- Score: 38.500623751317896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many researchers have reached the conclusion that AI models should be trained to be aware of the possibility of variation and disagreement in human judgments, and evaluated on their ability to recognize such variation. The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models, by making suitable datasets more accessible and by developing evaluation methods. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference, with labeling schemes that include not only categorical judgments as in previous editions, but ordinal judgments as well. Another novelty is that we adopt two complementary paradigms to evaluate disagreement-aware systems: the soft-label approach, in which models predict population-level distributions of judgments, and the perspectivist approach, in which models predict the interpretations of individual annotators. Crucially, we moved beyond standard metrics such as cross-entropy, and tested new evaluation metrics for the two paradigms. The task attracted diverse participation, and the results provide insights into the strengths and limitations of methods for modeling variation. Together, these contributions strengthen LEWIDI as a framework and provide new resources, benchmarks, and findings to support the development of disagreement-aware technologies.
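To make the two evaluation paradigms concrete, below is a minimal sketch of how systems might be scored under each, assuming an average Manhattan distance between predicted and observed label distributions for the soft-label paradigm and a per-annotator error rate for the perspectivist paradigm. These metric choices and all names in the code are illustrative assumptions, not the task's official definitions.

```python
# Illustrative sketch of the two LeWiDi evaluation paradigms.
# The metric choices below are assumptions for exposition,
# not the shared task's official metrics.
import numpy as np

def soft_label_score(pred_dists, gold_dists):
    """Soft-label paradigm: compare the model's predicted distribution
    of judgments for each item against the observed annotator
    distribution, via the average Manhattan distance (lower is better)."""
    pred = np.asarray(pred_dists, dtype=float)
    gold = np.asarray(gold_dists, dtype=float)
    return np.abs(pred - gold).sum(axis=1).mean()

def ordinal_wasserstein_score(pred_dists, gold_dists):
    """Distance-aware alternative for ordinal scales: the 1-Wasserstein
    distance between two distributions over unit-spaced ordinal labels
    equals the L1 distance between their cumulative distributions, so
    predictions that miss by a wider margin are penalized more."""
    pred_cdf = np.cumsum(np.asarray(pred_dists, dtype=float), axis=1)
    gold_cdf = np.cumsum(np.asarray(gold_dists, dtype=float), axis=1)
    return np.abs(pred_cdf - gold_cdf).sum(axis=1).mean()

def perspectivist_score(pred_by_annotator, gold_by_annotator):
    """Perspectivist paradigm: predict each individual annotator's
    judgment; here, the fraction of (annotator, item) pairs predicted
    incorrectly (lower is better)."""
    errors = total = 0
    for ann, gold_labels in gold_by_annotator.items():
        for item, gold in gold_labels.items():
            errors += int(pred_by_annotator[ann][item] != gold)
            total += 1
    return errors / total

# Toy example: 2 items, 3 ordinal label values (0 < 1 < 2).
gold_dists = [[0.6, 0.3, 0.1], [0.2, 0.2, 0.6]]  # observed judgment shares
pred_dists = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]  # model's predictions
print(soft_label_score(pred_dists, gold_dists))           # 0.2
print(ordinal_wasserstein_score(pred_dists, gold_dists))  # 0.1

gold_ann = {"ann1": {"it1": 1, "it2": 2}, "ann2": {"it1": 0, "it2": 2}}
pred_ann = {"ann1": {"it1": 1, "it2": 1}, "ann2": {"it1": 0, "it2": 2}}
print(perspectivist_score(pred_ann, gold_ann))            # 0.25
```

The Wasserstein variant illustrates why moving beyond cross-entropy matters for ordinal labels: cross-entropy treats label values as unordered categories, whereas a distance-aware metric rewards predictions that land close on the ordinal scale even when not exactly right.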
Related papers
- A Unified Evaluation Framework for Multi-Annotator Tendency Learning [6.801084054135531]
We propose the first unified evaluation framework with two novel metrics. Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies. Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance.
arXiv Detail & Related papers (2025-08-14T06:50:20Z)
- Modeling Ranking Properties with In-Context Learning [13.34397013426643]
We propose an in-context learning (ICL) approach that eliminates the need for task-specific training for each ranking scenario and dataset. Our method relies on a small number of example rankings that demonstrate the desired trade-offs between objectives for past queries similar to the current input.
arXiv Detail & Related papers (2025-05-23T10:58:22Z)
- Towards Unified Attribution in Explainable AI, Data-Centric AI, and Mechanistic Interpretability [25.096987279649436]
We argue that feature, data, and component attribution methods share fundamental similarities, and that a unified view of them benefits both interpretability and broader AI research. We first analyze popular methods for these three types of attribution and present a unified view demonstrating that these seemingly distinct methods employ similar techniques over different aspects, and thus differ primarily in their perspectives rather than their techniques. We then demonstrate how this unified view enhances understanding of existing attribution methods, highlights shared concepts and evaluation criteria among these methods, and leads to new research directions, both in interpretability research, by addressing common challenges and facilitating cross-attribution innovation, and in AI more broadly.
arXiv Detail & Related papers (2025-01-31T04:42:45Z)
- Towards a Unified Framework for Evaluating Explanations [0.6138671548064356]
We argue that explanations serve as mediators between models and stakeholders, whether for intrinsically interpretable models or opaque black-box models.
We illustrate these criteria, as well as specific evaluation methods, using examples from an ongoing study of an interpretable neural network for predicting a particular learner behavior.
arXiv Detail & Related papers (2024-05-22T21:49:28Z)
- Robust Training of Federated Models with Extremely Label Deficiency [84.00832527512148]
Federated semi-supervised learning (FSSL) has emerged as a powerful paradigm for collaboratively training machine learning models using distributed data with label deficiency.
We propose a novel twin-model paradigm, called Twin-sight, designed to enhance mutual guidance by providing insights from different perspectives of labeled and unlabeled data.
Our comprehensive experiments on four benchmark datasets provide substantial evidence that Twin-sight can significantly outperform state-of-the-art methods across various experimental settings.
arXiv Detail & Related papers (2024-02-22T10:19:34Z)
- MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs).
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
- Co-guiding for Multi-intent Spoken Language Understanding [53.30511968323911]
We propose a novel model termed Co-guiding Net, which implements a two-stage framework achieving mutual guidance between the two tasks.
For the first stage, we propose single-task supervised contrastive learning, and for the second stage, we propose co-guiding supervised contrastive learning.
Experiment results on multi-intent SLU show that our model outperforms existing models by a large margin.
arXiv Detail & Related papers (2023-11-22T08:06:22Z)
- Bias and Fairness in Large Language Models: A Survey [73.87651986156006]
We present a comprehensive survey of bias evaluation and mitigation techniques for large language models (LLMs).
We first consolidate, formalize, and expand notions of social bias and fairness in natural language processing.
We then unify the literature by proposing three intuitive taxonomies: two for bias evaluation and one for mitigation.
arXiv Detail & Related papers (2023-09-02T00:32:55Z)
- iLab at SemEval-2023 Task 11 Le-Wi-Di: Modelling Disagreement or Modelling Perspectives? [17.310208612897814]
We adapt a multi-task architecture to evaluate its performance on the SEMEVAL Task 11.
We find that a multi-task approach performed poorly on datasets which contained distinct annotator opinions.
We argue that perspectivist approaches are preferable because they enable decision makers to amplify minority views.
arXiv Detail & Related papers (2023-05-10T11:55:17Z)
- Exploring the Trade-off between Plausibility, Change Intensity and Adversarial Power in Counterfactual Explanations using Multi-objective Optimization [73.89239820192894]
We argue that automated counterfactual generation should regard several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z)
- On the Faithfulness Measurements for Model Interpretations [100.2730234575114]
Post-hoc interpretations aim to uncover how natural language processing (NLP) models make predictions.
To tackle these issues, we start with three criteria: the removal-based criterion, the sensitivity of interpretations, and the stability of interpretations.
Motivated by the desideratum of these faithfulness notions, we introduce a new class of interpretation methods that adopt techniques from the adversarial domain.
arXiv Detail & Related papers (2021-04-18T09:19:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.