A Unified Evaluation Framework for Multi-Annotator Tendency Learning
- URL: http://arxiv.org/abs/2508.10393v1
- Date: Thu, 14 Aug 2025 06:50:20 GMT
- Title: A Unified Evaluation Framework for Multi-Annotator Tendency Learning
- Authors: Liyun Zhang, Jingcheng Ke, Shenli Fan, Xuanmeng Sha, Zheng Lian,
- Abstract summary: We propose the first unified evaluation framework with two novel metrics. Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies. Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance.
- Score: 6.801084054135531
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have emerged in multi-annotator learning that shift focus from Consensus-oriented Learning (CoL), which aggregates multiple annotations into a single ground-truth prediction, to Individual Tendency Learning (ITL), which models annotator-specific labeling behavior patterns (i.e., tendency) to provide explanation analysis for understanding annotator decisions. However, no evaluation framework currently exists to assess whether ITL methods truly capture individual tendencies and provide meaningful behavioral explanations. To address this gap, we propose the first unified evaluation framework with two novel metrics: (1) Difference of Inter-annotator Consistency (DIC) quantifies how well models capture annotator tendencies by comparing predicted inter-annotator similarity structures with ground-truth ones; (2) Behavior Alignment Explainability (BAE) evaluates how well model explanations reflect annotator behavior and decision relevance by aligning explainability-derived similarity structures with ground-truth labeling similarity structures via Multidimensional Scaling (MDS). Extensive experiments validate the effectiveness of our proposed evaluation framework.
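To make the two metrics concrete, here is a minimal sketch, not the authors' released implementation: it assumes an agreement-rate similarity matrix and hypothetical helper names (`dic_style_score`, `bae_style_score`), and uses NumPy, scikit-learn's MDS, and SciPy's Procrustes alignment as stand-ins for the paper's exact formulations.

```python
# Hypothetical sketch of DIC-style and BAE-style scores; the paper's exact
# definitions may differ.
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial import procrustes


def inter_annotator_similarity(labels: np.ndarray) -> np.ndarray:
    """labels: (num_annotators, num_samples) categorical labels.
    Returns an (A, A) matrix of pairwise label-agreement rates."""
    A = labels.shape[0]
    sim = np.zeros((A, A))
    for i in range(A):
        for j in range(A):
            sim[i, j] = np.mean(labels[i] == labels[j])
    return sim


def dic_style_score(true_labels: np.ndarray, pred_labels: np.ndarray) -> float:
    """DIC-style idea: compare predicted vs. ground-truth inter-annotator
    consistency structures (here, mean absolute difference)."""
    return float(np.abs(inter_annotator_similarity(true_labels)
                        - inter_annotator_similarity(pred_labels)).mean())


def bae_style_score(expl_similarity: np.ndarray, label_similarity: np.ndarray) -> float:
    """BAE-style idea: embed explanation-derived and label-derived annotator
    similarity structures with MDS, then measure their alignment via
    Procrustes disparity (lower = better aligned)."""
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
    emb_expl = mds.fit_transform(1.0 - expl_similarity)   # similarity -> dissimilarity
    emb_label = mds.fit_transform(1.0 - label_similarity)
    _, _, disparity = procrustes(emb_label, emb_expl)
    return float(disparity)


# Toy usage: 4 annotators, 20 samples, binary labels. The explanation-derived
# similarity is faked here from predicted labels purely for illustration.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(4, 20))
y_pred = rng.integers(0, 2, size=(4, 20))
print("DIC-style:", dic_style_score(y_true, y_pred))
print("BAE-style:", bae_style_score(inter_annotator_similarity(y_pred),
                                    inter_annotator_similarity(y_true)))
```

In this sketch, a lower DIC-style value means the model's predicted annotator similarity structure is closer to the ground-truth one, and a lower Procrustes disparity means the explanation-derived structure aligns better with the labeling structure.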
Related papers
- Beyond Consensus: Perspectivist Modeling and Evaluation of Annotator Disagreement in NLP [25.097081181685613]
Annotator disagreement is widespread in NLP, particularly for subjective and ambiguous tasks such as toxicity detection and stance analysis. We first present a domain-agnostic taxonomy of the sources of disagreement spanning data, task, and annotator factors. We then synthesize modeling approaches using a common framework defined by prediction targets and pooling structure.
arXiv Detail & Related papers (2026-01-14T01:26:29Z) - Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models. We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
arXiv Detail & Related papers (2025-10-14T20:50:30Z) - LeWiDi-2025 at NLPerspectives: The Third Edition of the Learning with Disagreements Shared Task [38.500623751317896]
The LEWIDI series of shared tasks on Learning With Disagreements was established to promote this approach to training and evaluating AI models. The third edition of the task builds on this goal by extending the LEWIDI benchmark to four datasets spanning paraphrase identification, irony detection, sarcasm detection, and natural language inference.
arXiv Detail & Related papers (2025-10-09T17:04:28Z) - QuMAB: Query-based Multi-Annotator Behavior Modeling with Reliability under Sparse Labels [23.555446749682467]
Multi-annotator learning traditionally aggregates diverse annotations to approximate a single ground truth, treating disagreements as noise. We introduce a paradigm shift from sample-wise aggregation to annotator-wise behavior modeling. By treating annotator disagreements as valuable information rather than noise, modeling annotator-specific behavior patterns can reconstruct unlabeled data to reduce annotation cost, enhance aggregation reliability, and explain annotator decision behavior.
arXiv Detail & Related papers (2025-07-23T16:17:43Z) - Rethinking Robustness in Machine Learning: A Posterior Agreement Approach [45.284633306624634]
Posterior Agreement (PA) theory of model validation provides a principled framework for robustness evaluation. We show that the PA metric provides a sensible and consistent analysis of the vulnerabilities in learning algorithms, even in the presence of few observations.
arXiv Detail & Related papers (2025-03-20T16:03:39Z) - Estimating Commonsense Plausibility through Semantic Shifts [66.06254418551737]
We propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts. Evaluations on two types of fine-grained commonsense plausibility estimation tasks show that ComPaSS consistently outperforms baselines.
arXiv Detail & Related papers (2025-02-19T06:31:06Z) - Model-free Methods for Event History Analysis and Efficient Adjustment (PhD Thesis) [55.2480439325792]
This thesis is a series of independent contributions to statistics unified by a model-free perspective. The first chapter elaborates on how a model-free perspective can be used to formulate flexible methods that leverage prediction techniques from machine learning. The second chapter studies the concept of local independence, which describes whether the evolution of one process is directly influenced by another.
arXiv Detail & Related papers (2025-02-11T19:24:09Z) - Explaining the Unexplained: Revealing Hidden Correlations for Better Interpretability [1.8274323268621635]
Real Explainer (RealExp) is an interpretability method that decouples the Shapley Value into individual feature importance and feature correlation importance. RealExp enhances interpretability by precisely quantifying both individual feature contributions and their interactions.
arXiv Detail & Related papers (2024-12-02T10:50:50Z) - Counterfactuals of Counterfactuals: a back-translation-inspired approach to analyse counterfactual editors [3.4253416336476246]
We focus on the analysis of counterfactual, contrastive explanations.
We propose a new back translation-inspired evaluation methodology.
We show that by iteratively feeding the counterfactual to the explainer we can obtain valuable insights into the behaviour of both the predictor and the explainer models.
arXiv Detail & Related papers (2023-05-26T16:04:28Z) - Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation [53.83642844626703]
We provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation.
Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates.
arXiv Detail & Related papers (2021-06-24T15:58:01Z) - Trusted Multi-View Classification [76.73585034192894]
We propose a novel multi-view classification method, termed trusted multi-view classification.
It provides a new paradigm for multi-view learning by dynamically integrating different views at an evidence level.
The proposed algorithm jointly utilizes multiple views to promote both classification reliability and robustness.
arXiv Detail & Related papers (2021-02-03T13:30:26Z) - Learning Causal Semantic Representation for Out-of-Distribution Prediction [125.38836464226092]
We propose a Causal Semantic Generative model (CSG) based on causal reasoning, so that the two factors are modeled separately.
We show that CSG can identify the semantic factor by fitting training data, and this semantic-identification guarantees the boundedness of OOD generalization error.
arXiv Detail & Related papers (2020-11-03T13:16:05Z) - Evaluations and Methods for Explanation through Robustness Analysis [117.7235152610957]
We establish a novel set of evaluation criteria for such feature-based explanations via robustness analysis.
We obtain new explanations that are loosely necessary and sufficient for a prediction.
We extend the explanation to extract the set of features that would move the current prediction to a target class.
arXiv Detail & Related papers (2020-05-31T05:52:05Z)