Investigating the Nature of Disagreements on Mid-Scale Ratings: A Case
Study on the Abstractness-Concreteness Continuum
- URL: http://arxiv.org/abs/2311.04563v1
- Date: Wed, 8 Nov 2023 09:52:58 GMT
- Title: Investigating the Nature of Disagreements on Mid-Scale Ratings: A Case
Study on the Abstractness-Concreteness Continuum
- Authors: Urban Knupleš, Diego Frassinelli, Sabine Schulte im Walde
- Abstract summary: Humans tend to strongly agree on ratings on a scale for extreme cases, but judgements on mid-scale words exhibit more disagreement.
Our study focuses on concreteness ratings and implements correlations and supervised classification to identify salient multi-modal characteristics of mid-scale words.
Our results suggest either fine-tuning or filtering mid-scale target words before utilising them.
- Score: 8.086165096687772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Humans tend to strongly agree on ratings on a scale for extreme cases (e.g.,
a CAT is judged as very concrete), but judgements on mid-scale words exhibit
more disagreement. Yet, collected rating norms are heavily exploited across
disciplines. Our study focuses on concreteness ratings and (i) implements
correlations and supervised classification to identify salient multi-modal
characteristics of mid-scale words, and (ii) applies a hard clustering to
identify patterns of systematic disagreement across raters. Our results suggest
either fine-tuning or filtering mid-scale target words before utilising them.
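The analyses described above (correlating multi-modal features with disagreement, classifying mid-scale versus extreme words, and hard-clustering raters) can be illustrated with a minimal sketch on synthetic data. The feature names, the 2-4 mid-scale cut-offs, and the number of clusters below are assumptions for illustration, not the paper's actual setup:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 words rated by 30 raters on a 1-5 concreteness
# scale, plus two invented per-word predictors (frequency, imageability).
words = [f"w{i}" for i in range(200)]
true_mean = rng.uniform(1, 5, size=200)
ratings = pd.DataFrame(
    [(w, r, float(np.clip(rng.normal(m, 0.3 + 0.5 * (2.0 < m < 4.0)), 1, 5)))
     for w, m in zip(words, true_mean) for r in range(30)],
    columns=["word", "rater", "score"],
)
features = pd.DataFrame(
    {"frequency": rng.normal(size=200),
     "imageability": true_mean + rng.normal(0, 0.5, size=200)},
    index=words,
)

per_word = ratings.groupby("word")["score"].agg(["mean", "std"])

# (i) Correlate each predictor with rating disagreement (std across raters).
for col in features.columns:
    rho, p = spearmanr(features.loc[per_word.index, col], per_word["std"])
    print(f"{col}: rho={rho:.2f} (p={p:.3f})")

# (i) Classify mid-scale vs. extreme words; the 2-4 cut-offs are illustrative.
y = per_word["mean"].between(2.0, 4.0).astype(int)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
print("mid-scale classification accuracy:",
      cross_val_score(clf, features.loc[per_word.index], y, cv=5).mean())

# (ii) Hard-cluster raters by their full rating profiles to expose systematic
# disagreement patterns (each row = one rater's scores over all words).
profiles = ratings.pivot_table(index="rater", columns="word", values="score")
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles)
print("rater cluster sizes:", np.bincount(clusters))
```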
Related papers
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements are in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Rater Cohesion and Quality from a Vicarious Perspective [22.445283423317754]
Vicarious annotation is a method for breaking down disagreement by asking raters how they think others would annotate the data.
We employ rater cohesion metrics to study the potential influence of political affiliations and demographic backgrounds on raters' perceptions of offense.
We study how the rater quality metrics influence the in-group and cross-group rater cohesion across the personal and vicarious levels.
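As a rough illustration of comparing in-group and cross-group rater cohesion (the paper's actual cohesion and quality metrics are not reproduced here), one could contrast mean pairwise correlations within and across annotator groups; the groups and scores below are entirely hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 12 raters in two demographic/political groups ("A", "B")
# score the same 50 items on a 1-5 offensiveness scale.
groups = {"A": list(range(6)), "B": list(range(6, 12))}
scores = rng.integers(1, 6, size=(12, 50)).astype(float)

def pairwise_cohesion(raters_x, raters_y):
    """Mean pairwise Pearson correlation between raters from the two sets
    (a simple stand-in for the cohesion metrics used in the paper)."""
    pairs = [(i, j) for i in raters_x for j in raters_y if i != j]
    return float(np.mean([np.corrcoef(scores[i], scores[j])[0, 1] for i, j in pairs]))

print("in-group A:  ", round(pairwise_cohesion(groups["A"], groups["A"]), 2))
print("in-group B:  ", round(pairwise_cohesion(groups["B"], groups["B"]), 2))
print("cross-group: ", round(pairwise_cohesion(groups["A"], groups["B"]), 2))
```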
arXiv Detail & Related papers (2024-08-15T20:37:36Z)
- Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics [22.041561519672456]
Large language models (LLMs) often produce unsupported or unverifiable content, known as "hallucinations".
We propose a comparative evaluation framework that assesses how effectively faithfulness metrics distinguish citations across three levels of support.
Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support.
arXiv Detail & Related papers (2024-06-21T15:57:24Z)
- RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank [54.854714257687334]
We propose a novel approach, RankCSE, for unsupervised sentence representation learning.
It incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework.
An extensive set of experiments is conducted on both semantic textual similarity (STS) and transfer (TR) tasks.
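A hedged sketch of how a contrastive objective can be combined with listwise ranking distillation, in the spirit of the framework described, might look as follows; this is not the paper's exact loss, and the similarity matrices and temperature are placeholders:

```python
import torch
import torch.nn.functional as F

def rankcse_style_losses(student_sim, teacher_sim, tau=0.05):
    """Illustrative losses only: an InfoNCE contrastive term plus a listwise
    ranking-distillation term (KL between teacher and student similarity
    distributions). Both inputs are [batch, batch] cosine-similarity matrices
    between two views of the same sentences; diagonal entries are positives."""
    labels = torch.arange(student_sim.size(0), device=student_sim.device)
    contrastive = F.cross_entropy(student_sim / tau, labels)
    distill = F.kl_div(
        F.log_softmax(student_sim / tau, dim=-1),
        F.softmax(teacher_sim / tau, dim=-1),
        reduction="batchmean",
    )
    return contrastive, distill

# Toy usage with random matrices standing in for real encoder similarities.
student, teacher = torch.randn(8, 8), torch.randn(8, 8)
contrastive, distill = rankcse_style_losses(student, teacher)
print(float(contrastive), float(distill))
```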
arXiv Detail & Related papers (2023-05-26T08:27:07Z)
- Using Natural Language Explanations to Rescale Human Judgments [81.66697572357477]
We propose a method to rescale ordinal annotations using their natural language explanations and large language models (LLMs).
We feed annotators' Likert ratings and corresponding explanations into an LLM and prompt it to produce a numeric score anchored in a scoring rubric.
Our method rescales the raw judgments without impacting agreement and brings the scores closer to human judgments grounded in the same scoring rubric.
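A minimal sketch of this rescaling idea, assuming an OpenAI-style chat client; the rubric text, prompt wording, and model name below are placeholders rather than the paper's actual template:

```python
from openai import OpenAI  # assumes the `openai` package; any chat-LLM client works

# Illustrative rubric; the paper's actual rubric and prompt differ.
RUBRIC = """1 = response is incoherent or irrelevant to the dialogue
3 = response is partially relevant but vague or generic
5 = response is coherent, specific and fully relevant"""

def rescale(likert_rating: int, explanation: str, model: str = "gpt-4o-mini") -> float:
    """Feed an annotator's Likert rating plus their free-text explanation to an
    LLM and ask for a numeric score anchored in the rubric (sketch only)."""
    prompt = (
        f"Scoring rubric:\n{RUBRIC}\n\n"
        f"An annotator gave a rating of {likert_rating} on a 1-5 Likert scale and "
        f"explained their judgment as follows:\n{explanation}\n\n"
        "Using only the rubric above, reply with a single number between 1 and 5."
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return float(response.choices[0].message.content.strip())

print(rescale(3, "The reply is on topic but mostly restates the question."))
```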
arXiv Detail & Related papers (2023-05-24T06:19:14Z)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
- Towards Quantifiable Dialogue Coherence Evaluation [126.55560816209756]
Quantifiable Dialogue Coherence Evaluation (QuantiDCE) is a novel framework aiming to train a quantifiable dialogue coherence metric.
QuantiDCE includes two training stages, Multi-Level Ranking (MLR) pre-training and Knowledge Distillation (KD) fine-tuning.
Experimental results show that the model trained by QuantiDCE presents stronger correlations with human judgements than the other state-of-the-art metrics.
arXiv Detail & Related papers (2021-06-01T14:11:17Z)
- Dynamic Semantic Matching and Aggregation Network for Few-shot Intent Detection [69.2370349274216]
Few-shot Intent Detection is challenging due to the scarcity of available annotated utterances.
Semantic components are distilled from utterances via multi-head self-attention.
Our method provides a comprehensive matching measure to enhance representations of both labeled and unlabeled instances.
arXiv Detail & Related papers (2020-06-13T15:41:29Z)
- Uncertainty-aware Score Distribution Learning for Action Quality Assessment [91.05846506274881]
We propose an uncertainty-aware score distribution learning (USDL) approach for action quality assessment (AQA).
Specifically, we regard an action as an instance associated with a score distribution, which describes the probability of different evaluated scores.
Under the circumstance where fine-grained score labels are available, we devise a multi-path uncertainty-aware score distributions learning (MUSDL) method to explore the disentangled components of a score.
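A small sketch of the underlying idea (representing a scalar quality score as a Gaussian distribution over discrete score bins and training with a KL objective), where the 0-100 score range, bin count, and sigma are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def score_to_distribution(score: float, num_bins: int = 101, sigma: float = 5.0) -> torch.Tensor:
    """Turn a scalar ground-truth score (here 0..100) into a discrete Gaussian
    distribution over score bins, so the label carries its own uncertainty."""
    bins = torch.arange(num_bins, dtype=torch.float32)
    dist = torch.exp(-0.5 * ((bins - score) / sigma) ** 2)
    return dist / dist.sum()

def usdl_style_loss(predicted_logits: torch.Tensor, gt_score: float) -> torch.Tensor:
    """KL divergence between the network's predicted score distribution and the
    Gaussian built from the ground-truth score (a sketch of the USDL objective)."""
    target = score_to_distribution(gt_score, num_bins=predicted_logits.numel())
    return F.kl_div(F.log_softmax(predicted_logits, dim=-1), target, reduction="sum")

logits = torch.randn(101)  # stand-in for a backbone's output over 101 score bins
print(float(usdl_style_loss(logits, gt_score=72.5)))
```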
arXiv Detail & Related papers (2020-06-13T15:41:29Z)