Collective Human Opinions in Semantic Textual Similarity
- URL: http://arxiv.org/abs/2308.04114v1
- Date: Tue, 8 Aug 2023 08:00:52 GMT
- Title: Collective Human Opinions in Semantic Textual Similarity
- Authors: Yuxia Wang, Shimin Tao, Ning Xie, Hao Yang, Timothy Baldwin, Karin
Verspoor
- Abstract summary: We introduce USTS, the first Uncertainty-aware STS dataset with 15,000 Chinese sentence pairs and 150,000 labels.
We show that current STS models cannot capture the variance caused by human disagreement on individual instances.
- Score: 36.780812651679376
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the subjective nature of semantic textual similarity (STS) and
pervasive disagreements in STS annotation, existing benchmarks have used
averaged human ratings as the gold standard. Averaging masks the true
distribution of human opinions on examples of low agreement, and prevents
models from capturing the semantic vagueness that the individual ratings
represent. In this work, we introduce USTS, the first Uncertainty-aware STS
dataset with ~15,000 Chinese sentence pairs and 150,000 labels, to study
collective human opinions in STS. Analysis reveals that neither a scalar nor a
single Gaussian fits a set of observed judgements adequately. We further show
that current STS models cannot capture the variance caused by human
disagreement on individual instances, but rather reflect the predictive
confidence over the aggregate dataset.
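To make the abstract's claim concrete, the following minimal sketch (Python; the annotator ratings are invented, not drawn from USTS) shows how an averaged gold score and a single fitted Gaussian can both misrepresent a bimodal set of judgements on the 0-5 STS scale:
```python
# Hypothetical ratings for one sentence pair with split opinions (0-5 STS scale).
# None of these numbers come from USTS; they only illustrate the point.
import numpy as np
from scipy import stats

ratings = np.array([1.0, 1.0, 1.0, 1.5, 1.5, 1.5, 2.0,
                    4.0, 4.5, 4.5, 4.5, 5.0, 5.0, 5.0, 5.0])

gold = ratings.mean()                   # the usual averaged "gold" label
mu, sigma = gold, ratings.std(ddof=1)   # single-Gaussian fit to the same ratings

# Compare probability mass in the middle band [2.5, 3.5], where nobody rated:
empirical_mass = np.mean((ratings >= 2.5) & (ratings <= 3.5))
gaussian_mass = stats.norm.cdf(3.5, mu, sigma) - stats.norm.cdf(2.5, mu, sigma)

print(f"averaged gold score               : {gold:.2f}")            # ~3.1
print(f"empirical mass in [2.5, 3.5]      : {empirical_mass:.2f}")  # 0.00
print(f"single-Gaussian mass in [2.5, 3.5]: {gaussian_mass:.2f}")   # ~0.22
```
Both summaries place the pair near the middle of the scale even though every hypothetical annotator judged it as either clearly dissimilar or clearly similar, which is the distribution-masking effect described above.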
Related papers
- Robust Evaluation Measures for Evaluating Social Biases in Masked
Language Models [6.697298321551588]
We construct evaluation measures for the distributions of stereotypical and anti-stereotypical scores.
Our proposed measures are significantly more robust and interpretable than those proposed previously.
arXiv Detail & Related papers (2024-01-21T21:21:51Z)
- Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluating the faithfulness of machine-generated text by computing the longest noncontinuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
- Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), an adversarial-attack-based method that produces semantic adversarial images.
arXiv Detail & Related papers (2023-03-23T03:13:04Z)
- Comparing Intrinsic Gender Bias Evaluation Measures without using Human Annotated Examples [33.044775876807826]
We propose a method to compare intrinsic gender bias evaluation measures without relying on human-annotated examples.
Specifically, we create bias-controlled versions of language models using varying amounts of male vs. female gendered sentences.
We then compute the rank correlation between the bias scores and the gender proportions used to fine-tune the pre-trained language models (PLMs).
arXiv Detail & Related papers (2023-01-28T03:11:50Z)
- Holistic Approach to Measure Sample-level Adversarial Vulnerability and its Utility in Building Trustworthy Systems [17.707594255626216]
An adversarial attack perturbs an image with imperceptible noise, leading to an incorrect model prediction.
We propose a holistic approach for quantifying adversarial vulnerability of a sample by combining different perspectives.
We demonstrate that by reliably estimating adversarial vulnerability at the sample level, it is possible to develop a trustworthy system.
arXiv Detail & Related papers (2022-05-05T12:36:17Z)
- SMURF: SeMantic and linguistic UndeRstanding Fusion for Caption Evaluation via Typicality Analysis [20.026835809227283]
We introduce "typicality", a new formulation of evaluation rooted in information theory.
We show how these decomposed dimensions of semantics and fluency provide greater system-level insight into captioner differences.
Our proposed metrics along with their combination, SMURF, achieve state-of-the-art correlation with human judgment when compared with other rule-based evaluation metrics.
arXiv Detail & Related papers (2021-06-02T19:58:20Z)
- A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT.
The results show that the alleged superiority of predict-based models is more apparent than real, and certainly not ubiquitous.
We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
- What Can We Learn from Collective Human Opinions on Natural Language Inference Data? [88.90490998032429]
ChaosNLI is a dataset with a total of 464,500 annotations to study Collective HumAn OpinionS.
This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI.
arXiv Detail & Related papers (2020-10-07T17:26:06Z)
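The ChaosNLI entry above collects 100 annotations per example; below is a small, hedged sketch (all vote counts and softmax values are made up) of how such collective labels can be turned into a distribution, scored for disagreement, and compared against a model's predicted distribution:
```python
# Hypothetical example: turn 100 per-example NLI votes into a label distribution,
# measure disagreement via entropy, and compare it to a model's softmax output.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

LABELS = ["entailment", "neutral", "contradiction"]

votes = np.array([58, 30, 12])              # invented annotator counts for one pair
human_dist = votes / votes.sum()            # collective human label distribution

model_dist = np.array([0.90, 0.07, 0.03])   # invented model softmax for the same pair

print("human label distribution:", dict(zip(LABELS, np.round(human_dist, 2))))
print("human entropy (nats)    :", round(float(entropy(human_dist)), 3))
print("JS distance to the model:", round(float(jensenshannon(human_dist, model_dist)), 3))
```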