Robust Evaluation Measures for Evaluating Social Biases in Masked
Language Models
- URL: http://arxiv.org/abs/2401.11601v1
- Date: Sun, 21 Jan 2024 21:21:51 GMT
- Title: Robust Evaluation Measures for Evaluating Social Biases in Masked
Language Models
- Authors: Yang Liu
- Abstract summary: We construct evaluation measures for the distributions of stereotypical and anti-stereotypical scores.
Our proposed measures are significantly more robust and interpretable than those proposed previously.
- Score: 6.697298321551588
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many evaluation measures are used to evaluate social biases in masked
language models (MLMs). However, we find that these previously proposed
evaluation measures lack robustness in scenarios with limited data. This is
because these measures compare the pseudo-log-likelihood (PLL) scores of
stereotypical and anti-stereotypical samples using an indicator function, which
exploits the score sets only in a limited way and fails to capture their
distributional information. In this
paper, we represent a PLL score set as a Gaussian distribution and use
Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence to construct
evaluation measures for the distributions of stereotypical and
anti-stereotypical PLL scores. Experimental results on the publicly available
datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures
are significantly more robust and interpretable than those proposed previously.
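
A minimal sketch of the recipe described in the abstract, not the paper's exact measures: fit a Gaussian to each PLL score set and compare the two fitted distributions with KL divergence (closed form for Gaussians) and JS divergence (numerical integral). The PLL scores, sample sizes, and grid settings below are invented for illustration; the sketch assumes only NumPy and SciPy.

```python
# Minimal sketch (synthetic scores, not the paper's exact measures): fit a
# Gaussian to each set of PLL scores, then compare the fits with KL and JS.
import numpy as np
from scipy.stats import norm

def gaussian_kl(mu1, s1, mu2, s2):
    """Closed-form KL( N(mu1, s1^2) || N(mu2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

def gaussian_js(mu1, s1, mu2, s2, grid_size=10_000):
    """JS divergence between two Gaussians via numerical integration
    (no closed form exists for the Gaussian/mixture case)."""
    lo = min(mu1 - 6 * s1, mu2 - 6 * s2)
    hi = max(mu1 + 6 * s1, mu2 + 6 * s2)
    x = np.linspace(lo, hi, grid_size)
    p, q = norm.pdf(x, mu1, s1), norm.pdf(x, mu2, s2)
    m = 0.5 * (p + q)
    return 0.5 * (np.trapz(p * np.log(p / m), x) + np.trapz(q * np.log(q / m), x))

# Hypothetical PLL scores for stereotypical / anti-stereotypical sentences.
rng = np.random.default_rng(0)
stereo_pll = rng.normal(-42.0, 6.0, size=500)
anti_pll = rng.normal(-45.0, 7.0, size=500)

mu_s, sd_s = stereo_pll.mean(), stereo_pll.std(ddof=1)
mu_a, sd_a = anti_pll.mean(), anti_pll.std(ddof=1)

print("KL :", gaussian_kl(mu_s, sd_s, mu_a, sd_a))
print("JS :", gaussian_js(mu_s, sd_s, mu_a, sd_a))
```

Because the comparison uses the fitted distributions rather than an indicator over paired samples, the resulting numbers change smoothly as the score sets shrink, which is the intuition behind the robustness claim.
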
Related papers
- The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models [22.75594773147521]
We introduce the Rank-Allocation-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in large language models (LLMs).
Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes.
Our work highlights the need to account for how models are used in resource-constrained contexts.
arXiv Detail & Related papers (2024-08-02T14:13:06Z) - Covariate Assisted Entity Ranking with Sparse Intrinsic Scores [3.2839905453386162]
We introduce novel model identification conditions and examine the statistical rates of the regularized penalized Maximum Likelihood Estimator.
We also apply our method to the goodness-of-fit test for models with no latent intrinsic scores.
arXiv Detail & Related papers (2024-07-09T19:58:54Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Constructing Holistic Measures for Social Biases in Masked Language
Models [17.45153670825904]
Masked Language Models (MLMs) have been successful in many natural language processing tasks.
Real-world stereotype biases are likely to be reflected in MLMs due to their learning from large text corpora.
Two evaluation metrics, the Kullback-Leibler Divergence Score (KLDivS) and the Jensen-Shannon Divergence Score (JSDivS), are proposed to evaluate social biases in MLMs.
arXiv Detail & Related papers (2023-05-12T23:09:06Z) - A Tale of Sampling and Estimation in Discounted Reinforcement Learning [50.43256303670011]
We present a minimax lower bound on the discounted mean estimation problem.
We show that estimating the mean by directly sampling from the discounted kernel of the Markov process yields compelling statistical properties.
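
As a rough, hedged illustration of the technique this summary names (not the paper's actual analysis or bounds), the sketch below compares a truncated discount-weighted average with a single reward observed at a geometrically distributed horizon, which amounts to sampling from the discounted kernel of the chain. The two-state chain, reward vector, and discount factor are invented for illustration.

```python
# Hedged sketch (toy, invented 2-state Markov reward process): estimate the
# normalized discounted mean (1 - gamma) * E[sum_t gamma^t r_t] two ways:
#   (a) a truncated, gamma-weighted average along a sampled trajectory;
#   (b) a single reward at a horizon T ~ Geometric(1 - gamma), i.e. a direct
#       sample from the discounted kernel of the chain.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # assumed transition matrix
r = np.array([0.0, 1.0])     # assumed reward per state
gamma = 0.9
rng = np.random.default_rng(0)

def rollout(s0, length):
    states = [s0]
    for _ in range(length - 1):
        states.append(rng.choice(2, p=P[states[-1]]))
    return np.array(states)

def weighted_estimate(s0, horizon=100):
    states = rollout(s0, horizon)
    weights = (1 - gamma) * gamma ** np.arange(horizon)
    return float(np.sum(weights * r[states]))

def discounted_kernel_sample(s0):
    T = rng.geometric(1 - gamma) - 1        # support {0, 1, 2, ...}
    return float(r[rollout(s0, T + 1)[-1]])

n = 2000
print("weighted average  :", np.mean([weighted_estimate(0) for _ in range(n)]))
print("discounted kernel :", np.mean([discounted_kernel_sample(0) for _ in range(n)]))
```

Both estimators target the same quantity, since P(T = t) = (1 - gamma) * gamma^t makes the single-sample reward an unbiased draw from the discounted occupancy of the chain.
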
arXiv Detail & Related papers (2023-04-11T09:13:17Z) - Rethinking Collaborative Metric Learning: Toward an Efficient
Alternative without Negative Sampling [156.7248383178991]
The Collaborative Metric Learning (CML) paradigm has attracted wide interest in the area of recommendation systems (RS).
We find that negative sampling would lead to a biased estimation of the generalization error.
Motivated by this, we propose an efficient alternative without negative sampling for CML named Sampling-Free Collaborative Metric Learning (SFCML).
arXiv Detail & Related papers (2022-06-23T08:50:22Z) - StaDRe and StaDRo: Reliability and Robustness Estimation of ML-based
Forecasting using Statistical Distance Measures [0.476203519165013]
This work focuses on the use of SafeML for time series data, and on reliability and robustness estimation of ML-forecasting methods using statistical distance measures.
We propose SDD-based Reliability Estimate (StaDRe) and SDD-based Robustness (StaDRo) measures.
With the help of a clustering technique, the similarity between the statistical properties of data seen during training and the forecasts is identified.
arXiv Detail & Related papers (2022-06-17T19:52:48Z) - Deconfounding Scores: Feature Representations for Causal Effect
Estimation with Weak Overlap [140.98628848491146]
We introduce deconfounding scores, which induce better overlap without biasing the target of estimation.
We show that deconfounding scores satisfy a zero-covariance condition that is identifiable in observed data.
In particular, we show that this technique could be an attractive alternative to standard regularizations.
arXiv Detail & Related papers (2021-04-12T18:50:11Z) - A Statistical Analysis of Summarization Evaluation Metrics using
Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
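
As a hedged sketch of the resampling idea behind such confidence intervals (not this paper's exact protocol or data), the snippet below bootstraps a 95% interval for the correlation between hypothetical automatic-metric scores and human judgments; all values are synthetic.

```python
# Hedged sketch (synthetic data, not the paper's protocol): bootstrap a 95%
# confidence interval for the correlation between an automatic metric's
# scores and human judgments over a set of hypothetical summaries.
import numpy as np

rng = np.random.default_rng(0)
n_summaries = 200
human = rng.normal(size=n_summaries)                  # hypothetical human scores
metric = 0.5 * human + rng.normal(size=n_summaries)   # hypothetical metric scores

boot_corrs = []
for _ in range(2000):
    idx = rng.integers(0, n_summaries, size=n_summaries)  # resample with replacement
    boot_corrs.append(np.corrcoef(human[idx], metric[idx])[0, 1])

point = np.corrcoef(human, metric)[0, 1]
lo, hi = np.percentile(boot_corrs, [2.5, 97.5])
print(f"Pearson r = {point:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

A wide interval from such a procedure is what the paper means by high uncertainty in how reliable an automatic metric truly is.
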
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - Performance metrics for intervention-triggering prediction models do not
reflect an expected reduction in outcomes from using the model [71.9860741092209]
Clinical researchers often select among and evaluate risk prediction models.
Standard metrics calculated from retrospective data are only related to model utility under certain assumptions.
When predictions are delivered repeatedly throughout time, the relationship between standard metrics and utility is further complicated.
arXiv Detail & Related papers (2020-06-02T16:26:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.