Unmasking the Mask -- Evaluating Social Biases in Masked Language Models
- URL: http://arxiv.org/abs/2104.07496v1
- Date: Thu, 15 Apr 2021 14:40:42 GMT
- Title: Unmasking the Mask -- Evaluating Social Biases in Masked Language Models
- Authors: Masahiro Kaneko and Danushka Bollegala
- Abstract summary: Masked Language Models (MLMs) show superior performance in numerous downstream NLP tasks when used as text encoders.
We propose All Unmasked Likelihood (AUL), a bias evaluation measure that predicts all tokens in a test case.
We also propose AUL with attention weights (AULA) to evaluate tokens based on their importance in a sentence.
- Score: 28.378270372391498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked Language Models (MLMs) have shown superior performance in numerous
downstream NLP tasks when used as text encoders. Unfortunately, MLMs also
demonstrate worryingly high levels of social bias. We show that the
previously proposed evaluation metrics for quantifying the social biases in
MLMs are problematic for the following reasons: (1) the prediction accuracy of the
masked tokens itself tends to be low in some MLMs, which raises questions
regarding the reliability of the evaluation metrics that use the (pseudo)
likelihood of the predicted tokens; (2) the correlation between the
prediction accuracy of the mask and the performance in downstream NLP tasks is
not taken into consideration; and (3) high-frequency words in the training data
are masked more often, introducing noise due to this selection bias in the test
cases. To overcome these drawbacks, we propose All Unmasked
Likelihood (AUL), a bias evaluation measure that predicts all tokens in a test
case given the MLM embedding of the unmasked input. We find that AUL accurately
detects different types of biases in MLMs. We also propose AUL with attention
weights (AULA) to evaluate tokens based on their importance in a sentence.
In contrast to AUL and AULA, previously proposed bias evaluation measures for
MLMs systematically overestimate the measured biases and are heavily
influenced by the unmasked tokens in the context.
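As a rough, unofficial illustration of the idea, the sketch below scores a sentence with a HuggingFace BERT-style MLM: AUL averages the log-likelihood of every token predicted from the unmasked input, and AULA reweights the tokens by attention. The choice of model and of which attention layer to average are assumptions, not the authors' exact configuration.

```python
# Minimal sketch of AUL/AULA scoring (assumed setup, not the official code).
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased",
                                             output_attentions=True)
model.eval()

def aul_scores(sentence):
    """Return (AUL, AULA): average (attention-weighted) log-likelihood of
    each token, predicted from the *unmasked* input."""
    enc = tokenizer(sentence, return_tensors="pt")
    ids = enc["input_ids"][0]
    with torch.no_grad():
        out = model(**enc)
    log_probs = torch.log_softmax(out.logits[0], dim=-1)   # (seq, vocab)
    token_ll = log_probs[torch.arange(len(ids)), ids]      # log P(w_i | S)
    token_ll = token_ll[1:-1]                              # drop [CLS]/[SEP]
    aul = token_ll.mean().item()
    # AULA: weight tokens by importance; here, the attention each token
    # receives, averaged over the last layer's heads (an assumption).
    attn = out.attentions[-1][0].mean(dim=0)               # (seq, seq)
    weights = attn.mean(dim=0)[1:-1]
    weights = weights / weights.sum()
    aula = (weights * token_ll).sum().item()
    return aul, aula

# Bias is then read off pairwise: the fraction of stereotype/anti-stereotype
# pairs in which the stereotypical sentence receives the higher score.
print(aul_scores("The doctor finished her shift."))
```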
Related papers
- Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite the strong performance of LLM judges in many domains, their potential issues remain under-explored, undermining their reliability and the scope of their utility.
We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge.
Our work highlights the need for stakeholders to address these issues and reminds users to exercise caution in LLM-as-a-Judge applications.
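As an illustrative sketch (not the CALM implementation), one bias such a framework must quantify is position bias: whether the judge's verdict depends on the order in which answers are shown. The `judge` callable below is a hypothetical wrapper around an LLM-as-a-Judge prompt that returns "A" or "B".

```python
# Quantify position bias in an LLM judge by swapping answer order and
# counting how often the preferred *answer* changes.
def position_bias_rate(judge, pairs):
    """pairs: list of (question, answer_1, answer_2) tuples."""
    flips = 0
    for question, a1, a2 in pairs:
        v1 = judge(question, a1, a2)           # a1 shown first
        v2 = judge(question, a2, a1)           # a2 shown first
        preferred_1 = a1 if v1 == "A" else a2
        preferred_2 = a2 if v2 == "A" else a1
        flips += preferred_1 != preferred_2    # consistent judge: no flip
    return flips / len(pairs)                  # 0.0 = no position bias
```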
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
- MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.
We introduce a universal and training-free framework, MQM-APE, to enhance the quality of error annotations predicted by LLM evaluators.
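A heavily hedged sketch of the idea as the summary describes it: keep an LLM-predicted error annotation only if automatically post-editing it away actually improves the translation. The three wrappers (`llm_annotate`, `llm_post_edit`, `llm_prefers`) are hypothetical stand-ins for LLM evaluator modules, not the paper's API.

```python
# Filter MQM-style error annotations by their impact under post-editing.
def filter_annotations(source, translation,
                       llm_annotate, llm_post_edit, llm_prefers):
    kept = []
    for error in llm_annotate(source, translation):   # predicted error spans
        # Post-edit the translation to remove just this error.
        edited = llm_post_edit(source, translation, error)
        # Pairwise verification: the edit must beat the original translation.
        if llm_prefers(source, edited, translation):
            kept.append(error)
    return kept
```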
arXiv Detail & Related papers (2024-09-22T06:43:40Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
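As a simplified sketch of measuring uncertainty through generated explanations: sample several (explanation, answer) pairs and read confidence off the answer distribution. The paper's stability weighting is richer; this is a self-consistency-style proxy, and `sample_fn` is a hypothetical sampler around the LLM.

```python
# Confidence as agreement across sampled explanation-answer pairs.
from collections import Counter

def answer_confidence(sample_fn, question, k=20):
    # sample_fn(question) -> (explanation, answer); keep the answers only.
    answers = [sample_fn(question)[1] for _ in range(k)]
    counts = Counter(answers)
    answer, n = counts.most_common(1)[0]
    return answer, n / k        # empirical confidence in the top answer
```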
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Towards Probabilistically-Sound Beam Search with Masked Language Models [0.0]
Beam search with masked language models (MLMs) is challenging in part because
joint probability distributions over sequences are not readily available.
Estimating such distributions has important domain-specific applications such
as ancient text restoration and protein engineering.
Here we present probabilistically-sound methods for beam search with MLMs.
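For orientation, a naive baseline sketch (assumed setup, not the paper's method): fill a sequence of mask positions left to right, keeping the top-B partial fills scored by the MLM's conditional log-probabilities at each step.

```python
# Naive left-to-right beam search over [MASK] positions with a BERT-style MLM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mlm_beam_search(text_with_masks, beam=5):
    ids = tok(text_with_masks, return_tensors="pt")["input_ids"][0]
    beams = [(ids, 0.0)]                                # (sequence, log-prob)
    while (beams[0][0] == tok.mask_token_id).any():
        # All beams share mask positions; expand the leftmost one.
        pos = (beams[0][0] == tok.mask_token_id).nonzero()[0].item()
        candidates = []
        for seq, score in beams:
            with torch.no_grad():
                logits = mlm(seq.unsqueeze(0)).logits[0, pos]
            logp = torch.log_softmax(logits, dim=-1)
            top = torch.topk(logp, beam)
            for lp, tok_id in zip(top.values, top.indices):
                new = seq.clone()
                new[pos] = tok_id
                candidates.append((new, score + lp.item()))
        beams = sorted(candidates, key=lambda c: -c[1])[:beam]
    return tok.decode(beams[0][0], skip_special_tokens=True)

print(mlm_beam_search(f"The ruins were found in {tok.mask_token} {tok.mask_token}."))
```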
arXiv Detail & Related papers (2024-02-22T23:36:26Z)
- Measuring Social Biases in Masked Language Models by Proxy of Prediction Quality [0.0]
Social and political scientists often aim to discover and measure distinct biases from text data representations (embeddings).
In this paper, we evaluate the social biases encoded by transformers trained with a masked language modeling objective.
We find that, under our methods, some proposed measures produce more accurate estimations of the relative preference for biased sentences between transformers than others.
arXiv Detail & Related papers (2024-02-21T17:33:13Z)
- Which Syntactic Capabilities Are Statistically Learned by Masked Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval for evaluating the syntactic capabilities of masked language models for code.
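As a loose, hedged sketch of such a masking-based syntactic probe (the actual SyntaxEval procedure may differ): mask tokens of one syntactic category in a snippet and check whether a code MLM restores them. Python's `keyword` list stands in for full AST-based token selection, and CodeBERT is an assumed choice of model.

```python
# Probe: can a code MLM restore masked Python keywords?
import keyword
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
mlm = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base").eval()

def keyword_restoration_accuracy(snippet):
    hits = total = 0
    for word in snippet.split():
        if not keyword.iskeyword(word):
            continue
        masked = snippet.replace(word, tok.mask_token, 1)
        enc = tok(masked, return_tensors="pt")
        pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
        with torch.no_grad():
            pred = mlm(**enc).logits[0, pos].argmax().item()
        hits += tok.decode([pred]).strip() == word
        total += 1
    return hits / max(total, 1)

print(keyword_restoration_accuracy("for x in range(10): print(x)"))
```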
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
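A minimal sketch of reframing self-evaluation as token-level prediction: present the model's own answer back to it as a multiple-choice question and read off the probability of the "yes" token. `lm_token_probs` is a hypothetical wrapper returning next-token probabilities, not an API from the paper.

```python
# Score an answer by the model's own token-level judgment of correctness.
def self_eval_score(lm_token_probs, question, proposed_answer):
    prompt = (f"Question: {question}\n"
              f"Proposed answer: {proposed_answer}\n"
              "Is the proposed answer correct? Answer A for yes, B for no: ")
    probs = lm_token_probs(prompt)        # dict: next token -> probability
    return probs.get("A", 0.0)            # selection score in [0, 1]
```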
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Constructing Holistic Measures for Social Biases in Masked Language Models [17.45153670825904]
Masked Language Models (MLMs) have been successful in many natural language processing tasks.
Real-world stereotype biases are likely to be reflected in MLMs due to their learning from large text corpora.
Two evaluation metrics, the Kullback-Leibler Divergence Score (KLDivS) and the Jensen-Shannon Divergence Score (JSDivS), are proposed to evaluate social biases in MLMs.
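A hedged sketch of divergence-based scoring: compare the MLM's predicted token distributions at a masked position across a stereotype/anti-stereotype pair using KL and JS divergence. How the paper aggregates these over a benchmark is not reproduced here.

```python
# KL and JS divergence between MLM predictions for a paired sentence template.
import torch
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def mask_distribution(sentence):
    """Softmax distribution over the vocabulary at the [MASK] position."""
    enc = tok(sentence, return_tensors="pt")
    pos = (enc["input_ids"][0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = mlm(**enc).logits[0, pos]
    return torch.softmax(logits, dim=-1).numpy()

p = mask_distribution(f"Women are {tok.mask_token}.")
q = mask_distribution(f"Men are {tok.mask_token}.")
print("KL :", entropy(p, q))               # Kullback-Leibler divergence
print("JS :", jensenshannon(p, q) ** 2)    # squared JS distance = divergence
```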
arXiv Detail & Related papers (2023-05-12T23:09:06Z)
- Inconsistencies in Masked Language Models [20.320583166619528]
Masked language models (MLMs) can provide distributions of tokens in the masked positions in a sequence.
Distributions corresponding to different masking patterns can demonstrate considerable inconsistencies.
We propose an inference-time strategy for MLMs called Ensemble of Conditionals.
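A minimal sketch of ensembling conditionals: average the MLM's predictive distribution for one target position over several masking patterns. The choice of patterns and the plain averaging rule are assumptions; the paper's combination rule may differ.

```python
# Average P(token at target) over multiple masking patterns of the context.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def ensemble_distribution(sentence, target, extra_mask_sets):
    """extra_mask_sets: list of lists of *other* positions to also mask,
    one list per conditioning pattern."""
    base = tok(sentence, return_tensors="pt")["input_ids"][0]
    dists = []
    for extra in extra_mask_sets:
        ids = base.clone()
        ids[target] = tok.mask_token_id
        for pos in extra:                  # a different masking pattern
            ids[pos] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(ids.unsqueeze(0)).logits[0, target]
        dists.append(torch.softmax(logits, dim=-1))
    return torch.stack(dists).mean(dim=0)

# target=6 is "paris" under this tokenizer (hand-picked for the example).
dist = ensemble_distribution("The capital of France is Paris.", target=6,
                             extra_mask_sets=[[], [2], [3]])
print(tok.decode([dist.argmax().item()]))
```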
arXiv Detail & Related papers (2022-12-30T22:53:25Z)
- Debiasing isn't enough! -- On the Effectiveness of Debiasing MLMs and their Social Biases in Downstream Tasks [33.044775876807826]
We study the relationship between task-agnostic intrinsic and task-specific extrinsic social bias evaluation measures for Masked Language Models (MLMs).
We find that there exists only a weak correlation between these two types of evaluation measures.
arXiv Detail & Related papers (2022-10-06T14:08:57Z)
- Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
The Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
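A hedged sketch of the segment-wise masking idea as the summary states it: split a token sequence into non-overlapping segments and draw one masked view per segment, so the masked positions of the different views never overlap. The segment count and mask rate below are illustrative assumptions.

```python
# One masked view per non-overlapping segment of the token sequence.
import random

def fully_explored_masks(tokens, n_segments=4, mask_rate=0.15):
    seg_len = len(tokens) // n_segments
    views = []
    for s in range(n_segments):
        lo, hi = s * seg_len, (s + 1) * seg_len
        k = max(1, int(mask_rate * (hi - lo)))
        positions = set(random.sample(range(lo, hi), k))
        views.append(["[MASK]" if i in positions else t
                      for i, t in enumerate(tokens)])
    return views   # masked positions are disjoint across views

for v in fully_explored_masks("the quick brown fox jumps over the lazy dog".split(), 3):
    print(v)
```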
arXiv Detail & Related papers (2020-10-12T21:28:14Z)