Are Bias Evaluation Methods Biased?
- URL: http://arxiv.org/abs/2506.17111v1
- Date: Fri, 20 Jun 2025 16:11:25 GMT
- Title: Are Bias Evaluation Methods Biased?
- Authors: Lina Berrayana, Sean Rooney, Luis Garcés-Erice, Ioana Giurgiu
- Abstract summary: The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and comparing how similar the overall rankings are.
- Score: 3.9748528039819977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The creation of benchmarks to evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared on different aspects of safety such as toxicity, bias, and harmful behavior. Independent benchmarks adopt different approaches with distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approaches to rank a set of representative models for bias and comparing how similar the overall rankings are. We show that different but widely used bias evaluation methods result in disparate model rankings. We conclude with recommendations for the community on the usage of such benchmarks.
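The robustness check described in the abstract, ranking the same set of models with different bias evaluation methods and measuring how much the rankings agree, can be illustrated with a rank-correlation computation. The sketch below is not the authors' code; the model names and scores are invented, and it assumes `scipy` is available.

```python
# A minimal sketch (not the authors' code): compare model rankings produced by
# two hypothetical bias evaluation methods using Kendall's tau rank correlation.
# Model names and scores below are invented for illustration only.
from scipy.stats import kendalltau

# Bias scores from two different evaluation methods (higher = more biased).
scores_method_a = {"model-1": 0.42, "model-2": 0.31, "model-3": 0.55, "model-4": 0.18}
scores_method_b = {"model-1": 0.29, "model-2": 0.47, "model-3": 0.51, "model-4": 0.35}

models = sorted(scores_method_a)  # fix a common model order

# kendalltau only depends on the relative ordering of the values, so the score
# vectors can be passed directly without converting them to explicit ranks.
tau, p_value = kendalltau(
    [scores_method_a[m] for m in models],
    [scores_method_b[m] for m in models],
)

print(f"Kendall's tau between the two rankings: {tau:.2f} (p={p_value:.2f})")
# A tau close to 1 means the two bias evaluation methods rank the models almost
# identically; a tau near 0 or negative indicates disparate rankings, which is
# the phenomenon the paper reports.
```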
Related papers
- Where is this coming from? Making groundedness count in the evaluation of Document VQA models [12.951716701565019]
We argue that common evaluation metrics do not account for the semantic and multimodal groundedness of a model's outputs. We propose a new evaluation methodology that accounts for the groundedness of predictions. Our proposed methodology is parameterized in such a way that users can configure the score according to their preferences.
arXiv Detail & Related papers (2025-03-24T20:14:46Z) - Rethinking Relation Extraction: Beyond Shortcuts to Generalization with a Debiased Benchmark [53.876493664396506]
Benchmarks are crucial for evaluating machine learning algorithm performance, facilitating comparison and identifying superior solutions. This paper addresses the issue of entity bias in relation extraction tasks, where models tend to rely on entity mentions rather than context. We propose a debiased relation extraction benchmark, DREB, that breaks the pseudo-correlation between entity mentions and relation types through entity replacement. To establish a new baseline on DREB, we introduce MixDebias, a debiasing method combining data-level and model training-level techniques.
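The entity-replacement idea behind a debiased benchmark like DREB can be sketched as follows. This is a hypothetical illustration, not code from the paper: the sentence, entity pool, and type labels are invented, and the actual benchmark construction is more involved.

```python
# A hypothetical sketch of entity replacement for debiasing relation extraction:
# swap entity mentions for other entities of the same type so a model can no
# longer rely on memorized entity-relation correlations. Not the DREB code; the
# entity pool and example sentence are invented for illustration.
import random

ENTITY_POOL = {
    "PERSON": ["Ada Lovelace", "Grace Hopper", "Alan Turing"],
    "ORG": ["Acme Corp", "Globex", "Initech"],
}

def replace_entities(sentence, entities):
    """Replace each (mention, type) pair with a random same-type entity."""
    for mention, etype in entities:
        candidates = [e for e in ENTITY_POOL[etype] if e != mention]
        sentence = sentence.replace(mention, random.choice(candidates))
    return sentence

original = "Ada Lovelace joined Acme Corp as a researcher."
debiased = replace_entities(original, [("Ada Lovelace", "PERSON"), ("Acme Corp", "ORG")])
print(debiased)
# The relation (employment, expressed by "joined") is still recoverable from
# the context, but the specific entity mentions no longer give the answer away.
```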
arXiv Detail & Related papers (2025-01-02T17:01:06Z) - OffsetBias: Leveraging Debiased Data for Tuning Evaluators [1.5790747258969664]
We qualitatively identify six types of biases inherent in various judge models.
Fine-tuning on our dataset significantly enhances the robustness of judge models against biases.
arXiv Detail & Related papers (2024-07-09T05:16:22Z) - Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods [49.62131719441252]
Attribution methods compute importance scores for input features to explain the output predictions of deep models.
In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill.
We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
arXiv Detail & Related papers (2024-05-02T13:48:37Z) - COBIAS: Assessing the Contextual Reliability of Bias Benchmarks for Language Models [14.594920595573038]
Large Language Models (LLMs) often inherit biases from the web data they are trained on, which contains stereotypes and prejudices. Current methods for evaluating and mitigating these biases rely on bias-benchmark datasets. We introduce a contextual reliability framework, which evaluates model robustness to biased statements by considering the various contexts in which they may appear.
arXiv Detail & Related papers (2024-02-22T10:46:11Z) - Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvements in model capacity.
To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs.
We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z) - Characterizing Fairness Over the Set of Good Models Under Selective Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
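The core observation of this entry, that models with nearly identical overall performance can differ widely in group-level disparity, can be illustrated with a toy computation. The sketch below is not the paper's algorithm; the data is synthetic and the "set of good models" is just a family of regularized classifiers within a small accuracy tolerance of the best one.

```python
# A minimal sketch (not the paper's algorithm): among models whose overall
# accuracy is within a small tolerance of the best model, group-level
# disparities can still vary. All data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                        # binary sensitive attribute
x = rng.normal(size=(n, 5)) + group[:, None] * 0.3   # features mildly correlated with group
y = (x[:, 0] + 0.5 * x[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Train a family of similarly-performing models by varying regularization.
accuracies, disparities = [], []
for c in [0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0]:
    clf = LogisticRegression(C=c, max_iter=1000).fit(x, y)
    pred = clf.predict(x)
    accuracies.append((pred == y).mean())
    # Demographic parity gap: difference in positive prediction rates by group.
    disparities.append(abs(pred[group == 0].mean() - pred[group == 1].mean()))

best = max(accuracies)
eps = 0.01  # tolerance defining the "set of good models"
good = [d for a, d in zip(accuracies, disparities) if a >= best - eps]
print(f"Disparity range over the set of good models: {min(good):.3f} to {max(good):.3f}")
```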
arXiv Detail & Related papers (2021-01-02T02:11:37Z) - LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model.
We propose LOGAN, a new bias detection technique based on clustering.
Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
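A clustering-based local bias check in the spirit of LOGAN can be sketched as follows. This is not the authors' implementation: the features, labels, predictions, and group attribute are synthetic, and the injected error pattern exists only to make the example concrete.

```python
# A hedged sketch of clustering-based local bias detection: cluster examples in
# feature space, then check each cluster for a performance gap between
# demographic groups that would be invisible in corpus-level aggregates.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 1000
features = rng.normal(size=(n, 16))      # stand-in for sentence embeddings
labels = rng.integers(0, 2, n)           # gold labels for a toxicity classifier
preds = labels.copy()
group = rng.integers(0, 2, n)            # demographic group of each example
# Inject a localized error pattern for one group (illustration only).
local = features[:, 0] > 1.0
preds[local & (group == 1)] = 1 - labels[local & (group == 1)]

clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(features)

for c in range(8):
    idx = clusters == c
    accs = []
    for g in (0, 1):
        mask = idx & (group == g)
        if mask.sum() > 10:
            accs.append((preds[mask] == labels[mask]).mean())
    if len(accs) == 2 and abs(accs[0] - accs[1]) > 0.1:
        print(f"cluster {c}: accuracy gap {abs(accs[0] - accs[1]):.2f} "
              f"(possible local bias not visible at the corpus level)")
```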
arXiv Detail & Related papers (2020-10-06T16:42:51Z) - On the Ambiguity of Rank-Based Evaluation of Entity Alignment or Link Prediction Methods [27.27230441498167]
We take a closer look at the evaluation of two families of methods for enriching information from knowledge graphs: Link Prediction and Entity Alignment.
In particular, we demonstrate that all existing scores can hardly be used to compare results across different datasets.
We show that this leads to various problems in the interpretation of results, which may support misleading conclusions.
arXiv Detail & Related papers (2020-02-17T12:26:14Z)
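The comparability problem raised by the last entry can be made concrete with a small simulation: rank-based scores such as mean rank depend on the size of the candidate set, so the same relative model quality yields very different numbers on different datasets. The sketch below is not the paper's code; the rank distribution is synthetic.

```python
# A minimal sketch (not the paper's code) of why rank-based scores are hard to
# compare across datasets: mean rank grows with the number of candidates even
# when the model's relative performance is unchanged. Ranks are synthetic.
import numpy as np

def mean_rank(ranks):
    return float(np.mean(ranks))

def mrr(ranks):
    return float(np.mean(1.0 / np.asarray(ranks)))

def hits_at_k(ranks, k=10):
    return float(np.mean(np.asarray(ranks) <= k))

rng = np.random.default_rng(0)
for num_candidates in (100, 10_000):
    # Simulate a model that always ranks the correct entity within the top 10%.
    ranks = rng.integers(1, max(2, num_candidates // 10), size=1000)
    print(f"{num_candidates:>6} candidates: "
          f"MR={mean_rank(ranks):8.1f}  MRR={mrr(ranks):.3f}  Hits@10={hits_at_k(ranks):.3f}")
# Mean rank differs by orders of magnitude between the two settings although the
# simulated model has the same relative performance, illustrating why such
# scores should not be compared across datasets.
```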