Related papers: Comparative Separation: Evaluating Separation on Comparative Judgment Test Data

URL: http://arxiv.org/abs/2601.06761v1
Date: Sun, 11 Jan 2026 03:39:45 GMT
Title: Comparative Separation: Evaluating Separation on Comparative Judgment Test Data
Authors: Xiaoyin Xi, Neeku Capak, Kate Stockwell, Zhe Yu
Abstract summary: This research seeks to benefit the software engineering society by proposing comparative separation. We show that in binary classification problems, comparative separation is equivalent to separation.
Score: 1.9729979239580642
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This research seeks to benefit the software engineering society by proposing comparative separation, a novel group fairness notion to evaluate the fairness of machine learning software on comparative judgment test data. Fairness issues have attracted increasing attention since machine learning software is increasingly used for high-stakes and high-risk decisions. It is the responsibility of all software developers to make their software accountable by ensuring that the machine learning software does not perform differently on different sensitive groups, that is, that it satisfies the separation criterion. However, evaluation of separation requires ground truth labels for each test data point. This motivates our work on analyzing whether separation can be evaluated on comparative judgment test data. Instead of asking humans to provide ratings or categorical labels on each test data point, comparative judgments are made between pairs of data points, such as "A is better than B". According to the law of comparative judgment, providing such comparative judgments yields a lower cognitive burden for humans than providing ratings or categorical labels. This work first defines the novel fairness notion of comparative separation on comparative judgment test data, and the metrics to evaluate comparative separation. Then, both theoretically and empirically, we show that in binary classification problems, comparative separation is equivalent to separation. Lastly, we analyze the number of test data points and test data pairs required to achieve the same level of statistical power in the evaluation of separation and comparative separation, respectively. This work is the first to explore fairness evaluation on comparative judgment test data. It shows the feasibility and the practical benefits of using comparative judgment test data for model evaluations.
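
As a rough illustration of the separation criterion the abstract refers to, the sketch below checks the standard label-based form for binary classification: true positive and false positive rates should match across sensitive groups. It is not the paper's comparative separation metric, which the abstract does not spell out. The function names `group_rates` and `separation_gaps` and the toy data are assumptions made for this example.

```python
from collections import defaultdict


def group_rates(y_true, y_pred, groups):
    """Return {group: (TPR, FPR)} for binary labels and predictions.

    A rate is None when its denominator is zero for that group.
    Hypothetical helper, not taken from the paper.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for t, p, g in zip(y_true, y_pred, groups):
        if t == 1:
            counts[g]["tp" if p == 1 else "fn"] += 1
        else:
            counts[g]["fp" if p == 1 else "tn"] += 1

    rates = {}
    for g, c in counts.items():
        pos = c["tp"] + c["fn"]  # ground-truth positives in this group
        neg = c["fp"] + c["tn"]  # ground-truth negatives in this group
        tpr = c["tp"] / pos if pos else None
        fpr = c["fp"] / neg if neg else None
        rates[g] = (tpr, fpr)
    return rates


def separation_gaps(y_true, y_pred, groups, group_a, group_b):
    """Return (TPR gap, FPR gap) between two groups, as a minus b.

    Separation holds exactly when both gaps are zero. This sketch assumes
    both groups contain positive and negative examples.
    """
    rates = group_rates(y_true, y_pred, groups)
    tpr_a, fpr_a = rates[group_a]
    tpr_b, fpr_b = rates[group_b]
    return tpr_a - tpr_b, fpr_a - fpr_b


# Toy data (made up for illustration): four test points per group.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

tpr_gap, fpr_gap = separation_gaps(y_true, y_pred, groups, "a", "b")
print(tpr_gap, fpr_gap)
```

Note that this check needs a ground truth label for every test point, which is the requirement the abstract identifies as the motivation for evaluating on comparative judgments instead.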

Related papers

Understanding challenges to the interpretation of disaggregated evaluations of algorithmic fairness [49.35494016290887]
We show that equal performance across subgroups is an unreliable measure of fairness when data are representative of relevant populations but reflective of real-world disparities. Our framework suggests complementing disaggregated evaluations with explicit causal assumptions and analysis to control for confounding and distribution shift.
arXiv Detail & Related papers (2025-06-04T17:40:31Z)
Whence Is A Model Fair? Fixing Fairness Bugs via Propensity Score Matching [0.49157446832511503]
We investigate whether the way training and testing data are sampled affects the reliability of fairness metrics. Since training and test sets are often randomly sampled from the same population, bias present in the training data may still exist in the test data. We propose FairMatch, a post-processing method that applies propensity score matching to evaluate and mitigate bias.
arXiv Detail & Related papers (2025-04-23T19:28:30Z)
Judgment2vec: Apply Graph Analytics to Searching and Recommendation of Similar Judgments [0.0]
In court practice, legal professionals rely on their training to provide opinions that resolve cases. Finding a similar case is challenging and often depends on experience, legal domain knowledge, and extensive labor hours. This research aims to automate the analysis of judgment text similarity.
arXiv Detail & Related papers (2024-08-08T11:37:32Z)
Efficient LLM Comparative Assessment: a Product of Experts Framework for Pairwise Comparisons [10.94304714004328]
This paper introduces a Product of Experts (PoE) framework for efficient Comparative Assessment. Individual comparisons are considered experts that provide information on a pair's score difference. The PoE framework combines the information from these experts to yield an expression that can be maximized with respect to the underlying set of candidates.
arXiv Detail & Related papers (2024-05-09T16:45:27Z)
Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
We propose a model agnostic post-processing framework xOrder for achieving fairness in bipartite ranking. xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics. We evaluate our proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
arXiv Detail & Related papers (2023-07-27T07:42:44Z)
LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models [55.60306377044225]
Large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks. This paper explores two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment. For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring.
arXiv Detail & Related papers (2023-07-15T22:02:12Z)
Leveraging Human Feedback to Scale Educational Datasets: Combining Crowdworkers and Comparative Judgement [0.0]
This paper reports on two experiments investigating using non-expert crowdworkers and comparative judgement to evaluate student data. We found that using comparative judgement substantially improved inter-rater reliability on both tasks.
arXiv Detail & Related papers (2023-05-22T10:22:14Z)
Error Parity Fairness: Testing for Group Fairness in Regression Tasks [5.076419064097733]
This work presents error parity as a regression fairness notion and introduces a testing methodology to assess group fairness. It is followed by a suitable permutation test to compare groups on several statistics to explore disparities and identify impacted groups. Overall, the proposed regression fairness testing methodology fills a gap in the fair machine learning literature and may serve as a part of larger accountability assessments and algorithm audits.
arXiv Detail & Related papers (2022-08-16T17:47:20Z)
Doing Great at Estimating CATE? On the Neglected Assumptions in Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading. We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators. We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z)
Two-Sample Testing on Ranked Preference Data and the Role of Modeling Assumptions [57.77347280992548]
In this paper, we design two-sample tests for pairwise comparison data and ranking data. Our test requires essentially no assumptions on the distributions. By applying our two-sample test on real-world pairwise comparison data, we conclude that ratings and rankings provided by people are indeed distributed differently.
arXiv Detail & Related papers (2020-06-21T20:51:09Z)
Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data. There have been rising concerns on whether the learned scoring function can cause systematic disparity across different protected groups. We propose a model post-processing framework for balancing them in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z)