Measuring Disparate Outcomes of Content Recommendation Algorithms with
Distributional Inequality Metrics
- URL: http://arxiv.org/abs/2202.01615v1
- Date: Thu, 3 Feb 2022 14:41:39 GMT
- Title: Measuring Disparate Outcomes of Content Recommendation Algorithms with
Distributional Inequality Metrics
- Authors: Tomo Lazovich, Luca Belli, Aaron Gonzales, Amanda Bower, Uthaipon
Tantipongpipat, Kristian Lum, Ferenc Huszar, Rumman Chowdhury
- Abstract summary: We evaluate distributional inequality metrics, a family of metrics originating from economics, and their ability to measure disparities in content exposure in the Twitter algorithmic timeline.
We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users.
- Score: 5.74271110290378
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The harmful impacts of algorithmic decision systems have recently come into
focus, with many examples of systems such as machine learning (ML) models
amplifying existing societal biases. Most metrics attempting to quantify
disparities resulting from ML algorithms focus on differences between groups,
dividing users based on demographic identities and comparing model performance
or overall outcomes between these groups. However, in industry settings, such
information is often not available, and inferring these characteristics carries
its own risks and biases. Moreover, typical metrics that focus on a single
classifier's output ignore the complex network of systems that produce outcomes
in real-world settings. In this paper, we evaluate distributional inequality
metrics, a set of metrics originating from economics, for their ability to
measure disparities in content exposure in a production recommendation system,
the Twitter algorithmic timeline. We define desirable criteria for metrics to be
used in an operational setting, specifically by ML practitioners. We
characterize different types of engagement with content on Twitter using these
metrics, and use these results to evaluate the metrics with respect to the
desired criteria. We show that we can use these metrics to identify content
suggestion algorithms that contribute more strongly to skewed outcomes between
users. Overall, we conclude that these metrics can be useful tools for
understanding disparate outcomes in online social networks.
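Distributional inequality metrics of the kind evaluated in the paper include measures such as the Gini coefficient. As an illustration only (the standard sorted-index formula, not the paper's implementation), a minimal sketch in Python, treating per-user content exposure as the quantity of interest:

```python
def gini(values):
    """Gini coefficient of non-negative quantities (e.g. per-user exposure).

    0.0 means perfect equality; the value approaches 1.0 as exposure
    concentrates on a single user.
    """
    xs = sorted(float(v) for v in values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

# Equal exposure across users vs. exposure concentrated on one user:
print(gini([10, 10, 10, 10]))  # 0.0
print(gini([0, 0, 0, 40]))     # 0.75
```

Comparing such a statistic across candidate content-suggestion algorithms is one way a metric of this family could flag which algorithm contributes most to skewed exposure, in the spirit of the analysis described above.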
Related papers
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 categories of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z)
- Truthful Meta-Explanations for Local Interpretability of Machine Learning Models [10.342433824178825]
We present a local meta-explanation technique which builds on top of the truthfulness metric, which is a faithfulness-based metric.
We demonstrate the effectiveness of both the technique and the metric by concretely defining all the concepts and through experimentation.
arXiv Detail & Related papers (2022-12-07T08:32:04Z)
- Analysis and Comparison of Classification Metrics [12.092755413404245]
Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk.
We show how to use these metrics to compute a system's calibration loss and compare this metric with the widely-used expected calibration error (ECE).
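The expected calibration error (ECE) mentioned above is commonly computed by binning predictions by confidence. As a hedged sketch (a generic equal-width-bin implementation, not necessarily the formulation compared in the paper):

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over equal-width confidence bins: the weighted average absolute
    gap between mean predicted probability and empirical accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    n = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        conf = sum(p for p, _ in bucket) / len(bucket)
        acc = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(conf - acc)
    return ece

# Mean confidence 0.9 but empirical accuracy 1.0, so ECE is about 0.1:
print(expected_calibration_error([0.9, 0.9], [1, 1]))
```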
arXiv Detail & Related papers (2022-09-12T16:06:10Z)
- Classification Performance Metric Elicitation and its Applications [5.5637552942511155]
Despite its practical interest, there is limited formal guidance on how to select metrics for machine learning applications.
This thesis outlines metric elicitation as a principled framework for selecting the performance metric that best reflects implicit user preferences.
arXiv Detail & Related papers (2022-08-19T03:57:17Z)
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113]
System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
arXiv Detail & Related papers (2022-04-21T15:52:14Z)
- Measuring Fairness Under Unawareness of Sensitive Attributes: A Quantification-Based Approach [131.20444904674494]
We tackle the problem of measuring group fairness under unawareness of sensitive attributes.
We show that quantification approaches are particularly suited to tackle the fairness-under-unawareness problem.
arXiv Detail & Related papers (2021-09-17T13:45:46Z)
- Estimation of Fair Ranking Metrics with Incomplete Judgments [70.37717864975387]
We propose a sampling strategy and estimation technique for four fair ranking metrics.
We formulate a robust and unbiased estimator which can operate even with very limited number of labeled items.
arXiv Detail & Related papers (2021-08-11T10:57:00Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- Online Learning Demands in Max-min Fairness [91.37280766977923]
We describe mechanisms for the allocation of a scarce resource among multiple users in a way that is efficient, fair, and strategy-proof.
The mechanism is repeated for multiple rounds and a user's requirements can change on each round.
At the end of each round, users provide feedback about the allocation they received, enabling the mechanism to learn user preferences over time.
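The classic procedure for computing a max-min fair allocation with known demands is progressive filling. As an illustrative sketch only (the generic textbook algorithm for a single round, not the learning mechanism proposed in the paper):

```python
def max_min_allocate(capacity, demands):
    """Max-min fair allocation via progressive filling: repeatedly split the
    remaining capacity equally among users whose demand is not yet met."""
    alloc = {u: 0.0 for u in demands}
    active = set(demands)
    remaining = capacity
    while active and remaining > 1e-12:
        share = remaining / len(active)
        for u in active:
            grant = min(share, demands[u] - alloc[u])
            alloc[u] += grant
            remaining -= grant
        # Users whose demand is met drop out; leftover capacity is re-split.
        active = {u for u in active if demands[u] - alloc[u] > 1e-12}
    return alloc

# Small demands are fully satisfied; the rest share the remainder equally:
print(max_min_allocate(10.0, {"a": 2.0, "b": 4.0, "c": 10.0}))
```

Each unsatisfied user's allocation grows at the same rate until either the capacity is exhausted or the user's demand is met, which is what makes the resulting allocation max-min fair.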
arXiv Detail & Related papers (2020-12-15T22:15:20Z)
- Interpretable Assessment of Fairness During Model Evaluation [1.2183405753834562]
We introduce a novel hierarchical clustering algorithm to detect heterogeneity among users in given sets of sub-populations.
We demonstrate the performance of the algorithm on real data from LinkedIn.
arXiv Detail & Related papers (2020-10-26T02:31:17Z)
- Fairness Metrics: A Comparative Analysis [1.7188280334580195]
We describe some of the most widely used fairness metrics using a common mathematical framework and present new results on the relationships among them.
Results presented herein can help place both specialists and non-specialists in a better position to identify the metric best suited for their application and goals.
arXiv Detail & Related papers (2020-01-22T03:27:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.