Measuring Disparate Outcomes of Content Recommendation Algorithms with
Distributional Inequality Metrics
- URL: http://arxiv.org/abs/2202.01615v1
- Date: Thu, 3 Feb 2022 14:41:39 GMT
- Title: Measuring Disparate Outcomes of Content Recommendation Algorithms with
Distributional Inequality Metrics
- Authors: Tomo Lazovich, Luca Belli, Aaron Gonzales, Amanda Bower, Uthaipon
Tantipongpipat, Kristian Lum, Ferenc Huszar, Rumman Chowdhury
- Abstract summary: We evaluate a set of metrics originating from economics, known as distributional inequality metrics, and their ability to measure disparities in content exposure in the Twitter algorithmic timeline.
We show that we can use these metrics to identify content suggestion algorithms that contribute more strongly to skewed outcomes between users.
- Score: 5.74271110290378
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The harmful impacts of algorithmic decision systems have recently come into
focus, with many examples of systems such as machine learning (ML) models
amplifying existing societal biases. Most metrics attempting to quantify
disparities resulting from ML algorithms focus on differences between groups,
dividing users based on demographic identities and comparing model performance
or overall outcomes between these groups. However, in industry settings, such
information is often not available, and inferring these characteristics carries
its own risks and biases. Moreover, typical metrics that focus on a single
classifier's output ignore the complex network of systems that produce outcomes
in real-world settings. In this paper, we evaluate a set of metrics originating
from economics, known as distributional inequality metrics, and their ability to
measure disparities in content exposure in a production recommendation system, the
Twitter algorithmic timeline. We define desirable criteria for metrics to be
used in an operational setting, specifically by ML practitioners. We
characterize different types of engagement with content on Twitter using these
metrics, and use these results to evaluate the metrics with respect to the
desired criteria. We show that we can use these metrics to identify content
suggestion algorithms that contribute more strongly to skewed outcomes between
users. Overall, we conclude that these metrics can be useful tools for
understanding disparate outcomes in online social networks.
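As a concrete illustration of the kind of measurement the abstract describes, a distributional inequality metric can be computed directly over per-user exposure counts, with no demographic labels required. The abstract does not specify which metrics or data the paper uses, so the sketch below is an assumption for illustration: it uses the Gini coefficient, a standard inequality metric from economics, with purely hypothetical impression counts, and assumes only NumPy.

    # Minimal sketch (not the paper's implementation): Gini coefficient over
    # per-user content exposure. Toy impression counts are illustrative only.
    import numpy as np

    def gini(exposures) -> float:
        """Gini coefficient of a non-negative array (0 = perfect equality)."""
        x = np.sort(np.asarray(exposures, dtype=float))
        n = x.size
        if n == 0 or x.sum() == 0.0:
            return 0.0
        # G = (2 * sum_i i * x_(i)) / (n * sum_i x_(i)) - (n + 1) / n, i = 1..n
        i = np.arange(1, n + 1)
        return float(2.0 * np.sum(i * x) / (n * x.sum()) - (n + 1.0) / n)

    # Hypothetical per-user impression counts under two candidate ranking algorithms.
    exposure_a = np.array([10, 12, 9, 11, 8])  # relatively even exposure
    exposure_b = np.array([1, 2, 3, 4, 40])    # highly skewed exposure

    print(f"Gini, algorithm A: {gini(exposure_a):.3f}")  # ~0.08
    print(f"Gini, algorithm B: {gini(exposure_b):.3f}")  # ~0.64, much more skewed

A higher Gini value indicates that exposure is concentrated on fewer users, which is the kind of skewed outcome the paper attributes to specific content suggestion algorithms.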
Related papers
- Comprehensive Equity Index (CEI): Definition and Application to Bias Evaluation in Biometrics [47.762333925222926] (arXiv, 2024-09-03T14:19:38Z)
  We present a novel metric to quantify biased behaviors of machine learning models.
  We focus on and apply it to the operational evaluation of face recognition systems.
- Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content [26.409490082213445] (arXiv, 2024-08-30T21:54:13Z)
  Cosine similarity between two documents can be computed using token embeddings formed by Large Language Models (LLMs) such as GPT-4.
  This similarity metric is beneficial in that it takes into account individual biases and constraints in a manner that is grounded in the cognitive mechanisms of decision making.
  This dataset is used to demonstrate the benefits of leveraging a cognitive model to measure the subjective similarity of human participants in an educational setting.
- Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444] (arXiv, 2024-01-29T17:17:42Z)
  We introduce ACES, a contrastive challenge set spanning 146 language pairs.
  This dataset aims to discover whether metrics can identify 68 translation accuracy errors.
  We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
- Truthful Meta-Explanations for Local Interpretability of Machine Learning Models [10.342433824178825] (arXiv, 2022-12-07T08:32:04Z)
  We present a local meta-explanation technique that builds on the truthfulness metric, a faithfulness-based metric.
  We demonstrate the effectiveness of both the technique and the metric by concretely defining all the concepts and through experimentation.
- Analysis and Comparison of Classification Metrics [12.092755413404245] (arXiv, 2022-09-12T16:06:10Z)
  Metrics for measuring the quality of system scores include the area under the ROC curve, equal error rate, cross-entropy, Brier score, and Bayes EC or Bayes risk.
  We show how to use these metrics to compute a system's calibration loss and compare this metric with the widely used expected calibration error (ECE); a minimal ECE sketch is given after this list.
- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics [64.81682222169113] (arXiv, 2022-04-21T15:52:14Z)
  System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality.
  We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice.
- Estimation of Fair Ranking Metrics with Incomplete Judgments [70.37717864975387] (arXiv, 2021-08-11T10:57:00Z)
  We propose a sampling strategy and estimation technique for four fair ranking metrics.
  We formulate a robust and unbiased estimator that can operate even with a very limited number of labeled items.
- The Benchmark Lottery [114.43978017484893] (arXiv, 2021-07-14T21:08:30Z)
  The "benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
  We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
- Online Learning Demands in Max-min Fairness [91.37280766977923] (arXiv, 2020-12-15T22:15:20Z)
  We describe mechanisms for allocating a scarce resource among multiple users in a way that is efficient, fair, and strategy-proof.
  The mechanism is repeated for multiple rounds, and a user's requirements can change on each round.
  At the end of each round, users provide feedback about the allocation they received, enabling the mechanism to learn user preferences over time.
- Interpretable Assessment of Fairness During Model Evaluation [1.2183405753834562] (arXiv, 2020-10-26T02:31:17Z)
  We introduce a novel hierarchical clustering algorithm to detect heterogeneity among users in given sets of sub-populations.
  We demonstrate the performance of the algorithm on real data from LinkedIn.
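For the "Analysis and Comparison of Classification Metrics" entry above, here is a minimal sketch of the widely used expected calibration error (ECE). The equal-width binning scheme and the toy predictions are assumptions for illustration; that paper's own calibration-loss formulation is not reproduced here.

    # Minimal ECE sketch with equal-width confidence bins (illustrative only).
    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=10):
        """ECE = sum over bins of (|bin| / N) * |accuracy(bin) - mean confidence(bin)|."""
        confidences = np.asarray(confidences, dtype=float)
        correct = np.asarray(correct, dtype=float)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap  # in_bin.mean() == |bin| / N
        return float(ece)

    # Hypothetical classifier outputs: predicted confidence and whether the prediction was correct.
    conf = [0.95, 0.90, 0.80, 0.75, 0.60, 0.55]
    hit = [1, 1, 0, 1, 0, 1]
    print(f"ECE: {expected_calibration_error(conf, hit, n_bins=5):.3f}")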