Towards Reliable Assessments of Demographic Disparities in Multi-Label
Image Classifiers
- URL: http://arxiv.org/abs/2302.08572v1
- Date: Thu, 16 Feb 2023 20:34:54 GMT
- Title: Towards Reliable Assessments of Demographic Disparities in Multi-Label
Image Classifiers
- Authors: Melissa Hall, Bobbie Chern, Laura Gustafson, Denisse Ventura, Harshad
Kulkarni, Candace Ross, Nicolas Usunier
- Abstract summary: We consider multi-label image classification and, specifically, object categorization tasks.
Design choices and trade-offs for measurement involve more nuance than discussed in prior computer vision literature.
We identify several design choices that look merely like implementation details but significantly impact the conclusions of assessments.
- Score: 11.973749734226852
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Disaggregated performance metrics across demographic groups are a hallmark of
fairness assessments in computer vision. These metrics successfully
incentivized performance improvements on person-centric tasks such as face
analysis and are used to understand risks of modern models. However, there is a
lack of discussion on the vulnerabilities of these measurements for more
complex computer vision tasks. In this paper, we consider multi-label image
classification and, specifically, object categorization tasks. First, we
highlight design choices and trade-offs for measurement that involve more
nuance than discussed in prior computer vision literature. These challenges are
related to the necessary scale of data, definition of groups for images, choice
of metric, and dataset imbalances. Next, through two case studies using modern
vision models, we demonstrate that naive implementations of these assessments
are brittle. We identify several design choices that look merely like
implementation details but significantly impact the conclusions of assessments,
both in terms of magnitude and direction (on which group the classifiers work
best) of disparities. Based on ablation studies, we propose some
recommendations to increase the reliability of these assessments. Finally,
through a qualitative analysis we find that concepts with large disparities
tend to have varying definitions and representations between groups, with
inconsistencies across datasets and annotators. While this result suggests
avenues for mitigation through more consistent data collection, it also
highlights that ambiguous label definitions remain a challenge when performing
model assessments. As vision models expand and become more ubiquitous, it is
all the more important that our disparity assessments accurately reflect their
true performance.
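Concretely, a disaggregated assessment reduces to computing a metric separately per demographic group. The minimal sketch below (not the authors' code; the group names, synthetic data, and 80/20 imbalance are hypothetical placeholders) illustrates two of the design choices the abstract flags: the choice of metric (thresholded recall vs. threshold-free average precision) and the decision threshold, either of which can change which group a classifier appears to serve best.

```python
# Minimal sketch of a disaggregated assessment for a multi-label classifier.
# Everything here is synthetic/illustrative, assuming a score matrix of shape
# (n_images, n_concepts) and one group label per image.
import numpy as np

rng = np.random.default_rng(0)
n_images, n_concepts = 1000, 5

# Hypothetical, imbalanced group labels (80/20 split).
groups = rng.choice(["group_a", "group_b"], size=n_images, p=[0.8, 0.2])
# Synthetic multi-label ground truth and classifier scores.
y_true = rng.random((n_images, n_concepts)) < 0.1
y_score = np.clip(y_true * 0.6 + rng.random((n_images, n_concepts)) * 0.5, 0.0, 1.0)

def recall_at_threshold(y, s, t):
    """Micro-averaged recall over all (image, concept) pairs at threshold t."""
    tp = np.logical_and(s >= t, y).sum()
    return tp / max(y.sum(), 1)

def average_precision(y, s):
    """Threshold-free AP over flattened (image, concept) pairs."""
    order = np.argsort(-s.ravel())
    y_sorted = y.ravel()[order]
    precision = np.cumsum(y_sorted) / np.arange(1, y_sorted.size + 1)
    return (precision * y_sorted).sum() / max(y_sorted.sum(), 1)

for g in ["group_a", "group_b"]:
    mask = groups == g
    print(g,
          f"recall@0.5={recall_at_threshold(y_true[mask], y_score[mask], 0.5):.3f}",
          f"AP={average_precision(y_true[mask], y_score[mask]):.3f}")
```

Note that the smaller group contributes far fewer (image, concept) pairs, so its per-group estimates are noisier; this is the "necessary scale of data" concern the paper raises.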
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z) - VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z) - Corpus Considerations for Annotator Modeling and Scaling [9.263562546969695]
We show that the commonly used user token model consistently outperforms more complex models.
Our findings shed light on the relationship between corpus statistics and annotator modeling performance.
arXiv Detail & Related papers (2024-04-02T22:27:24Z) - Fairness meets Cross-Domain Learning: a new perspective on Models and
Metrics [80.07271410743806]
We study the relationship between cross-domain learning (CD) and model fairness.
We introduce a benchmark on face and medical images spanning several demographic groups as well as classification and localization tasks.
Our study covers 14 CD approaches alongside three state-of-the-art fairness algorithms and shows how the former can outperform the latter.
arXiv Detail & Related papers (2023-03-25T09:34:05Z) - DeAR: Debiasing Vision-Language Models with Additive Residuals [5.672132510411465]
Large pre-trained vision-language models (VLMs) provide rich, adaptable image and text representations.
These models suffer from societal biases owing to the skewed distribution of various identity groups in the training data.
We present DeAR, a novel debiasing method that learns additive residual image representations to offset the original representations.
arXiv Detail & Related papers (2023-03-18T14:57:43Z) - Fairness Increases Adversarial Vulnerability [50.90773979394264]
This paper shows the existence of a dichotomy between fairness and robustness, and analyzes when achieving fairness decreases the model robustness to adversarial samples.
Experiments on non-linear models and different architectures validate the theoretical findings in multiple vision domains.
The paper proposes a simple, yet effective, solution to construct models achieving good tradeoffs between fairness and robustness.
arXiv Detail & Related papers (2022-11-21T19:55:35Z) - Exploring the Trade-off between Plausibility, Change Intensity and
Adversarial Power in Counterfactual Explanations using Multi-objective
Optimization [73.89239820192894]
We argue that automated counterfactual generation should account for several aspects of the produced adversarial instances.
We present a novel framework for the generation of counterfactual examples.
arXiv Detail & Related papers (2022-05-20T15:02:53Z) - Fairness Indicators for Systematic Assessments of Visual Feature
Extractors [21.141633753573764]
We propose three fairness indicators, which aim at quantifying harms and biases of visual systems.
Our indicators use existing publicly available datasets collected for fairness evaluations.
These indicators are not intended to substitute for a thorough analysis of the broader impact of new computer vision technologies.
arXiv Detail & Related papers (2022-02-15T17:45:33Z) - Characterizing Fairness Over the Set of Good Models Under Selective
Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z)