Position: Measure Dataset Diversity, Don't Just Claim It
- URL: http://arxiv.org/abs/2407.08188v1
- Date: Thu, 11 Jul 2024 05:13:27 GMT
- Title: Position: Measure Dataset Diversity, Don't Just Claim It
- Authors: Dora Zhao, Jerone T. A. Andrews, Orestis Papakyriakopoulos, Alice Xiang
- Abstract summary: Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets.
Despite their prevalence, these terms lack clear definitions and validation.
Our research explores the implications of this issue by analyzing "diversity" across 135 image and text datasets.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning (ML) datasets, often perceived as neutral, inherently encapsulate abstract and disputed social constructs. Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets. Despite their prevalence, these terms lack clear definitions and validation. Our research explores the implications of this issue by analyzing "diversity" across 135 image and text datasets. Drawing from social sciences, we apply principles from measurement theory to identify considerations and offer recommendations for conceptualizing, operationalizing, and evaluating diversity in datasets. Our findings have broader implications for ML research, advocating for a more nuanced and precise approach to handling value-laden properties in dataset construction.
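The abstract argues for conceptualizing and then operationalizing "diversity" as a measurable property rather than an unvalidated claim. As a minimal sketch of what one such operationalization could look like (not the paper's own method), the snippet below treats diversity over a single annotated attribute as the normalized Shannon entropy of its label distribution; the attribute values are hypothetical.

```python
from collections import Counter
from math import log

def shannon_diversity(labels):
    """Shannon entropy (in nats) of the label distribution.

    Higher values indicate a more even spread across categories;
    0 means every item shares one label.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * log(c / total) for c in counts.values())

def normalized_diversity(labels):
    """Entropy divided by log(k), so 1.0 means a perfectly even
    spread over the k observed categories."""
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        return 0.0
    return shannon_diversity(labels) / log(k)

# Hypothetical skin-tone annotations for an image dataset:
annotations = ["light"] * 80 + ["medium"] * 15 + ["dark"] * 5
print(round(normalized_diversity(annotations), 3))  # → 0.558
```

A single-attribute entropy score is of course only one narrow operationalization; the paper's point is that whichever definition a curator adopts should be stated explicitly and validated, not left implicit.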
Related papers
- CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models [58.57987316300529]
Large Language Models (LLMs) are increasingly deployed to handle various natural language processing (NLP) tasks.
To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets.
We propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks.
arXiv Detail & Related papers (2024-07-02T16:31:37Z)
- Shifts 2.0: Extending The Dataset of Real Distributional Shifts [25.31085238930148]
We extend the Shifts dataset with two datasets sourced from industrial, high-risk applications of high societal importance.
We consider the tasks of segmentation of white matter Multiple Sclerosis lesions in 3D magnetic resonance brain images and the estimation of power consumption in marine cargo vessels.
These new datasets will allow researchers to further explore robust generalization and uncertainty estimation in new situations.
arXiv Detail & Related papers (2022-06-30T16:51:52Z)
- Assessing Demographic Bias Transfer from Dataset to Model: A Case Study in Facial Expression Recognition [1.5340540198612824]
Two metrics focus on the representational and stereotypical bias of the dataset, and the third one on the residual bias of the trained model.
We demonstrate the usefulness of the metrics by applying them to a FER problem based on the popular Affectnet dataset.
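The exact metrics from that paper are not reproduced here, but a common formulation of representational bias is the distance between a dataset's observed subgroup frequencies and a reference distribution (uniform, or population-level). A minimal sketch using total variation distance, with hypothetical expression labels:

```python
from collections import Counter

def representational_bias(group_labels, reference=None):
    """Total variation distance between the observed subgroup
    distribution and a reference distribution (uniform by default).

    0.0 means the observed frequencies match the reference exactly;
    values approach 1.0 as some subgroups vanish from the dataset.
    """
    counts = Counter(group_labels)
    total = sum(counts.values())
    groups = set(counts) | set(reference or {})
    if reference is None:
        reference = {g: 1 / len(groups) for g in groups}
    return 0.5 * sum(
        abs(counts.get(g, 0) / total - reference.get(g, 0.0))
        for g in groups
    )

# Hypothetical facial-expression labels, heavily skewed toward "happy":
labels = ["happy"] * 70 + ["sad"] * 20 + ["angry"] * 10
print(round(representational_bias(labels), 3))  # → 0.367
```

Passing an explicit `reference` distribution lets the same function measure bias against population demographics instead of uniformity, which is often the more meaningful baseline.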
arXiv Detail & Related papers (2022-05-20T09:40:42Z)
- Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation [7.480972965984986]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: who the annotator is, and how the annotators' lived experiences can impact their annotations.
We put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline.
arXiv Detail & Related papers (2021-12-08T19:56:56Z)
- Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks [44.61070965407907]
Given the current state of the field, a standardized large-scale dataset of tasks across a range of modalities affected by distributional shifts is necessary.
We propose the Shifts dataset for the evaluation of uncertainty estimates and robustness to distributional shift.
arXiv Detail & Related papers (2021-07-15T16:59:34Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments, and that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
- REVISE: A Tool for Measuring and Mitigating Bias in Visual Datasets [64.76453161039973]
REVISE (REvealing VIsual biaSEs) is a tool that assists in the investigation of a visual dataset.
It surfaces potential biases along three dimensions: (1) object-based, (2) person-based, and (3) geography-based.
arXiv Detail & Related papers (2020-04-16T23:54:37Z)
- A Philosophy of Data [91.3755431537592]
We work from the fundamental properties necessary for statistical computation to a definition of statistical data.
We argue that the need for useful data to be commensurable rules out an understanding of properties as fundamentally unique or equal.
With our increasing reliance on data and data technologies, these two characteristics of data affect our collective conception of reality.
arXiv Detail & Related papers (2020-04-15T14:47:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.