Whose Ground Truth? Accounting for Individual and Collective Identities
Underlying Dataset Annotation
- URL: http://arxiv.org/abs/2112.04554v1
- Date: Wed, 8 Dec 2021 19:56:56 GMT
- Title: Whose Ground Truth? Accounting for Individual and Collective Identities
Underlying Dataset Annotation
- Authors: Emily Denton, Mark Díaz, Ian Kivlichan, Vinodkumar Prabhakaran,
Rachel Rosen
- Abstract summary: We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: who the annotator is, and how the annotators' lived experiences can impact their annotations, as well as what the relationship between annotators and crowdsourcing platforms affords them.
We put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline.
- Score: 7.480972965984986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human annotations play a crucial role in machine learning (ML) research and
development. However, the ethical considerations around the processes and
decisions that go into building ML datasets have not received nearly enough
attention. In this paper, we survey an array of literature that provides
insights into ethical considerations around crowdsourced dataset annotation. We
synthesize these insights, and lay out the challenges in this space along two
layers: (1) who the annotator is, and how the annotators' lived experiences can
impact their annotations, and (2) the relationship between the annotators and
the crowdsourcing platforms and what that relationship affords them. Finally,
we put forth a concrete set of recommendations and considerations for dataset
developers at various stages of the ML data pipeline: task formulation,
selection of annotators, platform and infrastructure choices, dataset analysis
and evaluation, and dataset documentation and release.
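As one concrete illustration of the "dataset analysis and evaluation" stage listed above, the sketch below disaggregates annotations by annotator subgroup instead of collapsing them into a single majority label. This is a hypothetical example, not code from the paper; the frame, the `group` and `label` columns, and all values are illustrative.

```python
import pandas as pd

# Toy annotation table: each row is one annotator's judgment on one item,
# alongside a (self-reported) identity group for that annotator.
annotations = pd.DataFrame({
    "item_id":   [1, 1, 1, 1, 2, 2, 2, 2],
    "annotator": ["a1", "a2", "a3", "a4", "a1", "a2", "a3", "a4"],
    "group":     ["G1", "G2", "G1", "G2", "G1", "G2", "G1", "G2"],
    "label":     [1, 0, 1, 0, 0, 0, 0, 1],   # e.g., 1 = "toxic"
})

# Per-item, per-group positive-label rates. A large gap between groups
# flags items where lived experience may drive systematic disagreement.
rates = annotations.groupby(["item_id", "group"])["label"].mean().unstack()
rates["gap"] = (rates["G1"] - rates["G2"]).abs()
print(rates.sort_values("gap", ascending=False))
```

Items with a large gap are natural candidates for the annotator-aware review the paper recommends: a simple majority vote over such items would silently erase exactly this signal.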
Related papers
- Position: Measure Dataset Diversity, Don't Just Claim It [8.551188808401294]
Dataset curators frequently employ value-laden terms such as diversity, bias, and quality to characterize datasets.
Despite their prevalence, these terms lack clear definitions and validation.
Our research explores the implications of this issue by analyzing "diversity" across 135 image and text datasets.
arXiv Detail & Related papers (2024-07-11T05:13:27Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset documentation pages on Hugging Face; a minimal sketch of pulling dataset cards for this kind of analysis appears after this list.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z)
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset; a minimal sketch of this setup appears after this list.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation [8.447159556925182]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators' lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship affords them.
We introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decisions points at various stages of the data annotation pipeline.
arXiv Detail & Related papers (2022-06-09T23:31:17Z)
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performance and achieving population-level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
- Data and its (dis)contents: A survey of dataset development and use in machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z)
- An Ethical Highlighter for People-Centric Dataset Creation [62.886916477131486]
We propose an analytical framework to guide ethical evaluation of existing datasets and to serve future dataset creators in avoiding missteps.
Our work is informed by a review and analysis of prior works and highlights where such ethical challenges arise.
arXiv Detail & Related papers (2020-11-27T07:18:44Z)
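The "Navigating Dataset Documentations" entry above lends itself to a small illustration. The sketch below pulls dataset cards from the Hugging Face Hub and tallies their top-level sections; it assumes the `huggingface_hub` client library as one reasonable route to the same data, and it is not the authors' pipeline (their study covers all 7,433 cards; the `limit` here just keeps the sketch cheap).

```python
from collections import Counter

from huggingface_hub import DatasetCard, list_datasets

section_counts = Counter()
for info in list_datasets(limit=50):      # the paper analyzes all 7,433 cards
    try:
        card = DatasetCard.load(info.id)
    except Exception:
        continue                           # many repos carry no README/card
    # Tally top-level "## ..." sections as a rough proxy for which parts
    # of the documentation template authors actually fill in.
    for line in (card.text or "").splitlines():
        if line.startswith("## "):
            section_counts[line[3:].strip()] += 1

for name, n in section_counts.most_common(10):
    print(f"{n:4d}  {name}")
```

Counts like these make it easy to see which documentation sections are routinely present and which are left empty at scale.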
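The "Capture the Flag" entry also describes an evaluation loop concrete enough to sketch. The outline below is a hypothetical reconstruction of the general idea, not the paper's harness: `ask_model` is a stub to be replaced by any LLM client, and the planted flag, prompt, and string matcher are all illustrative.

```python
import random

def make_table(n=100, flag_row=7):
    """Synthetic table with one planted anomaly (the 'flag')."""
    rows = [{"id": i, "revenue": round(random.gauss(100, 5), 2)} for i in range(n)]
    rows[flag_row]["revenue"] = 900.0   # extreme outlier an analyst should spot
    return rows

def ask_model(prompt: str) -> str:
    raise NotImplementedError("swap in any LLM client here")

def captured(answer: str, flag_row: int) -> bool:
    # Generous string match: credit the model if it names the planted row.
    return f"row {flag_row}" in answer.lower() or f"id {flag_row}" in answer.lower()

table = make_table()
prompt = ("Here is a table; list its most notable anomalies:\n"
          + "\n".join(str(row) for row in table))
# capture rate = fraction of randomized trials where
# captured(ask_model(prompt), flag_row) is True
```

Averaging the `captured` outcomes over many randomized tables and flag types yields the capture rate that gives the methodology its name.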
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.