CrowdWorkSheets: Accounting for Individual and Collective Identities
Underlying Crowdsourced Dataset Annotation
- URL: http://arxiv.org/abs/2206.08931v1
- Date: Thu, 9 Jun 2022 23:31:17 GMT
- Title: CrowdWorkSheets: Accounting for Individual and Collective Identities
Underlying Crowdsourced Dataset Annotation
- Authors: Mark Diaz, Ian D. Kivlichan, Rachel Rosen, Dylan K. Baker, Razvan
Amironesei, Vinodkumar Prabhakaran, Emily Denton
- Abstract summary: We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators' lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship affords them.
We introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline.
- Score: 8.447159556925182
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human annotated data plays a crucial role in machine learning (ML) research
and development. However, the ethical considerations around the processes and
decisions that go into dataset annotation have not received nearly enough
attention. In this paper, we survey an array of literature that provides
insights into ethical considerations around crowdsourced dataset annotation. We
synthesize these insights, and lay out the challenges in this space along two
layers: (1) who the annotator is, and how the annotators' lived experiences can
impact their annotations, and (2) the relationship between the annotators and
the crowdsourcing platforms, and what that relationship affords them. Finally,
we introduce a novel framework, CrowdWorkSheets, for dataset developers to
facilitate transparent documentation of key decision points at various stages
of the data annotation pipeline: task formulation, selection of annotators,
platform and infrastructure choices, dataset analysis and evaluation, and
dataset release and maintenance.
Related papers
- BookWorm: A Dataset for Character Description and Analysis [59.186325346763184]
We define two tasks: character description, which generates a brief factual profile, and character analysis, which offers an in-depth interpretation.
We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and analyses.
Our findings show that retrieval-based approaches outperform hierarchical ones in both tasks.
arXiv Detail & Related papers (2024-10-14T10:55:58Z)
- Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset documentation pages on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - infoVerse: A Universal Framework for Dataset Characterization with
Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z) - Documenting Data Production Processes: A Participatory Approach for Data
Work [4.811554861191618]
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z) - Data Cards: Purposeful and Transparent Dataset Documentation for
Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z) - Whose Ground Truth? Accounting for Individual and Collective Identities
Underlying Dataset Annotation [7.480972965984986]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: who the annotator is, and how the annotators' lived experiences can impact their annotations.
We put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline.
arXiv Detail & Related papers (2021-12-08T19:56:56Z) - Representation Matters: Assessing the Importance of Subgroup Allocations
in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performance and achieving population-level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z) - Data and its (dis)contents: A survey of dataset development and use in
machine learning research [11.042648980854487]
We survey the many concerns raised about the way we collect and use data in machine learning.
We advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.
arXiv Detail & Related papers (2020-12-09T22:13:13Z) - Weakly-Supervised Aspect-Based Sentiment Analysis via Joint
Aspect-Sentiment Topic Embedding [71.2260967797055]
We propose a weakly-supervised approach for aspect-based sentiment analysis.
We learn ⟨sentiment, aspect⟩ joint topic embeddings in the word embedding space.
We then use neural models to generalize the word-level discriminative information.
arXiv Detail & Related papers (2020-10-13T21:33:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.