Mitigating dataset harms requires stewardship: Lessons from 1000 papers
- URL: http://arxiv.org/abs/2108.02922v1
- Date: Fri, 6 Aug 2021 02:52:36 GMT
- Title: Mitigating dataset harms requires stewardship: Lessons from 1000 papers
- Authors: Kenny Peng and Arunesh Mathur and Arvind Narayanan
- Abstract summary: We study three influential face and person recognition datasets by analyzing nearly 1000 papers.
We find that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns.
- Score: 8.469320512479456
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Concerns about privacy, bias, and harmful applications have shone a light on
the ethics of machine learning datasets, even leading to the retraction of
prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2.
In response, the machine learning community has called for higher ethical
standards, transparency efforts, and technical fixes in the dataset creation
process. The premise of our work is that these efforts can be more effective if
informed by an understanding of how datasets are used in practice in the
research community. We study three influential face and person recognition
datasets - DukeMTMC, MS-Celeb-1M, and Labeled Faces in the Wild (LFW) - by
analyzing nearly 1000 papers that cite them. We found that the creation of
derivative datasets and models, broader technological and social change, the
lack of clarity of licenses, and dataset management practices can introduce a
wide range of ethical concerns. We conclude by suggesting a distributed
approach that can mitigate these harms, making recommendations to dataset
creators, conference program committees, dataset users, and the broader
research community.
Related papers
- A Critical Field Guide for Working with Machine Learning Datasets [0.716879432974126]
The Critical Field Guide for Working with Machine Learning Datasets offers practical guidance for conscientious dataset stewardship.
It provides questions, suggestions, strategies, and resources for working with existing machine learning datasets.
With it, students, journalists, artists, researchers, and developers can become better equipped to avoid the problems unique to datasets.
arXiv Detail & Related papers (2025-01-26T11:43:33Z)
- Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness [65.01625761120924]
We argue that a valuable sample should be informative of the task, non-redundant, and representative of the sample distribution (i.e., not an outlier).
We propose a collaborative framework, DataTailor, which leverages three key principles--informativeness, uniqueness, and representativeness--for effective data selection.
Experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data.
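The abstract names the three scoring principles but not the algorithm itself. Below is a minimal sketch of what three-criterion selection under a 15% budget might look like; the function name, the equal weighting, the embedding cosine-similarity measures, and the per-sample task score are all assumptions for illustration, not DataTailor's actual method.
```python
import numpy as np

def select_subset(features, task_scores, budget=0.15):
    """Hypothetical sketch of three-criterion data selection.

    features:    (n, d) array of sample embeddings (assumed given).
    task_scores: (n,) per-sample informativeness proxy, e.g. loss or
                 gradient norm under the target task (assumed given).
    budget:      fraction of data to keep (the paper reports results at 15%).
    """
    n = features.shape[0]
    # Normalize embeddings so cosine similarity is a dot product.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T  # pairwise cosine similarity

    # Informativeness: rescale the task score to [0, 1].
    informativeness = (task_scores - task_scores.min()) / (np.ptp(task_scores) + 1e-8)

    # Uniqueness: penalize samples whose nearest neighbor is very similar.
    np.fill_diagonal(sim, -np.inf)
    uniqueness = 1.0 - sim.max(axis=1)

    # Representativeness: reward samples close to the mean embedding direction.
    centroid = normed.mean(axis=0)
    representativeness = normed @ (centroid / np.linalg.norm(centroid))

    score = informativeness + uniqueness + representativeness  # equal weights (assumption)
    k = max(1, int(budget * n))
    return np.argsort(score)[-k:]  # indices of the selected samples
```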
arXiv Detail & Related papers (2024-12-09T08:36:10Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Eagle: Ethical Dataset Given from Real Interactions [74.7319697510621]
We create Eagle, a dataset extracted from real interactions between ChatGPT and users that exhibit social biases, toxicity, and other ethical problems.
Our experiments show that Eagle captures complementary aspects, not covered by existing datasets proposed for evaluation and mitigation of such ethical challenges.
arXiv Detail & Related papers (2024-02-22T03:46:02Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
We convene legal and machine learning experts to systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We find frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Metadata Archaeology: Unearthing Data Subsets by Leveraging Training Dynamics [3.9627732117855414]
We focus on providing a unified and efficient framework for Metadata Archaeology.
We curate different subsets of data that might exist in a dataset.
We leverage differences in learning dynamics between these probe suites to infer metadata of interest.
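The summary compresses the method heavily. As a rough illustration (not the paper's implementation), one can label examples by comparing their per-epoch loss trajectories to those of curated probe suites with a k-nearest-neighbor vote; the integer-coded probe labels and the Euclidean distance in trajectory space below are assumptions for this sketch.
```python
import numpy as np

def infer_metadata(probe_curves, probe_labels, sample_curves, k=5):
    """Hypothetical sketch of probe-based metadata inference.

    probe_curves:  (p, t) per-epoch loss trajectories of curated probe
                   examples (e.g. typical, mislabeled, atypical),
                   recorded during training (assumed precomputed).
    probe_labels:  (p,) integer probe-suite label for each probe example.
    sample_curves: (m, t) loss trajectories of the examples to classify.
    Returns one inferred probe-suite label per sample.
    """
    # Distance between every sample trajectory and every probe trajectory.
    dists = np.linalg.norm(
        sample_curves[:, None, :] - probe_curves[None, :, :], axis=2
    )
    nearest = np.argsort(dists, axis=1)[:, :k]  # k closest probes per sample
    labels = probe_labels[nearest]              # (m, k) neighbor labels
    # Majority vote over the k neighbors.
    return np.array([np.bincount(row).argmax() for row in labels])
```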
arXiv Detail & Related papers (2022-09-20T21:52:39Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
- Multimodal datasets: misogyny, pornography, and malignant stereotypes [2.8682942808330703]
We examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset.
We found that the dataset contains troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
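To make "CLIP-filtered" concrete: LAION-400M reportedly kept an image-alt-text pair only if the CLIP cosine similarity between the image and text embeddings cleared a threshold (0.3 in the dataset's description). Below is a hedged sketch of that criterion; the Hugging Face checkpoint is chosen for illustration and this is not the original pipeline.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch of CLIP-similarity filtering; checkpoint choice is an assumption.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_pair(image: Image.Image, alt_text: str, threshold: float = 0.3) -> bool:
    """Keep an (image, alt-text) pair only if CLIP similarity >= threshold."""
    inputs = processor(text=[alt_text], images=image,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize the projected embeddings, then take their cosine similarity.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum()) >= threshold
```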
arXiv Detail & Related papers (2021-10-05T11:47:27Z)