The Problem of Zombie Datasets: A Framework For Deprecating Datasets
- URL: http://arxiv.org/abs/2111.04424v1
- Date: Mon, 18 Oct 2021 20:13:51 GMT
- Title: The Problem of Zombie Datasets: A Framework For Deprecating Datasets
- Authors: Frances Corry, Hamsini Sridharan, Alexandra Sasha Luccioni, Mike
Ananny, Jason Schultz, Kate Crawford
- Abstract summary: We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
- Score: 55.878249096379804
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: What happens when a machine learning dataset is deprecated for legal,
ethical, or technical reasons, but continues to be widely used? In this paper,
we examine the public afterlives of several prominent deprecated or redacted
datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC,
Brainwash, and HRT Transgender, in order to inform a framework for more
consistent, ethical, and accountable dataset deprecation. Building on prior
research, we find that there is a lack of consistency, transparency, and
centralized sourcing of information on the deprecation of datasets, and as
such, these datasets and their derivatives continue to be cited in papers and
circulate online. These datasets that never die -- which we term "zombie
datasets" -- continue to inform the design of production-level systems, causing
technical, legal, and ethical challenges; in so doing, they risk perpetuating
the harms that prompted their supposed withdrawal, including concerns around
bias, discrimination, and privacy. Based on this analysis, we propose a Dataset
Deprecation Framework that includes considerations of risk, mitigation of
impact, appeal mechanisms, timeline, post-deprecation protocol, and publication
checks that can be adapted and implemented by the machine learning community.
Drawing on work on datasheets and checklists, we further offer two sample
dataset deprecation sheets and propose a centralized repository that tracks
which datasets have been deprecated and could be incorporated into the
publication protocols of venues like NeurIPS.
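The six considerations the framework names lend themselves to a machine-readable record that a centralized repository could store per dataset. A minimal sketch of such a deprecation sheet in Python (all field names, the class, and the example dataset are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DeprecationSheet:
    """Illustrative record mirroring the framework's six considerations."""
    dataset_name: str
    risks: list[str]                  # harms motivating deprecation (bias, privacy, ...)
    mitigation: str                   # how impact on existing users is reduced
    appeal_mechanism: str             # how stakeholders can contest the decision
    timeline: date                    # when access is withdrawn
    post_deprecation_protocol: str    # handling of mirrors, derivatives, citations
    publication_checks: bool = False  # whether venues screen submissions for this dataset


# Hypothetical entry as it might appear in a centralized repository
sheet = DeprecationSheet(
    dataset_name="ExampleFaces-1M",
    risks=["privacy", "demographic bias"],
    mitigation="redirect download links to a deprecation notice",
    appeal_mechanism="written appeal to the maintainers' review board",
    timeline=date(2022, 1, 1),
    post_deprecation_protocol="request takedown of mirrors and derivative datasets",
    publication_checks=True,
)
print(sheet.dataset_name, sheet.publication_checks)
```

A repository of such records could back the publication checks the authors propose: a venue's submission pipeline would look up each cited dataset and flag any entry whose sheet marks it deprecated.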
Related papers
- A Systematic Review of NeurIPS Dataset Management Practices [7.974245534539289]
We present a systematic review of datasets published in the NeurIPS dataset track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing.
Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes.
These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
arXiv Detail & Related papers (2024-10-31T23:55:41Z)
- The MERIT Dataset: Modelling and Efficiently Rendering Interpretable Transcripts [0.0]
This paper introduces the MERIT dataset, a fully labeled dataset within the context of school reports.
By its nature, the MERIT dataset can potentially include biases in a controlled way, making it a valuable tool for benchmarking biases induced in Large Language Models (LLMs).
To demonstrate the dataset's utility, we present a benchmark with token classification models, showing that the dataset poses a significant challenge even for SOTA models.
arXiv Detail & Related papers (2024-08-31T12:56:38Z)
- Fact Checking Beyond Training Set [64.88575826304024]
We show that the retriever-reader suffers from performance deterioration when it is trained on labeled data from one domain and used in another domain.
We propose an adversarial algorithm to make the retriever component robust against distribution shift.
We then construct eight fact checking scenarios from these datasets, and compare our model to a set of strong baseline models.
arXiv Detail & Related papers (2024-03-27T15:15:14Z)
- A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
We convene legal and machine learning experts to systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We find frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos [106.06278332186106]
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction.
Numerous limitations exist within existing public MSMO datasets.
We have meticulously curated the MMSum dataset.
arXiv Detail & Related papers (2023-06-07T07:43:11Z)
- Multimodal datasets: misogyny, pornography, and malignant stereotypes [2.8682942808330703]
We examine the recently released LAION-400M dataset, a CLIP-filtered dataset of image-alt-text pairs parsed from the Common Crawl corpus.
We found that the dataset contains troublesome and explicit image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
arXiv Detail & Related papers (2021-10-05T11:47:27Z)
- Mitigating dataset harms requires stewardship: Lessons from 1000 papers [8.469320512479456]
We study three influential face and person recognition datasets by analyzing nearly 1000 papers.
We find that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns.
arXiv Detail & Related papers (2021-08-06T02:52:36Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.