The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
- URL: http://arxiv.org/abs/2310.16787v3
- Date: Sat, 4 Nov 2023 19:10:06 GMT
- Title: The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI
- Authors: Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien
Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara,
Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu,
Luis Villa, Sandy Pentland, Sara Hooker
- Abstract summary: We convene legal and machine learning experts to systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission rates of 70%+ and error rates of 50%+.
- Score: 41.32981860191232
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The race to train language models on vast, diverse, and inconsistently
documented datasets has raised pressing concerns about the legal and ethical
risks for practitioners. To remedy these practices threatening data
transparency and understanding, we convene a multi-disciplinary effort between
legal and machine learning experts to systematically audit and trace 1800+ text
datasets. We develop tools and standards to trace the lineage of these
datasets, from their source and creators through their license conditions,
properties, and subsequent use. Our landscape analysis highlights the sharp
divides in composition and focus of commercially open vs closed datasets, with
closed datasets monopolizing important categories: lower resource languages,
more creative tasks, richer topic variety, newer and more synthetic training
data. This points to a deepening divide in the types of data that are made
available under different license conditions, and heightened implications for
jurisdictional legal interpretations of copyright and fair use. We also observe
frequent miscategorization of licenses on widely used dataset hosting sites,
with license omission of 70%+ and error rates of 50%+. This points to a crisis
in misattribution and informed use of the most popular datasets driving many
recent breakthroughs. As a contribution to ongoing improvements in dataset
transparency and responsible use, we release our entire audit, with an
interactive UI, the Data Provenance Explorer, which allows practitioners to
trace and filter on data provenance for the most popular open source finetuning
data collections: www.dataprovenance.org.
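The filtering workflow the abstract describes (tracing each dataset's license conditions and selecting collections fit for a given use) can be sketched with a minimal example. The record schema, license identifiers, and function below are hypothetical illustrations, not the actual data model or API of the Data Provenance Explorer:

```python
# Minimal sketch of filtering dataset records by license category.
# The schema and license sets here are illustrative assumptions,
# not the Data Provenance Explorer's real data model.

def filter_by_license(records, allow_commercial=True):
    """Return records whose license permits the requested use."""
    permissive = {"apache-2.0", "mit", "cc-by-4.0"}
    noncommercial = {"cc-by-nc-4.0", "custom-nc"}
    # Non-commercial research can draw on both pools; commercial
    # use is restricted to the permissive licenses.
    allowed = permissive if allow_commercial else permissive | noncommercial
    return [r for r in records if r["license"] in allowed]

datasets = [
    {"name": "example-instruct", "license": "apache-2.0"},
    {"name": "example-dialogue", "license": "cc-by-nc-4.0"},
    {"name": "example-qa", "license": "mit"},
]

commercial_ok = filter_by_license(datasets, allow_commercial=True)
```

In practice, as the audit stresses, the license recorded on a hosting site may be missing or wrong, so the reliability of any such filter rests on the provenance of the license metadata itself.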
Related papers
- A Systematic Review of NeurIPS Dataset Management Practices [7.974245534539289]
We present a systematic review of datasets published in the NeurIPS Datasets and Benchmarks track, focusing on four key aspects: provenance, distribution, ethical disclosure, and licensing.
Our findings reveal that dataset provenance is often unclear due to ambiguous filtering and curation processes.
These inconsistencies underscore the urgent need for standardized data infrastructures for the publication and management of datasets.
arXiv Detail & Related papers (2024-10-31T23:55:41Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings [20.37092575427039]
The Vehicle Claims dataset consists of fraudulent insurance claims for automotive repairs.
We tackle the common problem of missing benchmark datasets for anomaly detection.
We evaluate the dataset with both shallow and deep learning methods.
arXiv Detail & Related papers (2022-10-25T14:33:17Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Customs Import Declaration Datasets [12.306592823750385]
We introduce an import declaration dataset to facilitate the collaboration between domain experts in customs administrations and researchers from diverse domains.
The dataset contains 54,000 artificially generated trades with 22 key attributes.
We empirically show that more advanced algorithms can better detect fraud.
arXiv Detail & Related papers (2022-08-04T06:20:20Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Deep Transfer Learning for Multi-source Entity Linkage via Domain Adaptation [63.24594955429465]
Multi-source entity linkage is critical in high-impact applications such as data cleaning and user stitching.
AdaMEL is a deep transfer learning framework that learns generic high-level knowledge to perform multi-source entity linkage.
Our framework achieves state-of-the-art results with 8.21% improvement on average over methods based on supervised learning.
arXiv Detail & Related papers (2021-10-27T15:20:41Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
- Multimodal datasets: misogyny, pornography, and malignant stereotypes [2.8682942808330703]
We examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset.
We found that the dataset contains troublesome and explicit image-text pairs depicting rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content.
arXiv Detail & Related papers (2021-10-05T11:47:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.