Can I use this publicly available dataset to build commercial AI
software? Most likely not
- URL: http://arxiv.org/abs/2111.02374v1
- Date: Wed, 3 Nov 2021 17:44:06 GMT
- Title: Can I use this publicly available dataset to build commercial AI
software? Most likely not
- Authors: Gopi Krishnan Rajbahadur, Erika Tuck, Li Zi, Zhang Wei, Dayi Lin,
Boyuan Chen, Zhen Ming (Jack) Jiang, Daniel Morales German
- Abstract summary: We propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software.
Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes.
- Score: 8.853674186565934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Publicly available datasets are one of the key drivers for commercial AI
software. The use of publicly available datasets (particularly for commercial
purposes) is governed by dataset licenses. These dataset licenses outline the
rights one is entitled to on a given dataset and the obligations that one must
fulfil to enjoy such rights without any license compliance violations. However,
unlike standardized Open Source Software (OSS) licenses, existing dataset
licenses are defined in an ad-hoc manner and do not clearly outline the rights
and obligations associated with their usage. This makes checking for potential
license compliance violations difficult. Further, a public dataset may be
hosted in multiple locations and created from multiple data sources each of
which may have different licenses. Hence, existing approaches on checking OSS
license compliance cannot be used. In this paper, we propose a new approach to
assess the potential license compliance violations if a given publicly
available dataset were to be used for building commercial AI software. We
conduct trials of our approach on two product groups within Huawei on 6
commonly used publicly available datasets. Our results show that there are
risks of license violations on 5 of these 6 studied datasets if they were used
for commercial purposes. Consequently, we provide recommendations for AI
engineers on how to better assess publicly available datasets for license
compliance violations.
Related papers
- LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance [27.595354325922436]
We introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis.
We evaluate existing legal FMs and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%.
We demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T19:04:13Z)
- "They've Stolen My GPL-Licensed Model!": Toward Standardized and Transparent Model Licensing [30.19362102481241]
We develop a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues.
Our analysis tool is built on the Turtle language and the Notation3 reasoning engine, and is envisioned as a first step toward Linked Open Model Data.
arXiv Detail & Related papers (2024-12-16T06:52:09Z)
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
This study presents a reusable and comprehensive dataset of open source software (OSS) licenses.
We found and identified 5.5 million distinct license blobs in OSS projects.
The dataset is open, providing a valuable resource for developers, researchers, and legal professionals in the OSS community.
arXiv Detail & Related papers (2024-09-07T13:34:55Z)
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
Legal and machine learning experts systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text.
Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
- Foundation Models and Fair Use [96.04664748698103]
In the U.S. and other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine.
In this work, we survey the potential risks of developing and deploying foundation models based on copyrighted content.
We discuss technical mitigations that can help foundation models stay in line with fair use.
arXiv Detail & Related papers (2023-03-28T03:58:40Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.