Can I use this publicly available dataset to build commercial AI
software? Most likely not
- URL: http://arxiv.org/abs/2111.02374v1
- Date: Wed, 3 Nov 2021 17:44:06 GMT
- Title: Can I use this publicly available dataset to build commercial AI
software? Most likely not
- Authors: Gopi Krishnan Rajbahadur, Erika Tuck, Li Zi, Zhang Wei, Dayi Lin,
Boyuan Chen, Zhen Ming (Jack) Jiang, Daniel Morales German
- Abstract summary: We propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software.
Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes.
- Score: 8.853674186565934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Publicly available datasets are one of the key drivers for commercial AI
software. The use of publicly available datasets (particularly for commercial
purposes) is governed by dataset licenses. These dataset licenses outline the
rights one is entitled to on a given dataset and the obligations that one must
fulfil to enjoy such rights without any license compliance violations. However,
unlike standardized Open Source Software (OSS) licenses, existing dataset
licenses are defined in an ad-hoc manner and do not clearly outline the rights
and obligations associated with their usage. This makes checking for potential
license compliance violations difficult. Further, a public dataset may be
hosted in multiple locations and created from multiple data sources each of
which may have different licenses. Hence, existing approaches on checking OSS
license compliance cannot be used. In this paper, we propose a new approach to
assess the potential license compliance violations if a given publicly
available dataset were to be used for building commercial AI software. We
conduct trials of our approach on two product groups within Huawei on 6
commonly used publicly available datasets. Our results show that there are
risks of license violations on 5 of these 6 studied datasets if they were used
for commercial purposes. Consequently, we provide recommendations for AI
engineers on how to better assess publicly available datasets for license
compliance violations.
Related papers
- Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset.
In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled.
We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
- OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing.
OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services.
We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
- On the Standardization of Behavioral Use Clauses and Their Adoption for
Responsible Licensing of AI [27.748532981456464]
In 2018, licenses with behavioral-use clauses were proposed to give developers a framework for releasing AI assets.
As of the end of 2023, on the order of 40,000 software and model repositories have adopted responsible AI licenses.
arXiv Detail & Related papers (2024-02-07T22:29:42Z)
- Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX
Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
- The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing
& Attribution in AI [41.32981860191232]
We convene legal and machine learning experts to systematically audit and trace 1800+ text datasets.
Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets.
We also observe frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
- The Software Heritage License Dataset (2022 Edition) [0.0]
The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided.
The dataset can be used to conduct empirical studies on open source licensing, training of automated license classifiers, and natural language processing (NLP) analyses of legal texts.
arXiv Detail & Related papers (2023-08-22T08:01:07Z)
- SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference.
SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text.
Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
- LiResolver: License Incompatibility Resolution for Open Source Software [13.28021004336228]
LiResolver is a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software.
Comprehensive experiments demonstrate the effectiveness of LiResolver, with 4.09% false positive (FP) rate and 0.02% false negative (FN) rate for incompatibility issue localization.
arXiv Detail & Related papers (2023-06-26T13:16:09Z)
- Foundation Models and Fair Use [96.04664748698103]
In the U.S. and other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine.
In this work, we survey the potential risks of developing and deploying foundation models based on copyrighted content.
We discuss technical mitigations that can help foundation models stay in line with fair use.
arXiv Detail & Related papers (2023-03-28T03:58:40Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.