Related papers: Can I use this publicly available dataset to build commercial AI software? Most likely not

URL: http://arxiv.org/abs/2111.02374v1
Date: Wed, 3 Nov 2021 17:44:06 GMT
Title: Can I use this publicly available dataset to build commercial AI software? Most likely not
Authors: Gopi Krishnan Rajbahadur, Erika Tuck, Li Zi, Zhang Wei, Dayi Lin, Boyuan Chen, Zhen Ming (Jack) Jiang, Daniel Morales German
Abstract summary: We propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software. Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes.
Score: 8.853674186565934
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Publicly available datasets are one of the key drivers for commercial AI software. The use of publicly available datasets (particularly for commercial purposes) is governed by dataset licenses. These dataset licenses outline the rights one is entitled to on a given dataset and the obligations that one must fulfil to enjoy such rights without any license compliance violations. However, unlike standardized Open Source Software (OSS) licenses, existing dataset licenses are defined in an ad-hoc manner and do not clearly outline the rights and obligations associated with their usage. This makes checking for potential license compliance violations difficult. Further, a public dataset may be hosted in multiple locations and created from multiple data sources each of which may have different licenses. Hence, existing approaches on checking OSS license compliance cannot be used. In this paper, we propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software. We conduct trials of our approach on two product groups within Huawei on 6 commonly used publicly available datasets. Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes. Consequently, we provide recommendations for AI engineers on how to better assess publicly available datasets for license compliance violations.

Related papers

Permissive-Washing in the Open AI Supply Chain: A Large-Scale Audit of License Integrity [12.206378714907075]
Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI. Permissive washing: labeling AI artifacts as free to use, while omitting the legal documentation required to make that label actionable. We audit 124,278 dataset → model → application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub.
arXiv Detail & Related papers (2026-02-09T15:51:36Z)
VICTOR: Dataset Copyright Auditing in Video Recognition Systems [47.270150440169324]
We propose VICTOR, the first dataset copyright auditing approach for video recognition systems. VICTOR amplifies the impact of published modified samples on the prediction behavior of the target models. We show that VICTOR is robust in the presence of several perturbation mechanisms to the training videos or the target models.
arXiv Detail & Related papers (2025-12-16T14:26:01Z)
From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem [12.206378714907075]
Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks. We present the first end-to-end audit of licenses for datasets and models on Hugging Face.
arXiv Detail & Related papers (2025-09-11T21:46:20Z)
Hey, That's My Data! Label-Only Dataset Inference in Large Language Models [63.35066172530291]
CatShift is a label-only dataset-inference framework. It capitalizes on catastrophic forgetting: the tendency of an LLM to overwrite previously learned knowledge when exposed to new data.
arXiv Detail & Related papers (2025-06-06T13:02:59Z)
Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing [45.6582862121583]
This paper argues that a dataset's legal risk cannot be accurately assessed by its license terms alone. It argues that tracking dataset redistribution and its full lifecycle is essential. We show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts.
arXiv Detail & Related papers (2025-03-04T16:57:53Z)
LicenseGPT: A Fine-tuned Foundation Model for Publicly Available Dataset License Compliance [27.595354325922436]
We introduce LicenseGPT, a fine-tuned foundation model (FM) specifically designed for dataset license compliance analysis. We evaluate existing legal FMs and find that the best-performing model achieves a Prediction Agreement (PA) of only 43.75%. We demonstrate that LicenseGPT reduces analysis time by 94.44%, from 108 seconds to 6 seconds per license, without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T19:04:13Z)
"They've Stolen My GPL-Licensed Model!": Toward Standardized and Transparent Model Licensing [30.19362102481241]
We develop a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues. Our analysis tool is built on the Turtle language and the Notation3 reasoning engine, envisioned as a first step toward Linked Open Model Data.
arXiv Detail & Related papers (2024-12-16T06:52:09Z)
Data Distribution Valuation [56.71023681599737]
Existing data valuation methods define a value for a discrete dataset. In many use cases, users are interested in not only the value of the dataset, but that of the distribution from which the dataset was sampled. We propose a maximum mean discrepancy (MMD)-based valuation method which enables theoretically principled and actionable policies.
arXiv Detail & Related papers (2024-10-06T07:56:53Z)
OpenDataLab: Empowering General Artificial Intelligence with Open Datasets [53.22840149601411]
This paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing. OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services. We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields.
arXiv Detail & Related papers (2024-06-04T10:42:01Z)
On the Standardization of Behavioral Use Clauses and Their Adoption for Responsible Licensing of AI [27.748532981456464]
In 2018, licenses with behavioral-use clauses were proposed to give developers a framework for releasing AI assets. As of the end of 2023, on the order of 40,000 software and model repositories have adopted responsible AI licenses.
arXiv Detail & Related papers (2024-02-07T22:29:42Z)
Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software. Developers may inadvertently violate the licenses of TPLs, leading to legal issues. There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI [41.32981860191232]
Legal and machine learning experts systematically audit and trace 1800+ text datasets. Our landscape analysis highlights the sharp divides in composition and focus of commercially open vs closed datasets. The audit finds frequent miscategorization of licenses on widely used dataset hosting sites, with license omission of 70%+ and error rates of 50%+.
arXiv Detail & Related papers (2023-10-25T17:20:26Z)
The Software Heritage License Dataset (2022 Edition) [0.0]
The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided. The dataset can be used to conduct empirical studies on open source licensing, training of automated license classifiers, and natural language processing (NLP) analyses of legal texts.
arXiv Detail & Related papers (2023-08-22T08:01:07Z)
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text. Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
LiResolver: License Incompatibility Resolution for Open Source Software [13.28021004336228]
LiResolver is a fine-grained, scalable, and flexible tool to resolve license incompatibility issues for open source software. Comprehensive experiments demonstrate the effectiveness of LiResolver, with 4.09% false positive (FP) rate and 0.02% false negative (FN) rate for incompatibility issue localization.
arXiv Detail & Related papers (2023-06-26T13:16:09Z)
Foundation Models and Fair Use [96.04664748698103]
In the U.S. and other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. In this work, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We discuss technical mitigations that can help foundation models stay in line with fair use.
arXiv Detail & Related papers (2023-03-28T03:58:40Z)
The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender. We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.