Permissive-Washing in the Open AI Supply Chain: A Large-Scale Audit of License Integrity
- URL: http://arxiv.org/abs/2602.08816v1
- Date: Mon, 09 Feb 2026 15:51:36 GMT
- Title: Permissive-Washing in the Open AI Supply Chain: A Large-Scale Audit of License Integrity
- Authors: James Jewitt, Gopi Krishnan Rajbahadur, Hao Li, Bram Adams, Ahmed E. Hassan,
- Abstract summary: Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI.<n>Permissive washing: labeling AI artifacts as free to use, while omitting the legal documentation required to make that label actionable.<n>We audit 124,278 dataset $rightarrow$ model $rightarrow$ application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub.
- Score: 12.206378714907075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Permissive licenses like MIT, Apache-2.0, and BSD-3-Clause dominate open-source AI, signaling that artifacts like models, datasets, and code can be freely used, modified, and redistributed. However, these licenses carry mandatory requirements: include the full license text, provide a copyright notice, and preserve upstream attribution, that remain unverified at scale. Failure to meet these conditions can place reuse outside the scope of the license, effectively leaving AI artifacts under default copyright for those uses and exposing downstream users to litigation. We call this phenomenon ``permissive washing'': labeling AI artifacts as free to use, while omitting the legal documentation required to make that label actionable. To assess how widespread permissive washing is in the AI supply chain, we empirically audit 124,278 dataset $\rightarrow$ model $\rightarrow$ application supply chains, spanning 3,338 datasets, 6,664 models, and 28,516 applications across Hugging Face and GitHub. We find that an astonishing 96.5\% of datasets and 95.8\% of models lack the required license text, only 2.3\% of datasets and 3.2\% of models satisfy both license text and copyright requirements, and even when upstream artifacts provide complete licensing evidence, attribution rarely propagates downstream: only 27.59\% of models preserve compliant dataset notices and only 5.75\% of applications preserve compliant model notices (with just 6.38\% preserving any linked upstream notice). Practitioners cannot assume permissive labels confer the rights they claim: license files and notices, not metadata, are the source of legal truth. To support future research, we release our full audit dataset and reproducible pipeline.
Related papers
- From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem [12.206378714907075]
Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks.<n>We present the first end-to-end audit of licenses for datasets and models on Hugging Face.
arXiv Detail & Related papers (2025-09-11T21:46:20Z) - CertDW: Towards Certified Dataset Ownership Verification via Conformal Prediction [48.82467166657901]
We propose the first certified dataset watermark (i.e., CertDW) and CertDW-based certified dataset ownership verification method.<n>Inspired by conformal prediction, we introduce two statistical measures, including principal probability (PP) and watermark robustness (WR)<n>We prove there exists a provable lower bound between PP and WR, enabling ownership verification when a suspicious model's WR value significantly exceeds the PP values of benign models trained on watermark-free datasets.
arXiv Detail & Related papers (2025-06-16T07:17:23Z) - Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing [45.6582862121583]
This paper argues that a dataset's legal risk cannot be accurately assessed by its license terms alone.<n>It argues that tracking dataset redistribution and its full lifecycle is essential.<n>We show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts.
arXiv Detail & Related papers (2025-03-04T16:57:53Z) - Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs [67.0310240737424]
We introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by the RA-LLMs.<n>Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset.<n>During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs.
arXiv Detail & Related papers (2025-02-15T04:56:45Z) - "They've Stolen My GPL-Licensed Model!": Toward Standardized and Transparent Model Licensing [30.19362102481241]
We develop a new vocabulary for ML workflow management and encoded license rules to enable ontological reasoning for analyzing rights granting and compliance issues.<n>Our analysis tool is built on Turtle language and Notation3 reasoning engine, envisioned as first step toward Linked Open Model Data.
arXiv Detail & Related papers (2024-12-16T06:52:09Z) - OSS License Identification at Scale: A Comprehensive Dataset Using World of Code [4.954816514146113]
This study presents a reusable and comprehensive dataset of open source software (OSS) licenses.<n>We found and identified 5.5 million distinct license blobs in OSS projects.<n>The dataset is open, providing a valuable resource for developers, researchers, and legal professionals in the OSS community.
arXiv Detail & Related papers (2024-09-07T13:34:55Z) - Trustless Audits without Revealing Data or Models [49.23322187919369]
We show that it is possible to allow model providers to keep their model weights (but not architecture) and data secret while allowing other parties to trustlessly audit model and data properties.
We do this by designing a protocol called ZkAudit in which model providers publish cryptographic commitments of datasets and model weights.
arXiv Detail & Related papers (2024-04-06T04:43:06Z) - A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z) - Foundation Models and Fair Use [96.04664748698103]
In the U.S. and other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine.
In this work, we survey the potential risks of developing and deploying foundation models based on copyrighted content.
We discuss technical mitigations that can help foundation models stay in line with fair use.
arXiv Detail & Related papers (2023-03-28T03:58:40Z) - Can I use this publicly available dataset to build commercial AI
software? Most likely not [8.853674186565934]
We propose a new approach to assess the potential license compliance violations if a given publicly available dataset were to be used for building commercial AI software.
Our results show that there are risks of license violations on 5 of these 6 studied datasets if they were used for commercial purposes.
arXiv Detail & Related papers (2021-11-03T17:44:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.