From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem
- URL: http://arxiv.org/abs/2509.09873v1
- Date: Thu, 11 Sep 2025 21:46:20 GMT
- Title: From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem
- Authors: James Jewitt, Hao Li, Bram Adams, Gopi Krishnan Rajbahadur, Ahmed E. Hassan,
- Abstract summary: Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks.<n>We present the first end-to-end audit of licenses for datasets and models on Hugging Face.
- Score: 12.206378714907075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance in which 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes almost 200 SPDX and model-specific clauses for detecting license conflicts, which can solve 86.4% of license conflicts in software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale.
Related papers
- The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI [1.2776470520481564]
This article introduces the Contextual Copyleft AI (CCAI) license, a novel licensing mechanism that extends copyleft requirements from training data to the resulting generative AI models.<n>The CCAI license offers significant advantages, including enhanced developer control, incentivization of open source AI development, and mitigation of openwashing practices.
arXiv Detail & Related papers (2025-07-17T01:42:51Z) - Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management [10.002122950923967]
Integrating third-party software components is a common practice in modern software development.<n>A lack of understanding may lead to disputes, which can pose serious legal and operational challenges.
arXiv Detail & Related papers (2025-07-03T14:02:15Z) - Decompiling Smart Contracts with a Large Language Model [51.49197239479266]
Despite Etherscan's 78,047,845 smart contracts deployed on (as of May 26, 2025), a mere 767,520 ( 1%) are open source.<n>This opacity necessitates the automated semantic analysis of on-chain smart contract bytecode.<n>We introduce a pioneering decompilation pipeline that transforms bytecode into human-readable and semantically faithful Solidity code.
arXiv Detail & Related papers (2025-06-24T13:42:59Z) - New Tools are Needed for Tracking Adherence to AI Model Behavioral Use Clauses [21.783728820999933]
Concerns over negligent or malicious uses of AI have led to the design of mechanisms to limit the risks of the technology.<n>The result has been a proliferation of licenses with behavioral-use clauses and acceptable-use-policies.<n>In this paper we take the position that tools for tracking adoption of, and adherence to, these licenses is the natural next step.
arXiv Detail & Related papers (2025-05-28T12:26:55Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing [45.6582862121583]
This paper argues that a dataset's legal risk cannot be accurately assessed by its license terms alone.<n>It argues that tracking dataset redistribution and its full lifecycle is essential.<n>We show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts.
arXiv Detail & Related papers (2025-03-04T16:57:53Z) - Fundamental Risks in the Current Deployment of General-Purpose AI Models: What Have We (Not) Learnt From Cybersecurity? [60.629883024152576]
Large Language Models (LLMs) have seen rapid deployment in a wide range of use cases.<n>OpenAIs Altera are just a few examples of increased autonomy, data access, and execution capabilities.<n>These methods come with a range of cybersecurity challenges.
arXiv Detail & Related papers (2024-12-19T14:44:41Z) - Consent in Crisis: The Rapid Decline of the AI Data Commons [74.68176012363253]
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data.
We conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora.
arXiv Detail & Related papers (2024-07-20T16:50:18Z) - Near to Mid-term Risks and Opportunities of Open-Source Generative AI [94.06233419171016]
Applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education.
The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation.
This regulation is likely to put at risk the budding field of open-source Generative AI.
arXiv Detail & Related papers (2024-04-25T21:14:24Z) - On the Standardization of Behavioral Use Clauses and Their Adoption for
Responsible Licensing of AI [27.748532981456464]
In 2018, licenses with behaviorial-use clauses were proposed to give developers a framework for releasing AI assets.
As of the end of 2023, on the order of 40,000 software and model repositories have adopted responsible AI licenses.
arXiv Detail & Related papers (2024-02-07T22:29:42Z) - Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX
Licenses [16.948633594354412]
Third-party libraries (TPLs) in software development has accelerated the creation of modern software.
Developers may inadvertently violate the licenses of TPLs, leading to legal issues.
There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.