Related papers: From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem

URL: http://arxiv.org/abs/2509.09873v1
Date: Thu, 11 Sep 2025 21:46:20 GMT
Title: From Hugging Face to GitHub: Tracing License Drift in the Open-Source AI Ecosystem
Authors: James Jewitt, Hao Li, Bram Adams, Gopi Krishnan Rajbahadur, Ahmed E. Hassan
Abstract summary: Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks. We present the first end-to-end audit of licenses for datasets and models on Hugging Face.
Score: 12.206378714907075
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hidden license conflicts in the open-source AI ecosystem pose serious legal and ethical risks, exposing organizations to potential litigation and users to undisclosed risk. However, the field lacks a data-driven understanding of how frequently these conflicts occur, where they originate, and which communities are most affected. We present the first end-to-end audit of licenses for datasets and models on Hugging Face, as well as their downstream integration into open-source software applications, covering 364 thousand datasets, 1.6 million models, and 140 thousand GitHub projects. Our empirical analysis reveals systemic non-compliance in which 35.5% of model-to-application transitions eliminate restrictive license clauses by relicensing under permissive terms. In addition, we prototype an extensible rule engine that encodes almost 200 SPDX and model-specific clauses for detecting license conflicts, which can solve 86.4% of license conflicts in software applications. To support future research, we release our dataset and the prototype engine. Our study highlights license compliance as a critical governance challenge in open-source AI and provides both the data and tools necessary to enable automated, AI-aware compliance at scale.

Related papers

The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI [1.2776470520481564]
This article introduces the Contextual Copyleft AI (CCAI) license, a novel licensing mechanism that extends copyleft requirements from training data to the resulting generative AI models. The CCAI license offers significant advantages, including enhanced developer control, incentivization of open source AI development, and mitigation of openwashing practices.
arXiv Detail & Related papers (2025-07-17T01:42:51Z)
Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management [10.002122950923967]
Integrating third-party software components is a common practice in modern software development. A lack of understanding may lead to disputes, which can pose serious legal and operational challenges.
arXiv Detail & Related papers (2025-07-03T14:02:15Z)
Decompiling Smart Contracts with a Large Language Model [51.49197239479266]
Despite the 78,047,845 smart contracts that Etherscan reports as deployed (as of May 26, 2025), a mere 767,520 (under 1%) are open source. This opacity necessitates the automated semantic analysis of on-chain smart contract bytecode. We introduce a pioneering decompilation pipeline that transforms bytecode into human-readable and semantically faithful Solidity code.
arXiv Detail & Related papers (2025-06-24T13:42:59Z)
New Tools are Needed for Tracking Adherence to AI Model Behavioral Use Clauses [21.783728820999933]
Concerns over negligent or malicious uses of AI have led to the design of mechanisms to limit the risks of the technology. The result has been a proliferation of licenses with behavioral-use clauses and acceptable-use policies. In this paper we take the position that tools for tracking adoption of, and adherence to, these licenses are the natural next step.
arXiv Detail & Related papers (2025-05-28T12:26:55Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Do Not Trust Licenses You See: Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing [45.6582862121583]
This paper argues that a dataset's legal risk cannot be accurately assessed by its license terms alone. It argues that tracking dataset redistribution and its full lifecycle is essential. We show that AI can perform these tasks with higher accuracy, efficiency, and cost-effectiveness than human experts.
arXiv Detail & Related papers (2025-03-04T16:57:53Z)
Fundamental Risks in the Current Deployment of General-Purpose AI Models: What Have We (Not) Learnt From Cybersecurity? [60.629883024152576]
Large Language Models (LLMs) have seen rapid deployment in a wide range of use cases. OpenAIs Altera are just a few examples of increased autonomy, data access, and execution capabilities. These methods come with a range of cybersecurity challenges.
arXiv Detail & Related papers (2024-12-19T14:44:41Z)
Consent in Crisis: The Rapid Decline of the AI Data Commons [74.68176012363253]
General-purpose artificial intelligence (AI) systems are built on massive swathes of public web data. We conduct the first, large-scale, longitudinal audit of the consent protocols for the web domains underlying AI training corpora.
arXiv Detail & Related papers (2024-07-20T16:50:18Z)
Near to Mid-term Risks and Opportunities of Open-Source Generative AI [94.06233419171016]
Applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks and resulted in calls for tighter regulation. This regulation is likely to put at risk the budding field of open-source Generative AI.
arXiv Detail & Related papers (2024-04-25T21:14:24Z)
On the Standardization of Behavioral Use Clauses and Their Adoption for Responsible Licensing of AI [27.748532981456464]
In 2018, licenses with behavioral-use clauses were proposed to give developers a framework for releasing AI assets. As of the end of 2023, on the order of 40,000 software and model repositories have adopted responsible AI licenses.
arXiv Detail & Related papers (2024-02-07T22:29:42Z)
Catch the Butterfly: Peeking into the Terms and Conflicts among SPDX Licenses [16.948633594354412]
The use of third-party libraries (TPLs) in software development has accelerated the creation of modern software. Developers may inadvertently violate the licenses of TPLs, leading to legal issues. There is a need for a high-quality license dataset that encompasses a broad range of mainstream licenses.
arXiv Detail & Related papers (2024-01-19T11:27:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.