The ML Supply Chain in the Era of Software 2.0: Lessons Learned from Hugging Face
- URL: http://arxiv.org/abs/2502.04484v1
- Date: Thu, 06 Feb 2025 20:17:05 GMT
- Title: The ML Supply Chain in the Era of Software 2.0: Lessons Learned from Hugging Face
- Authors: Trevor Stalnaker, Nathan Wintersgill, Oscar Chaparro, Laura A. Heymann, Massimiliano Di Penta, Daniel M German, Denys Poshyvanyk
- Abstract summary: We conduct an extensive analysis of 760,460 models and 175,000 datasets mined from the popular model-sharing site Hugging Face. We evaluate the current state of documentation in the Hugging Face supply chain, report real-world examples of shortcomings, and offer actionable suggestions for improvement. Our results motivate multiple research avenues, including the need for better license management for ML models/datasets.
- Score: 10.531612371200625
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The last decade has seen widespread adoption of Machine Learning (ML) components in software systems. This has occurred in nearly every domain, from natural language processing to computer vision. These ML components range from relatively simple neural networks to complex and resource-intensive large language models. However, despite this widespread adoption, little is known about the supply chain relationships that produce these models, which can have implications for compliance and security. In this work, we conduct an extensive analysis of 760,460 models and 175,000 datasets mined from the popular model-sharing site Hugging Face. First, we evaluate the current state of documentation in the Hugging Face supply chain, report real-world examples of shortcomings, and offer actionable suggestions for improvement. Next, we analyze the underlying structure of the extant supply chain. Finally, we explore the current licensing landscape against what was reported in prior work and discuss the unique challenges posed in this domain. Our results motivate multiple research avenues, including the need for better license management for ML models/datasets, better support for model documentation, and automated inconsistency checking and validation. We make our research infrastructure and dataset available to facilitate future research.
Related papers
- Designing a reliable lateral movement detector using a graph foundation model [0.0]
Foundation models have recently emerged as a new paradigm in machine learning (ML).
These models are pre-trained on large and diverse datasets and can subsequently be applied to various downstream tasks with little or no retraining.
We study the usability of graph foundation models (GFMs) in cybersecurity through the lens of one specific use case, namely lateral movement detection.
arXiv Detail & Related papers (2025-04-18T07:39:21Z)
- Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo [90.78001821963008]
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints.
We develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC).
Our system builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language.
arXiv Detail & Related papers (2025-04-17T17:49:40Z)
- LEMUR Neural Network Dataset: Towards Seamless AutoML [34.04248949660201]
We introduce LEMUR, an open source dataset of neural network models with well-structured code for diverse architectures.
LEMUR is primarily designed to enable fine-tuning of large language models for automated machine learning tasks.
LEMUR will be released as an open source project under the MIT license upon acceptance of the paper.
arXiv Detail & Related papers (2025-04-14T09:08:00Z)
- SoK: Dataset Copyright Auditing in Machine Learning Systems [23.00196984807359]
This paper reviews current dataset copyright auditing tools, examining their effectiveness and limitations.
We categorize dataset copyright auditing research into two prominent strands: intrusive methods and non-intrusive methods.
To summarize our results, we offer detailed reference tables, highlight key points, and pinpoint unresolved issues in the current literature.
arXiv Detail & Related papers (2024-10-22T02:06:38Z)
- On-Device Language Models: A Comprehensive Review [26.759861320845467]
This review examines the challenges of deploying computationally expensive large language models on resource-constrained devices.
The paper investigates on-device language models, their efficient architectures, and state-of-the-art compression techniques.
Case studies of on-device language models from major mobile manufacturers demonstrate real-world applications and potential benefits.
arXiv Detail & Related papers (2024-08-26T03:33:36Z)
- Model Merging in LLMs, MLLMs, and Beyond: Methods, Theories, Applications and Opportunities [89.40778301238642]
Model merging is an efficient empowerment technique in the machine learning community.
There is a significant gap in the literature regarding a systematic and thorough review of these techniques.
arXiv Detail & Related papers (2024-08-14T16:58:48Z)
- Large Language Model for Verilog Generation with Golden Code Feedback [29.135207235743795]
This study introduces a novel approach utilizing reinforcement learning with golden code feedback to enhance the performance of pre-trained models.
We have achieved state-of-the-art (SOTA) results with a substantial margin. Notably, our 6.7B parameter model demonstrates superior performance compared to current best-in-class 13B and 16B models.
arXiv Detail & Related papers (2024-07-21T11:25:21Z)
- Retrieval-Enhanced Machine Learning: Synthesis and Opportunities [60.34182805429511]
Retrieval-enhancement can be extended to a broader spectrum of machine learning (ML).
This work introduces a formal framework for this paradigm, Retrieval-Enhanced Machine Learning (REML), by synthesizing the literature across ML domains with consistent notation, which is missing from the current literature.
The goal of this work is to equip researchers across various disciplines with a comprehensive, formally structured framework of retrieval-enhanced models, thereby fostering interdisciplinary future research.
arXiv Detail & Related papers (2024-07-17T20:01:21Z)
- Language Models as a Service: Overview of a New Paradigm and its Challenges [47.75762014254756]
Some of the most powerful language models currently are proprietary systems, accessible only via (typically restrictive) web or programming interfaces.
This paper has two goals: on the one hand, we delineate how the aforementioned challenges act as impediments to the accessibility, replicability, reliability, and trustworthiness of LM interfaces.
On the other hand, it serves as a comprehensive resource for existing knowledge on current, major LMs, offering a synthesized overview of the licences and capabilities their interfaces offer.
arXiv Detail & Related papers (2023-09-28T16:29:52Z)
- Machine Learning for QoS Prediction in Vehicular Communication: Challenges and Solution Approaches [46.52224306624461]
We consider maximum throughput prediction enhancing, for example, streaming or high-definition mapping applications.
We highlight how confidence can be built on machine learning technologies by better understanding the underlying characteristics of the collected data.
We use explainable AI to show that machine learning can learn underlying principles of wireless networks without being explicitly programmed.
arXiv Detail & Related papers (2023-02-23T12:29:20Z)
- Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn more and more attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep learning and pre-training work in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network, and knowledge enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z)
- A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often 'remember' the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.