Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source Projects
- URL: http://arxiv.org/abs/2509.06085v1
- Date: Sun, 07 Sep 2025 15:00:22 GMT
- Title: Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source Projects
- Authors: Jerin Yasmin, Wenxin Jiang, James C. Davis, Yuan Tian
- Abstract summary: Pre-trained models (PTMs) are machine learning models that have been trained in advance, often on large-scale data, and can be reused for new tasks. Their widespread adoption introduces a new class of software dependency, which we term Software Dependencies 2.0.
- Score: 9.22889135297242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained models (PTMs) are machine learning models that have been trained in advance, often on large-scale data, and can be reused for new tasks, thereby reducing the need for costly training from scratch. Their widespread adoption introduces a new class of software dependency, which we term Software Dependencies 2.0, extending beyond conventional libraries to learned behaviors embodied in trained models and their associated artifacts. The integration of PTMs as software dependencies in real projects remains unclear, potentially threatening maintainability and reliability of modern software systems that increasingly rely on them. Objective: In this study, we investigate Software Dependencies 2.0 in open-source software (OSS) projects by examining the reuse of PTMs, with a focus on how developers manage and integrate these models. Specifically, we seek to understand: (1) how OSS projects structure and document their PTM dependencies; (2) what stages and organizational patterns emerge in the reuse pipelines of PTMs within these projects; and (3) the interactions among PTMs and other learned components across pipeline stages. We conduct a mixed-methods analysis of a statistically significant random sample of 401 GitHub repositories from the PeaTMOSS dataset (28,575 repositories reusing PTMs from Hugging Face and PyTorch Hub). We quantitatively examine PTM reuse by identifying patterns and qualitatively investigate how developers integrate and manage these models in practice.
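To make Software Dependencies 2.0 concrete, here is a minimal sketch (not taken from the paper) of how such a dependency typically enters an OSS project: the model is referenced by name in code through the Hugging Face `transformers` API, and the `revision` argument can pin a specific snapshot, analogous to version-pinning a conventional library.

```python
# Minimal sketch of a "Software Dependency 2.0": the dependency is a trained
# model fetched from a registry at runtime, not a library in requirements.txt.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    revision="main",  # effectively unpinned; a commit hash would pin it
)
print(classifier("PTM reuse reduces the need for training from scratch."))
```

Unlike a pinned library version, the model identifier alone says nothing about the learned behavior it resolves to, which is part of what makes these dependencies hard to document and maintain.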
Related papers
- Forecasting the Maintained Score from the OpenSSF Scorecard for GitHub Repositories linked to PyPI libraries [78.48200143057376]
We study to what extent future maintenance activity, as captured by the OpenSSF Maintained score, can be forecasted. We analyze 3,220 GitHub repositories associated with the top 1% most central PyPI libraries by PageRank. Our results show that future maintenance activity can be predicted with meaningful accuracy.
arXiv Detail & Related papers (2026-01-26T10:32:54Z) - How do Pre-Trained Models Support Software Engineering? An Empirical Study in Hugging Face [52.257764273141184]
Open-Source Pre-Trained Models (PTMs) provide extensive resources for various Machine Learning (ML) tasks. These resources lack a classification tailored to Software Engineering (SE) needs. We derive a taxonomy encompassing 147 SE tasks and apply an SE-oriented classification to PTMs in a popular open-source ML repository, Hugging Face (HF). We find that code generation is the most common SE task among PTMs, while requirements engineering and software design activities receive limited attention.
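As a quick illustration (our own sketch, not the paper's method), models on the Hub can be enumerated by task tag with the public `huggingface_hub` client; Hub tags such as `text-generation` are only coarse proxies for the 147-task SE taxonomy the authors derive.

```python
# Hedged sketch: enumerate popular Hub models carrying one task tag.
from huggingface_hub import list_models

for m in list_models(task="text-generation", sort="downloads", limit=5):
    print(m.id)
```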
arXiv Detail & Related papers (2025-06-03T15:51:17Z) - Exploring the Lifecycle and Maintenance Practices of Pre-Trained Models in Open-Source Software Repositories [1.3757201415751368]
Pre-trained models (PTMs) are becoming a common component in open-source software (OSS) development. This report presents a plan for an exploratory study to investigate how PTMs are utilized, maintained, and tested in OSS projects.
arXiv Detail & Related papers (2025-04-08T13:41:13Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
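One simple instance of external TTC (a generic sketch under our own assumptions, not the paper's framework) is best-of-N sampling: spend more inference-time compute by drawing several candidate solutions from the same model and keeping the best-scoring one. `generate` and `score` below are hypothetical placeholders.

```python
# Hedged sketch of external test-time compute: best-of-N selection.
# "generate" and "score" are hypothetical callables, e.g. an LLM sampler
# and a patch-verification heuristic.
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str], float],
              prompt: str, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]  # more compute, same model
    return max(candidates, key=score)                  # keep the best candidate
```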
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks. We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
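A toy version of that reformulation (our own sketch, not the paper's LLM-based setup): treat each whole sample as one input and regress a distribution-level target such as the standard deviation.

```python
# Toy meta-statistical learning: map a set of draws to a distribution-level
# quantity (here sigma) using ordinary supervised learning.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
sigmas = rng.uniform(0.5, 3.0, 5000)                    # per-sample ground truth
samples = rng.normal(0.0, sigmas[:, None], (5000, 50))  # 5000 samples of 50 draws
X = np.sort(samples, axis=1)                            # permutation-invariant input

model = RandomForestRegressor(n_estimators=100).fit(X, sigmas)
test = np.sort(rng.normal(0.0, 2.0, (1, 50)), axis=1)
print(model.predict(test))                              # should be close to 2.0
```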
arXiv Detail & Related papers (2025-02-17T18:04:39Z) - Towards a Classification of Open-Source ML Models and Datasets for Software Engineering [52.257764273141184]
Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks.
These resources lack a classification tailored to Software Engineering (SE) needs.
We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
arXiv Detail & Related papers (2024-11-14T18:52:05Z) - PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software [6.243303627949341]
This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs.
The dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use.
Our analysis provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation.
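PeaTMOSS is distributed as a SQLite database; below is a hedged sketch of the kind of query its PTM-to-repository mappings enable. The table and column names are illustrative assumptions, not the dataset's documented schema.

```python
# Hedged sketch: rank PTMs by how many downstream GitHub repositories use them.
# Schema names here are assumptions for illustration only.
import sqlite3

conn = sqlite3.connect("peatmoss.db")  # local path to the dataset, assumed
rows = conn.execute(
    """
    SELECT m.name, COUNT(*) AS n_repos
    FROM model AS m
    JOIN model_to_repository AS mr ON mr.model_id = m.id
    JOIN repository AS r ON r.id = mr.repository_id
    GROUP BY m.name
    ORDER BY n_repos DESC
    LIMIT 10
    """
).fetchall()
for name, n in rows:
    print(f"{name}: reused by {n} repositories")
```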
arXiv Detail & Related papers (2024-02-01T15:55:50Z) - Continual Learning with Pre-Trained Models: A Survey [61.97613090666247]
Continual Learning (CL) aims to overcome catastrophic forgetting of previously acquired knowledge when learning new tasks.
This paper presents a comprehensive survey of the latest advancements in PTM-based CL.
arXiv Detail & Related papers (2024-01-29T18:27:52Z) - PeaTMOSS: Mining Pre-Trained Models in Open-Source Software [6.243303627949341]
We present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software.
PeaTMOSS has three parts: (1) a snapshot of 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them.
arXiv Detail & Related papers (2023-10-05T15:58:45Z) - ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse [59.500060790983994]
This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend.
ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse, encompassing target architecture construction with PTM, tuning target model with PTM, and PTM-based inference.
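For context (a generic PyTorch sketch, not ZhiJian's own API), "tuning a target model with a PTM" commonly looks like freezing a pre-trained backbone and training a fresh task head:

```python
# Generic PyTorch sketch of one reuse mode ZhiJian unifies: tune a target
# model initialized from a PTM. This does not use ZhiJian's API.
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

backbone = resnet18(weights=ResNet18_Weights.DEFAULT)   # load the PTM weights
for p in backbone.parameters():
    p.requires_grad = False                             # freeze learned behavior
backbone.fc = nn.Linear(backbone.fc.in_features, 10)    # new head for 10 classes
# Downstream training then updates only backbone.fc.
```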
arXiv Detail & Related papers (2023-08-17T19:12:13Z) - An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry [2.1346819928536687]
Machine learning engineers have begun to reuse large-scale pre-trained models (PTMs).
We interviewed 12 practitioners from the most popular PTM ecosystem, Hugging Face, to learn the practices and challenges of PTM reuse.
Three challenges for PTM reuse are missing attributes, discrepancies between claimed and actual performance, and model risks.
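The "missing attributes" challenge can be made concrete with the public `huggingface_hub` API (our example, not the paper's): metadata a reuser needs is often simply absent from the model record.

```python
# Hedged sketch: inspect what a PTM's registry record actually declares.
from huggingface_hub import model_info

info = model_info("bert-base-uncased")
print(info.pipeline_tag)  # declared task; can be None
print(info.card_data)     # model-card metadata; fields are often missing
```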
arXiv Detail & Related papers (2023-03-05T02:28:15Z)