PeaTMOSS: Mining Pre-Trained Models in Open-Source Software
- URL: http://arxiv.org/abs/2310.03620v1
- Date: Thu, 5 Oct 2023 15:58:45 GMT
- Title: PeaTMOSS: Mining Pre-Trained Models in Open-Source Software
- Authors: Wenxin Jiang, Jason Jones, Jerin Yasmin, Nicholas Synovic, Rajeev
Sashti, Sophie Chen, George K. Thiruvathukal, Yuan Tian, James C. Davis
- Abstract summary: We present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software.
PeaTMOSS has three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them.
- Score: 6.243303627949341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing and training deep learning models is expensive, so software
engineers have begun to reuse pre-trained deep learning models (PTMs) and
fine-tune them for downstream tasks. Despite the widespread use of PTMs, we
know little about the corresponding software engineering behaviors and
challenges.
To enable the study of software engineering with PTMs, we present the
PeaTMOSS dataset: Pre-Trained Models in Open-Source Software. PeaTMOSS has
three parts: a snapshot of (1) 281,638 PTMs, (2) 27,270 open-source software
repositories that use PTMs, and (3) a mapping between PTMs and the projects
that use them. We challenge PeaTMOSS miners to discover software engineering
practices around PTMs. A demo and link to the full dataset are available at:
https://github.com/PurdueDualityLab/PeaTMOSS-Demos.
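The heart of the dataset is part (3), the mapping between PTMs and the downstream repositories that use them. As a rough illustration only, the sketch below queries such a mapping as if it were shipped as a SQLite database; the file name and the table and column names are assumptions made for this example, not the documented PeaTMOSS schema (see the demo repository above for the actual interface).

    # Minimal sketch: list the PTMs reused by the most downstream repositories.
    # The database path and the table/column names below are hypothetical.
    import sqlite3

    conn = sqlite3.connect("peatmoss.db")  # assumed local copy of the dataset

    query = """
    SELECT m.name AS ptm_name, COUNT(DISTINCT r.id) AS downstream_repos
    FROM model m
    JOIN model_to_repository mr ON mr.model_id = m.id
    JOIN repository r ON r.id = mr.repository_id
    GROUP BY m.name
    ORDER BY downstream_repos DESC
    LIMIT 10;
    """

    for name, n_repos in conn.execute(query):
        print(f"{name}: reused by {n_repos} repositories")

    conn.close()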
Related papers
- PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software [6.243303627949341]
This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs.
The dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use.
Our analysis provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation.
arXiv Detail & Related papers (2024-02-01T15:55:50Z)
- TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We harness large language models (LLMs) to solve this task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
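As a hedged sketch of this recipe (not the authors' pipeline): render expert-annotated examples into instruction-style training text and fine-tune a causal language model on it. The checkpoint name, prompt template, and toy example below are placeholders; the LLaMA 2 weights are gated, and any causal LM would serve for the illustration.

    # Sketch: fine-tune a causal LM on automatically rendered QA examples.
    from datasets import Dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "meta-llama/Llama-2-7b-hf"  # gated; placeholder checkpoint
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Hypothetical example auto-generated from an expert-annotated table-QA dataset.
    examples = [{"question": "What is the year-over-year revenue change?",
                 "reasoning": "120 - 100 = 20",
                 "answer": "20"}]

    def render(ex):
        text = (f"Question: {ex['question']}\n"
                f"Reasoning: {ex['reasoning']}\nAnswer: {ex['answer']}")
        return tok(text, truncation=True, max_length=512)

    ds = Dataset.from_list(examples).map(
        render, remove_columns=["question", "reasoning", "answer"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="tat-llm-sketch",
                               per_device_train_batch_size=1, num_train_epochs=1),
        train_dataset=ds,
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()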
arXiv Detail & Related papers (2024-01-24T04:28:50Z)
- A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Software Engineering Tasks [29.88525311985907]
Pre-trained models (PTMs) have achieved great success in various Software Engineering (SE) downstream tasks.
A widely used solution is parameter-efficient fine-tuning (PEFT), which freezes PTMs while introducing extra parameters.
This paper aims to evaluate the effectiveness of five PEFT methods on eight PTMs and four SE downstream tasks.
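For context, the sketch below shows one widely used PEFT method (LoRA) applied to an example SE-oriented PTM: the pre-trained weights stay frozen and only a small set of extra parameters is trained. The checkpoint and hyperparameters are illustrative choices, not the configurations evaluated in the paper.

    # Sketch: parameter-efficient fine-tuning with LoRA via the peft library.
    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import AutoModelForSeq2SeqLM

    base = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")  # example PTM

    lora_cfg = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=8,              # rank of the low-rank update matrices
        lora_alpha=16,
        lora_dropout=0.1,
    )

    model = get_peft_model(base, lora_cfg)  # base weights are frozen automatically
    model.print_trainable_parameters()      # only the LoRA parameters are trainable
    # ... train `model` on a downstream SE task (e.g., code summarization) as usual.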
arXiv Detail & Related papers (2023-12-25T05:25:39Z)
- Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models [62.838689691468666]
We propose Federated Black-Box Prompt Tuning (Fed-BBPT) to optimally harness each local dataset.
Fed-BBPT capitalizes on a central server that aids local users in collaboratively training a prompt generator through regular aggregation.
Relative to extensive fine-tuning, Fed-BBPT proficiently sidesteps memory challenges tied to PTM storage and fine-tuning on local machines.
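The aggregation step can be pictured with a generic FedAvg-style sketch: each client keeps a small prompt generator, and the server periodically averages the clients' parameters. This illustrates only the general idea, not the Fed-BBPT algorithm; the PromptGenerator module and round structure are hypothetical.

    # Sketch: server-side averaging of clients' prompt-generator parameters.
    import copy

    import torch
    import torch.nn as nn

    class PromptGenerator(nn.Module):
        """Tiny stand-in for a module that produces soft prompt embeddings."""
        def __init__(self, prompt_len: int = 10, dim: int = 768):
            super().__init__()
            self.prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)

        def forward(self) -> torch.Tensor:
            return self.prompt

    def aggregate(global_gen: PromptGenerator,
                  client_gens: list[PromptGenerator]) -> None:
        """Overwrite the global generator with the mean of the clients' parameters."""
        with torch.no_grad():
            for name, param in global_gen.named_parameters():
                stacked = torch.stack(
                    [dict(c.named_parameters())[name] for c in client_gens])
                param.copy_(stacked.mean(dim=0))

    # One simulated round: clients start from the global state, update locally
    # (via black-box feedback from the PTM, omitted here), then the server aggregates.
    global_gen = PromptGenerator()
    clients = [copy.deepcopy(global_gen) for _ in range(3)]
    aggregate(global_gen, clients)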
arXiv Detail & Related papers (2023-10-04T19:30:49Z)
- Naming Practices of Pre-Trained Models in Hugging Face [4.956536094440504]
Engineers adopt Pre-Trained Models (PTMs) as components in computer systems.
Researchers publish PTMs, which engineers adapt for quality or performance prior to deployment.
Prior research has reported that model names are not always well chosen - and are sometimes erroneous.
In this paper, we frame and conduct the first empirical investigation of PTM naming practices in the Hugging Face PTM registry.
arXiv Detail & Related papers (2023-10-02T21:13:32Z)
- Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need [84.3507610522086]
Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones.
Recent pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL.
We argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring.
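One simple baseline in this spirit, sketched below, keeps a frozen PTM as a generalizable feature extractor and adapts to new classes by storing class-mean prototypes, classifying by nearest prototype. It is a generic illustration of the adaptivity/generalizability split, not the specific method proposed in the paper.

    # Sketch: class-incremental learning with a frozen PTM and class prototypes.
    import torch

    class PrototypeCIL:
        def __init__(self, feature_extractor: torch.nn.Module):
            self.extractor = feature_extractor.eval()      # frozen PTM backbone
            self.prototypes: dict[int, torch.Tensor] = {}  # class id -> mean feature

        @torch.no_grad()
        def add_classes(self, images: torch.Tensor, labels: torch.Tensor) -> None:
            """Register new classes incrementally from their training examples."""
            feats = self.extractor(images)                 # (N, D) features
            for cls in labels.unique().tolist():
                self.prototypes[cls] = feats[labels == cls].mean(dim=0)

        @torch.no_grad()
        def predict(self, images: torch.Tensor) -> torch.Tensor:
            feats = self.extractor(images)
            classes = list(self.prototypes)
            protos = torch.stack([self.prototypes[c] for c in classes])  # (C, D)
            dists = torch.cdist(feats, protos)             # (N, C) distances
            return torch.tensor(classes)[dists.argmin(dim=1)]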
arXiv Detail & Related papers (2023-03-13T17:59:02Z)
- An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep Learning Model Registry [2.1346819928536687]
Machine learning engineers have begun to reuse large-scale pre-trained models (PTMs).
We interviewed 12 practitioners from the most popular PTM ecosystem, Hugging Face, to learn the practices and challenges of PTM reuse.
Three challenges for PTM reuse are missing attributes, discrepancies between claimed and actual performance, and model risks.
arXiv Detail & Related papers (2023-03-05T02:28:15Z)
- Ranking and Tuning Pre-trained Models: A New Paradigm of Exploiting Model Hubs [136.4492678691406]
We propose a new paradigm of exploiting model hubs by ranking and tuning pre-trained models.
The best-ranked PTM can be fine-tuned and deployed when there is no preference for a particular model architecture.
The tuning part introduces a novel method for multiple PTMs tuning, which surpasses dedicated methods.
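The ranking half of this paradigm can be approximated by scoring each candidate PTM on the target data with a cheap proxy; the sketch below uses cross-validated linear-probe accuracy over pre-extracted features as a stand-in, which is not the transferability measure used in the paper.

    # Sketch: rank candidate PTMs by a linear-probe proxy on target-task features.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def rank_ptms(ptm_features: dict[str, np.ndarray],
                  labels: np.ndarray) -> list[tuple[str, float]]:
        """ptm_features maps a PTM name to its (N, D) features on the target set."""
        scores = {}
        for name, feats in ptm_features.items():
            probe = LogisticRegression(max_iter=1000)
            scores[name] = cross_val_score(probe, feats, labels, cv=3).mean()
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    # The top-ranked PTM would then be fine-tuned for deployment:
    # best_name, best_score = rank_ptms(features_by_ptm, y)[0]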
arXiv Detail & Related papers (2021-10-20T12:59:23Z)
- EasyTransfer -- A Simple and Scalable Deep Transfer Learning Platform for NLP Applications [65.87067607849757]
EasyTransfer is a platform to develop deep Transfer Learning algorithms for Natural Language Processing (NLP) applications.
EasyTransfer supports various NLP models in its ModelZoo, including mainstream pre-trained language models (PLMs) and multi-modality models.
EasyTransfer is currently deployed at Alibaba to support a variety of business scenarios.
arXiv Detail & Related papers (2020-11-18T18:41:27Z)
- The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding [97.85957811603251]
We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models.
Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks.
A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm.
arXiv Detail & Related papers (2020-02-19T03:05:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.