PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in
Open-Source Software
- URL: http://arxiv.org/abs/2402.00699v1
- Date: Thu, 1 Feb 2024 15:55:50 GMT
- Title: PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in
Open-Source Software
- Authors: Wenxin Jiang, Jerin Yasmin, Jason Jones, Nicholas Synovic, Jiashen
Kuo, Nathaniel Bielanski, Yuan Tian, George K. Thiruvathukal, James C. Davis
- Abstract summary: This paper presents the PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs).
The dataset includes 44,337 mappings from 15,129 downstream GitHub repositories to the 2,530 PTMs they use.
Our analysis provides the first summary statistics for the PTM supply chain, showing the trend of PTM development and common shortcomings of PTM package documentation.
- Score: 6.243303627949341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development and training of deep learning models have become increasingly
costly and complex. Consequently, software engineers are adopting pre-trained
models (PTMs) for their downstream applications. The dynamics of the PTM supply
chain remain largely unexplored, signaling a clear need for structured datasets
that document not only the metadata but also the subsequent applications of
these models. Without such data, the MSR community cannot comprehensively
understand the impact of PTM adoption and reuse. This paper presents the
PeaTMOSS dataset, which comprises metadata for 281,638 PTMs and detailed
snapshots for all PTMs with over 50 monthly downloads (14,296 PTMs), along with
28,575 open-source software repositories from GitHub that utilize these models.
Additionally, the dataset includes 44,337 mappings from 15,129 downstream
GitHub repositories to the 2,530 PTMs they use. To enhance the dataset's
comprehensiveness, we developed prompts for a large language model to
automatically extract model metadata, including the model's training datasets,
parameters, and evaluation metrics. Our analysis of this dataset provides the
first summary statistics for the PTM supply chain, showing the trend of PTM
development and common shortcomings of PTM package documentation. Our example
application reveals inconsistencies in software licenses across PTMs and their
dependent projects. PeaTMOSS lays the foundation for future research, offering
rich opportunities to investigate the PTM supply chain. We outline mining
opportunities on PTMs, their downstream usage, and cross-cutting questions.
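To make the prompt-based metadata extraction described above concrete, the following sketch shows one plausible shape of that step: build an extraction prompt from a model card's README and parse a structured JSON reply. The prompt wording, the JSON field names, and the `call_llm` callback are illustrative assumptions, not the prompts released with PeaTMOSS.

```python
import json

# Hypothetical sketch: the prompt wording and field names are assumptions,
# not the actual prompts shipped with the PeaTMOSS dataset.
EXTRACTION_PROMPT = """You are given the README of a pre-trained model package.
Extract the following metadata and answer with JSON only, using null for
fields that are not stated: training_datasets (list of strings),
parameter_count (string), evaluation_metrics (list of {{metric, value}}).

README:
{readme}
"""

def extract_metadata(readme_text: str, call_llm) -> dict:
    """Ask an LLM to pull structured metadata out of a model card.

    `call_llm` is a placeholder for whatever completion API is available;
    it takes a prompt string and returns the model's text reply.
    """
    reply = call_llm(EXTRACTION_PROMPT.format(readme=readme_text))
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Model cards vary widely; fall back to an empty record rather than fail.
        return {"training_datasets": None,
                "parameter_count": None,
                "evaluation_metrics": None}
```

In PeaTMOSS this extraction is applied at scale to PTM package documentation; the sketch only illustrates the general prompt-then-parse pattern.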
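The license-inconsistency application can likewise be pictured as a join over the PTM-to-repository mapping. The sketch below assumes a simplified SQLite layout with hypothetical `model`, `repository`, and `model_to_repository` tables, each with a `license` column; the table and column names, and the `peatmoss.db` file name, are assumptions and may not match the released database schema.

```python
import sqlite3

# Hypothetical, simplified schema; names are assumptions and may differ
# from the actual PeaTMOSS database.
QUERY = """
SELECT m.name      AS ptm,
       m.license   AS ptm_license,
       r.full_name AS repo,
       r.license   AS repo_license
FROM model_to_repository mr
JOIN model      m ON m.id = mr.model_id
JOIN repository r ON r.id = mr.repository_id
WHERE m.license IS NOT NULL
  AND r.license IS NOT NULL
  AND m.license <> r.license;
"""

def find_license_mismatches(db_path: str):
    """Yield (PTM, PTM license, repo, repo license) rows where the licenses differ."""
    with sqlite3.connect(db_path) as conn:
        yield from conn.execute(QUERY)

if __name__ == "__main__":
    for row in find_license_mismatches("peatmoss.db"):
        print(row)
```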
Related papers
- Automated categorization of pre-trained models for software engineering: A case study with a Hugging Face dataset [9.218130273952383]
Software engineering activities have been revolutionized by the advent of pre-trained models (PTMs).
The Hugging Face (HF) platform simplifies the use of PTMs by collecting, storing, and curating several models.
This paper introduces an approach to enable the automatic classification of PTMs for SE tasks.
arXiv Detail & Related papers (2024-05-21T20:26:17Z) - PeaTMOSS: Mining Pre-Trained Models in Open-Source Software [6.243303627949341]
We present the PeaTMOSS dataset: Pre-Trained Models in Open-Source Software.
PeaTMOSS has three parts: (1) a snapshot of 281,638 PTMs, (2) 27,270 open-source software repositories that use PTMs, and (3) a mapping between PTMs and the projects that use them.
arXiv Detail & Related papers (2023-10-05T15:58:45Z) - Efficient Federated Prompt Tuning for Black-box Large Pre-trained Models [62.838689691468666]
We propose Federated Black-Box Prompt Tuning (Fed-BBPT) to optimally harness each local dataset.
Fed-BBPT capitalizes on a central server that aids local users in collaboratively training a prompt generator through regular aggregation.
Relative to extensive fine-tuning, Fed-BBPT proficiently sidesteps memory challenges tied to PTM storage and fine-tuning on local machines.
arXiv Detail & Related papers (2023-10-04T19:30:49Z) - A Survey on Time-Series Pre-Trained Models [37.0932706268589]
Time-Series Mining (TSM) is an important research area since it shows great potential in practical applications.
Deep learning models that rely on massive labeled data have been utilized for TSM successfully.
Recently, pre-trained models have gradually attracted attention in the time series domain due to their remarkable performance in computer vision and natural language processing.
arXiv Detail & Related papers (2023-05-18T05:27:46Z) - Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need [84.3507610522086]
Class-incremental learning (CIL) aims to adapt to emerging new classes without forgetting old ones.
Recent pre-training has achieved substantial progress, making vast pre-trained models (PTMs) accessible for CIL.
We argue that the core factors in CIL are adaptivity for model updating and generalizability for knowledge transferring.
arXiv Detail & Related papers (2023-03-13T17:59:02Z) - An Empirical Study of Pre-Trained Model Reuse in the Hugging Face Deep
Learning Model Registry [2.1346819928536687]
Machine learning engineers have begun to reuse large-scale pre-trained models (PTMs).
We interviewed 12 practitioners from the most popular PTM ecosystem, Hugging Face, to learn the practices and challenges of PTM reuse.
Three challenges for PTM reuse are missing attributes, discrepancies between claimed and actual performance, and model risks.
arXiv Detail & Related papers (2023-03-05T02:28:15Z) - Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey [66.18478838828231]
Multi-modal pre-trained big models have drawn more and more attention in recent years.
This paper introduces the background of multi-modal pre-training by reviewing conventional deep pre-training work in natural language processing, computer vision, and speech.
Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network, and knowledge enhanced pre-training.
arXiv Detail & Related papers (2023-02-20T15:34:03Z) - ZooD: Exploiting Model Zoo for Out-of-Distribution Generalization [65.58562481279023]
We propose ZooD, a paradigm for PTMs ranking and ensemble with feature selection.
We evaluate our paradigm on a diverse model zoo consisting of 35 models for various Out-of-Distribution (OoD) tasks.
arXiv Detail & Related papers (2022-10-17T16:31:57Z) - Ranking and Tuning Pre-trained Models: A New Paradigm of Exploiting
Model Hubs [136.4492678691406]
We propose a new paradigm of exploiting model hubs by ranking and tuning pre-trained models.
The best ranked PTM can be fine-tuned and deployed if we have no preference for the model's architecture.
The tuning part introduces a novel method for multiple PTMs tuning, which surpasses dedicated methods.
arXiv Detail & Related papers (2021-10-20T12:59:23Z) - Pre-Trained Models: Past, Present and Future [126.21572378910746]
Large-scale pre-trained models (PTMs) have recently achieved great success and become a milestone in the field of artificial intelligence (AI).
By storing knowledge in huge parameter sets and fine-tuning on specific tasks, PTMs allow the rich knowledge implicitly encoded in those parameters to benefit a variety of downstream tasks.
It is now the consensus of the AI community to adopt PTMs as backbone for downstream tasks rather than learning models from scratch.
arXiv Detail & Related papers (2021-06-14T02:40:32Z)