Characterizing Deep Learning Package Supply Chains in PyPI: Domains,
Clusters, and Disengagement
- URL: http://arxiv.org/abs/2306.16307v2
- Date: Wed, 20 Dec 2023 14:28:50 GMT
- Authors: Kai Gao, Runzhi He, Bing Xie, Minghui Zhou
- Abstract summary: Deep learning (DL) package supply chains are critical for DL frameworks to remain competitive.
We analyze the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs.
Our study provides rich implications on the maintenance and dependency management practices of PyPI DL SCs.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Deep learning (DL) package supply chains (SCs) are critical for DL frameworks
to remain competitive. However, vital knowledge on the nature of DL package SCs
is still lacking. In this paper, we explore the domains, clusters, and
disengagement of packages in two representative PyPI DL package SCs to bridge
this knowledge gap. We analyze the metadata of nearly six million PyPI package
distributions and construct version-sensitive SCs for two popular DL
frameworks: TensorFlow and PyTorch. We find that popular packages (measured by
the number of monthly downloads) in the two SCs cover 34 domains belonging to
eight categories. The Applications, Infrastructure, and Sciences categories account
for over 85% of popular packages in either SC, and the TensorFlow and PyTorch SCs
have developed specializations in Infrastructure and Applications packages,
respectively. We employ the Leiden community detection algorithm and detect 131
and 100 clusters in the two SCs. The clusters mainly exhibit four shapes:
Arrow, Star, Tree, and Forest with increasing dependency complexity. Most
clusters are Arrow or Star, but Tree and Forest clusters account for most
packages (TensorFlow SC: 70%, PyTorch SC: 90%). We identify three groups of
reasons why packages disengage from the SC (i.e., remove the DL framework and
its dependents from their installation dependencies): dependency issues,
functional improvements, and ease of installation. The most common
disengagement reasons in the two SCs differ. Our study provides rich
implications for the maintenance and dependency management practices of PyPI DL
SCs.
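As a rough illustration of the terms used in the abstract, the sketch below builds a supply chain as the set of packages that depend, directly or transitively, on a framework, and then flags packages whose latest release no longer depends on the framework or any of its dependents. This is a minimal sketch and not the authors' pipeline: the `RELEASES` data, the package names `vision-app` and `plotlib`, and the functions `supply_chain` and `disengaged` are hypothetical, and membership is simplified to "ever depended" rather than the version-sensitive construction the paper describes.

```python
from collections import deque

# Hypothetical toy metadata: package -> releases in chronological order,
# each release being (version, list of installation dependencies).
RELEASES = {
    "torch": [("2.0", [])],
    "torchvision": [("0.15", ["torch"])],
    "vision-app": [("1.0", ["torchvision"]), ("2.0", ["numpy"])],
    "numpy": [("1.26", [])],
    "plotlib": [("3.0", ["numpy"])],
}


def supply_chain(releases, framework):
    """Return the framework plus every package that depended on it,
    directly or transitively, in at least one release."""
    # Invert the dependency relation: dependency -> set of dependents.
    dependents = {}
    for pkg, versions in releases.items():
        for _version, deps in versions:
            for dep in deps:
                dependents.setdefault(dep, set()).add(pkg)

    # Breadth-first search outward from the framework over dependents.
    seen = {framework}
    queue = deque([framework])
    while queue:
        current = queue.popleft()
        for pkg in dependents.get(current, ()):
            if pkg not in seen:
                seen.add(pkg)
                queue.append(pkg)
    return seen


def disengaged(releases, framework):
    """Return SC packages whose latest release depends on neither the
    framework nor any of its dependents (simplified definition)."""
    sc = supply_chain(releases, framework)
    result = []
    for pkg in sorted(sc - {framework}):
        latest_deps = releases[pkg][-1][1]
        if not any(dep in sc for dep in latest_deps):
            result.append(pkg)
    return result


if __name__ == "__main__":
    print(sorted(supply_chain(RELEASES, "torch")))
    print(disengaged(RELEASES, "torch"))
```

In this toy data, `vision-app` joins the PyTorch SC through `torchvision` in its first release and drops that dependency in its second, which is the pattern the abstract calls disengagement.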
Related papers
- A First Look at Package-to-Group Mechanism: An Empirical Study of the Linux Distributions [20.491275902894273]
A package-to-group mechanism (P2G) is employed to enable unified installation, uninstallation, and updates of multiple packages at once.
This paper takes Linux distributions as a case study and presents an empirical study focusing on its application trends, evolutionary patterns, group quality, and developer tendencies.
arXiv Detail & Related papers (2024-10-14T03:48:20Z)
- An Overview and Catalogue of Dependency Challenges in Open Source Software Package Registries [52.23798016734889]
This article provides a catalogue of dependency-related challenges that come with relying on OSS packages or libraries.
The catalogue is based on the scientific literature on empirical research that has been conducted to understand, quantify and overcome these challenges.
arXiv Detail & Related papers (2024-09-27T16:20:20Z)
- Analyzing the Accessibility of GitHub Repositories for PyPI and NPM Libraries [91.97201077607862]
Industrial applications heavily rely on open-source software (OSS) libraries, which provide various benefits.
To monitor the activities of such communities, a comprehensive list of repositories for the libraries of an ecosystem must be accessible.
In this study, we analyze the accessibility of GitHub repositories for PyPI and NPM libraries.
arXiv Detail & Related papers (2024-04-26T13:27:04Z)
- DONAPI: Malicious NPM Packages Detector using Behavior Sequence Knowledge Mapping [28.852274185512236]
npm is the most extensive package manager, hosting more than 2 million third-party open-source packages.
In this paper, we synchronize a local package cache containing more than 3.4 million packages in near real-time to give us access to more package code details.
We propose the DONAPI, an automatic malicious npm packages detector that combines static and dynamic analysis.
arXiv Detail & Related papers (2024-03-13T08:38:21Z)
- Three Heads Are Better Than One: Complementary Experts for Long-Tailed Semi-supervised Learning [74.44500692632778]
We propose a novel method named ComPlementary Experts (CPE) to model various class distributions.
CPE achieves state-of-the-art performances on CIFAR-10-LT, CIFAR-100-LT, and STL-10-LT dataset benchmarks.
arXiv Detail & Related papers (2023-12-25T11:54:07Z)
- Less is More? An Empirical Study on Configuration Issues in Python PyPI Ecosystem [38.44692482370243]
Python is widely used in the open-source community, largely owing to the extensive support from diverse third-party libraries.
Third-party libraries can potentially lead to conflicts in dependencies, prompting researchers to develop dependency conflict detectors.
Endeavors have been made to automatically infer dependencies.
arXiv Detail & Related papers (2023-10-19T09:07:51Z)
- PyPOTS: A Python Toolbox for Data Mining on Partially-Observed Time Series [0.0]
PyPOTS is an open-source Python library dedicated to data mining and analysis on partially-observed time series.
It provides easy access to diverse algorithms categorized into four tasks: imputation, classification, clustering, and forecasting.
arXiv Detail & Related papers (2023-05-30T07:57:05Z)
- DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a Python software package for analysing and characterising high-dimensional data.
It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
arXiv Detail & Related papers (2022-05-04T08:41:59Z)
- Pack Together: Entity and Relation Extraction with Levitated Marker [61.232174424421025]
We propose a novel span representation approach, named Packed Levitated Markers, to consider the dependencies between the spans (pairs) by strategically packing the markers in the encoder.
Our experiments show that our model with packed levitated markers outperforms the sequence labeling model by 0.4%-1.9% F1 on three flat NER tasks, and beats the token concat model on six NER benchmarks.
arXiv Detail & Related papers (2021-09-13T15:38:13Z)
- An Empirical Analysis of the R Package Ecosystem [0.0]
We analyze more than 25,000 packages, 150,000 releases, and 15 million files across two decades.
We find that the historical growth of the ecosystem has been robust under all measures.
arXiv Detail & Related papers (2021-02-19T12:55:18Z)
- Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction [58.98112070128482]
We propose a lightweight solution for series prediction based on historic observations.
It consists of a heterogeneous ensemble method composed of two models - a neural network and a mean predictor.
It achieves an overall $R^2$ score of 0.10 on the available FedCSIS 2020 challenge dataset.
arXiv Detail & Related papers (2020-07-07T15:44:16Z)