Related papers: Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement

Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement

URL: http://arxiv.org/abs/2306.16307v2
Date: Wed, 20 Dec 2023 14:28:50 GMT
Title: Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement
Authors: Kai Gao, Runzhi He, Bing Xie, Minghui Zhou
Abstract summary: Deep learning (DL) package supply chains are critical for DL frameworks to remain competitive. We analyze the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs. Our study provides rich implications on the maintenance and dependency management practices of PyPI DL SCs.
Score: 14.938727013935654
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Deep learning (DL) package supply chains (SCs) are critical for DL frameworks to remain competitive. However, vital knowledge on the nature of DL package SCs is still lacking. In this paper, we explore the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs to bridge this knowledge gap. We analyze the metadata of nearly six million PyPI package distributions and construct version-sensitive SCs for two popular DL frameworks: TensorFlow and PyTorch. We find that popular packages (measured by the number of monthly downloads) in the two SCs cover 34 domains belonging to eight categories. Applications, Infrastructure, and Sciences categories account for over 85% of popular packages in either SC and TensorFlow and PyTorch SC have developed specializations on Infrastructure and Applications packages respectively. We employ the Leiden community detection algorithm and detect 131 and 100 clusters in the two SCs. The clusters mainly exhibit four shapes: Arrow, Star, Tree, and Forest with increasing dependency complexity. Most clusters are Arrow or Star, but Tree and Forest clusters account for most packages (Tensorflow SC: 70%, PyTorch SC: 90%). We identify three groups of reasons why packages disengage from the SC (i.e., remove the DL framework and its dependents from their installation dependencies): dependency issues, functional improvements, and ease of installation. The most common disengagement reason in the two SCs are different. Our study provides rich implications on the maintenance and dependency management practices of PyPI DL SCs.

Related papers

Why Authors and Maintainers Link (or Don't Link) Their PyPI Libraries to Code Repositories and Donation Platforms [83.16077040470975]
Metadata of libraries on the Python Package Index (PyPI) plays a critical role in supporting the transparency, trust, and sustainability of open-source libraries.<n>This paper presents a large-scale empirical study combining two targeted surveys sent to 50,000 PyPI authors and maintainers.<n>We analyze more than 1,400 responses using large language model (LLM)-based topic modeling to uncover key motivations and barriers related to linking repositories and donation platforms.
arXiv Detail & Related papers (2026-01-21T16:13:57Z)
How Deep Does Your Dependency Tree Go? An Empirical Study of Dependency Amplification Across 10 Package Ecosystems [0.0]
We study 500 projects across ten major ecosystems, including Maven, npm Registry, RubyGems, Go Modules, PyPI, CocoaPods, and Pub.<n>We find significant differences with large effect sizes in 22 of 45 pairwise comparisons, challenging the assumption that npm has the highest amplification.<n>Our findings suggest adopting ecosystem specific security strategies, including systematic auditing for Maven environments, targeted outlier detection for npm and RubyGems, and continuation of current practices for ecosystems with controlled amplification.
arXiv Detail & Related papers (2025-12-12T05:53:32Z)
PyPitfall: Dependency Chaos and Software Supply Chain Vulnerabilities in Python [1.2644387713029346]
This paper introduces PyPitfall, a quantitative analysis of vulnerable dependencies across the PyPI ecosystem.<n>We analyzed the dependency structures of 378,573 PyPI packages and identified 4,655 packages that explicitly require at least one known-vulnerable version.<n>We aim to raise awareness of Python software supply chain security by characterizing the ecosystem-wide dependency landscape.
arXiv Detail & Related papers (2025-07-24T03:58:18Z)
Analyzing the Usage of Donation Platforms for PyPI Libraries [91.97201077607862]
This study analyzes the adoption of donation platforms in the PyPI ecosystem. GitHub Sponsors is the dominant platform, though many PyPI-listed links are outdated.
arXiv Detail & Related papers (2025-03-11T10:27:31Z)
Rethinking Reuse in Dependency Supply Chains: Initial Analysis of NPM packages at the End of the Chain [2.4969046521751768]
This paper advocates for a shift in software development practices toward minimizing reliance on third-party packages. We find that these end-of-chain packages offer unique insights, as they play a key role in the ecosystem.
arXiv Detail & Related papers (2025-03-04T17:26:34Z)
A First Look at Package-to-Group Mechanism: An Empirical Study of the Linux Distributions [20.491275902894273]
A package-to-group mechanism (P2G) is employed to enable unified installation, uninstallation, and updates of multiple packages at once. This paper takes Linux distributions as a case study and presents an empirical study focusing on its application trends, evolutionary patterns, group quality, and developer tendencies.
arXiv Detail & Related papers (2024-10-14T03:48:20Z)
An Overview and Catalogue of Dependency Challenges in Open Source Software Package Registries [52.23798016734889]
This article provides a catalogue of dependency-related challenges that come with relying on OSS packages or libraries. The catalogue is based on the scientific literature on empirical research that has been conducted to understand, quantify and overcome these challenges.
arXiv Detail & Related papers (2024-09-27T16:20:20Z)
Analyzing the Accessibility of GitHub Repositories for PyPI and NPM Libraries [91.97201077607862]
Industrial applications heavily rely on open-source software (OSS) libraries, which provide various benefits. To monitor the activities of such communities, a comprehensive list of repositories for the libraries of an ecosystem must be accessible. In this study, we analyze the accessibility of GitHub repositories for PyPI and NPM libraries.
arXiv Detail & Related papers (2024-04-26T13:27:04Z)
DONAPI: Malicious NPM Packages Detector using Behavior Sequence Knowledge Mapping [28.852274185512236]
npm is the most extensive package manager, hosting more than 2 million third-party open-source packages. In this paper, we synchronize a local package cache containing more than 3.4 million packages in near real-time to give us access to more package code details. We propose the DONAPI, an automatic malicious npm packages detector that combines static and dynamic analysis.
arXiv Detail & Related papers (2024-03-13T08:38:21Z)
Three Heads Are Better Than One: Complementary Experts for Long-Tailed Semi-supervised Learning [74.44500692632778]
We propose a novel method named ComPlementary Experts (CPE) to model various class distributions. CPE achieves state-of-the-art performances on CIFAR-10-LT, CIFAR-100-LT, and STL-10-LT dataset benchmarks.
arXiv Detail & Related papers (2023-12-25T11:54:07Z)
Less is More? An Empirical Study on Configuration Issues in Python PyPI Ecosystem [38.44692482370243]
Python is widely used in the open-source community, largely owing to the extensive support from diverse third-party libraries. Third-party libraries can potentially lead to conflicts in dependencies, prompting researchers to develop dependency conflict detectors. endeavors have been made to automatically infer dependencies.
arXiv Detail & Related papers (2023-10-19T09:07:51Z)
PyPOTS: A Python Toolbox for Data Mining on Partially-Observed Time Series [0.0]
PyPOTS is an open-source Python library dedicated to data mining and analysis on partially-observed time series. It provides easy access to diverse algorithms categorized into four tasks: imputation, classification, clustering, and forecasting.
arXiv Detail & Related papers (2023-05-30T07:57:05Z)
DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a python software package for analysing and characterising high-dimensional data. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
arXiv Detail & Related papers (2022-05-04T08:41:59Z)
Pack Together: Entity and Relation Extraction with Levitated Marker [61.232174424421025]
We propose a novel span representation approach, named Packed Levitated Markers, to consider the dependencies between the spans (pairs) by strategically packing the markers in the encoder. Our experiments show that our model with packed levitated markers outperforms the sequence labeling model by 0.4%-1.9% F1 on three flat NER tasks, and beats the token concat model on six NER benchmarks.
arXiv Detail & Related papers (2021-09-13T15:38:13Z)
An Empirical Analysis of the R Package Ecosystem [0.0]
We analyze more than 25,000 packages, 150,000 releases, and 15 million files across two decades. We find that the historical growth of the ecosystem has been robust under all measures.
arXiv Detail & Related papers (2021-02-19T12:55:18Z)
Superiority of Simplicity: A Lightweight Model for Network Device Workload Prediction [58.98112070128482]
We propose a lightweight solution for series prediction based on historic observations. It consists of a heterogeneous ensemble method composed of two models - a neural network and a mean predictor. It achieves an overall $R2$ score of 0.10 on the available FedCSIS 2020 challenge dataset.
arXiv Detail & Related papers (2020-07-07T15:44:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.