Related papers: On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI

On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI

URL: http://arxiv.org/abs/2310.09571v1
Date: Sat, 14 Oct 2023 12:32:51 GMT
Title: On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI
Authors: Piergiorgio Ladisa and Serena Elisa Ponta and Nicola Ronzoni and Matias Martinez and Olivier Barais
Abstract summary: Malicious users started to spread malware by publishing open-source packages containing malicious code. Recent works apply machine learning techniques to detect malicious packages in the npm ecosystem. We present a novel approach that involves a set of language-independent features and the training of models capable of detecting malicious packages in npm and PyPI.
Score: 6.935278888313423
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Current software supply chains heavily rely on open-source packages hosted in public repositories. Given the popularity of ecosystems like npm and PyPI, malicious users started to spread malware by publishing open-source packages containing malicious code. Recent works apply machine learning techniques to detect malicious packages in the npm ecosystem. However, the scarcity of samples poses a challenge to the application of machine learning techniques in other ecosystems. Despite the differences between JavaScript and Python, the open-source software supply chain attacks targeting such languages show noticeable similarities (e.g., use of installation scripts, obfuscated strings, URLs). In this paper, we present a novel approach that involves a set of language-independent features and the training of models capable of detecting malicious packages in npm and PyPI by capturing their commonalities. This methodology allows us to train models on a diverse dataset encompassing multiple languages, thereby overcoming the challenge of limited sample availability. We evaluate the models both in a controlled experiment (where labels of data are known) and in the wild by scanning newly uploaded packages for both npm and PyPI for 10 days. We find that our approach successfully detects malicious packages for both npm and PyPI. Over an analysis of 31,292 packages, we reported 58 previously unknown malicious packages (38 for npm and 20 for PyPI), which were consequently removed from the respective repositories.

Related papers

Analyzing the Usage of Donation Platforms for PyPI Libraries [91.97201077607862]
This study analyzes the adoption of donation platforms in the PyPI ecosystem. GitHub Sponsors is the dominant platform, though many PyPI-listed links are outdated.
arXiv Detail & Related papers (2025-03-11T10:27:31Z)
DySec: A Machine Learning-based Dynamic Analysis for Detecting Malicious Packages in PyPI Ecosystem [4.045165357831481]
Malicious Python packages make software supply chains vulnerable by exploiting trust in open-source repositories like Python Package Index (PyPI) Lack of real-time behavioral monitoring makes metadata inspection and static code analysis inadequate against advanced attack strategies. We introduce DySec, a machine learning-based dynamic analysis framework for PyPI that uses eBPF kernel and user-level probes to monitor behaviors during package installation.
arXiv Detail & Related papers (2025-03-01T03:20:42Z)
PyPulse: A Python Library for Biosignal Imputation [58.35269251730328]
We introduce PyPulse, a Python package for imputation of biosignals in both clinical and wearable sensor settings. PyPulse's framework provides a modular and extendable framework with high ease-of-use for a broad userbase, including non-machine-learning bioresearchers. We released PyPulse under the MIT License on Github and PyPI.
arXiv Detail & Related papers (2024-12-09T11:00:55Z)
A Machine Learning-Based Approach For Detecting Malicious PyPI Packages [4.311626046942916]
In modern software development, the use of external libraries and packages is increasingly prevalent. This reliance on reusing code introduces serious risks for deployed software in the form of malicious packages. We propose a data-driven approach that uses machine learning and static analysis to examine the package's metadata, code, files, and textual characteristics.
arXiv Detail & Related papers (2024-12-06T18:49:06Z)
Analyzing the Accessibility of GitHub Repositories for PyPI and NPM Libraries [91.97201077607862]
Industrial applications heavily rely on open-source software (OSS) libraries, which provide various benefits. To monitor the activities of such communities, a comprehensive list of repositories for the libraries of an ecosystem must be accessible. In this study, we analyze the accessibility of GitHub repositories for PyPI and NPM libraries.
arXiv Detail & Related papers (2024-04-26T13:27:04Z)
DONAPI: Malicious NPM Packages Detector using Behavior Sequence Knowledge Mapping [28.852274185512236]
npm is the most extensive package manager, hosting more than 2 million third-party open-source packages. In this paper, we synchronize a local package cache containing more than 3.4 million packages in near real-time to give us access to more package code details. We propose the DONAPI, an automatic malicious npm packages detector that combines static and dynamic analysis.
arXiv Detail & Related papers (2024-03-13T08:38:21Z)
pyvene: A Library for Understanding and Improving PyTorch Models via Interventions [79.72930339711478]
$textbfpyvene$ is an open-source library that supports customizable interventions on a range of different PyTorch modules. We show how $textbfpyvene$ provides a unified framework for performing interventions on neural models and sharing the intervened upon models with others.
arXiv Detail & Related papers (2024-03-12T16:46:54Z)
Malicious Package Detection using Metadata Information [0.272760415353533]
We introduce a metadata-based malicious package detection model, MeMPtec. MeMPtec extracts a set of features from package metadata information. Our experiments indicate a significant reduction in both false positives and false negatives.
arXiv Detail & Related papers (2024-02-12T06:54:57Z)
An Empirical Study of Malicious Code In PyPI Ecosystem [15.739368369031277]
PyPI provides a convenient and accessible package management platform to developers. The rapid development of the PyPI ecosystem has led to a severe problem of malicious package propagation. We conduct an empirical study to understand the characteristics and current state of the malicious code lifecycle in the PyPI ecosystem.
arXiv Detail & Related papers (2023-09-20T02:51:02Z)
Malicious Package Detection in NPM and PyPI using a Single Model of Malicious Behavior Sequence [7.991922551051611]
Package registries NPM and PyPI have been flooded with malicious packages. The effectiveness of existing malicious NPM and PyPI package detection approaches is hindered by two challenges. We propose and implement Cerebro to detect malicious packages in NPM and PyPI.
arXiv Detail & Related papers (2023-09-06T00:58:59Z)
PyPOTS: A Python Toolbox for Data Mining on Partially-Observed Time Series [0.0]
PyPOTS is an open-source Python library dedicated to data mining and analysis on partially-observed time series. It provides easy access to diverse algorithms categorized into four tasks: imputation, classification, clustering, and forecasting.
arXiv Detail & Related papers (2023-05-30T07:57:05Z)
DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a python software package for analysing and characterising high-dimensional data. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
arXiv Detail & Related papers (2022-05-04T08:41:59Z)
Interactive Visualization of Protein RINs using NetworKit in the Cloud [57.780880387925954]
In this paper, we consider an example from protein dynamics, specifically residue interaction networks (RINs) We use NetworKit to build a cloud-based environment that enables domain scientists to run their visualization and analysis on large compute servers. To demonstrate the versatility of this approach, we use it to build a custom Jupyter-based widget for RIN visualization.
arXiv Detail & Related papers (2022-03-02T17:41:45Z)
PyHHMM: A Python Library for Heterogeneous Hidden Markov Models [63.01207205641885]
PyHHMM is an object-oriented Python implementation of Heterogeneous-Hidden Markov Models (HHMMs) PyHHMM emphasizes features not supported in similar available frameworks: a heterogeneous observation model, missing data inference, different model order selection criterias, and semi-supervised training. PyHHMM relies on the numpy, scipy, scikit-learn, and seaborn Python packages, and is distributed under the Apache-2.0 License.
arXiv Detail & Related papers (2022-01-12T07:32:36Z)
mvlearn: Multiview Machine Learning in Python [103.55817158943866]
mvlearn is a Python library which implements the leading multiview machine learning methods. The package can be installed from Python Package Index (PyPI) and the conda package manager.
arXiv Detail & Related papers (2020-05-25T02:35:35Z)

This list is automatically generated from the titles and abstracts of the papers in this site.