On the Feasibility of Cross-Language Detection of Malicious Packages in
npm and PyPI
- URL: http://arxiv.org/abs/2310.09571v1
- Date: Sat, 14 Oct 2023 12:32:51 GMT
- Title: On the Feasibility of Cross-Language Detection of Malicious Packages in
npm and PyPI
- Authors: Piergiorgio Ladisa and Serena Elisa Ponta and Nicola Ronzoni and
Matias Martinez and Olivier Barais
- Abstract summary: Malicious users started to spread malware by publishing open-source packages containing malicious code.
Recent works apply machine learning techniques to detect malicious packages in the npm ecosystem.
We present a novel approach that involves a set of language-independent features and the training of models capable of detecting malicious packages in npm and PyPI.
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current software supply chains heavily rely on open-source packages hosted in
public repositories. Given the popularity of ecosystems like npm and PyPI,
malicious users started to spread malware by publishing open-source packages
containing malicious code. Recent works apply machine learning techniques to
detect malicious packages in the npm ecosystem. However, the scarcity of
samples poses a challenge to the application of machine learning techniques in
other ecosystems. Despite the differences between JavaScript and Python, the
open-source software supply chain attacks targeting such languages show
noticeable similarities (e.g., use of installation scripts, obfuscated strings,
URLs).
In this paper, we present a novel approach that involves a set of
language-independent features and the training of models capable of detecting
malicious packages in npm and PyPI by capturing their commonalities. This
methodology allows us to train models on a diverse dataset encompassing
multiple languages, thereby overcoming the challenge of limited sample
availability. We evaluate the models both in a controlled experiment (where
labels of data are known) and in the wild by scanning newly uploaded packages
for both npm and PyPI for 10 days.
We find that our approach successfully detects malicious packages for both
npm and PyPI. Over an analysis of 31,292 packages, we reported 58 previously
unknown malicious packages (38 for npm and 20 for PyPI), which were
consequently removed from the respective repositories.
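The idea of language-independent features can be illustrated with a minimal sketch. It is not the authors' feature set or implementation: the feature names, the regular expressions, and the `extract_features` helper are assumptions chosen only to reflect the commonalities the abstract mentions (installation scripts, obfuscated strings, URLs). Because the same function runs on JavaScript and Python source text alike, the resulting vectors could in principle be pooled into one training set.

```python
import math
import re
from collections import Counter

# Hypothetical patterns: they look only at raw text, not at language syntax trees.
URL_RE = re.compile(r"https?://[^\s'\"<>)]+")
# Simple quoted string literals; escape sequences and multi-line strings are ignored.
STRING_RE = re.compile(r"\"([^\"\\\n]*)\"|'([^'\\\n]*)'")


def shannon_entropy(text: str) -> float:
    """Shannon entropy in bits per character; high values can hint at obfuscation."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(source: str, has_install_script: bool) -> dict:
    """Build one feature dictionary from source text of either language.

    `has_install_script` is supplied by the caller, e.g. after checking for an
    install hook in package.json (npm) or custom install logic in setup.py (PyPI).
    """
    strings = [a or b for a, b in STRING_RE.findall(source)]
    entropies = [shannon_entropy(s) for s in strings if s]
    return {
        "num_urls": len(URL_RE.findall(source)),
        "num_strings": len(strings),
        "max_string_entropy": max(entropies, default=0.0),
        "has_install_script": int(has_install_script),
    }


if __name__ == "__main__":
    js_source = 'const u = "http://example.com/x"; require("child_process");'
    py_source = "u = 'http://example.com/x'\n"
    print(extract_features(js_source, has_install_script=True))
    print(extract_features(py_source, has_install_script=False))
```

Both calls return dictionaries with identical keys, so rows from npm and PyPI packages can share one feature matrix; any classifier trained on it is a separate design choice that this sketch does not cover.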
Related papers
- Analyzing the Accessibility of GitHub Repositories for PyPI and NPM Libraries [91.97201077607862]
Industrial applications heavily rely on open-source software (OSS) libraries, which provide various benefits.
To monitor the activities of such communities, a comprehensive list of repositories for the libraries of an ecosystem must be accessible.
In this study, we analyze the accessibility of GitHub repositories for PyPI and NPM libraries.
arXiv Detail & Related papers (2024-04-26T13:27:04Z)
- DONAPI: Malicious NPM Packages Detector using Behavior Sequence Knowledge Mapping [28.852274185512236]
npm is the most extensive package manager, hosting more than 2 million third-party open-source packages.
In this paper, we synchronize a local package cache containing more than 3.4 million packages in near real-time to give us access to more package code details.
We propose DONAPI, an automatic detector of malicious npm packages that combines static and dynamic analysis.
arXiv Detail & Related papers (2024-03-13T08:38:21Z)
- pyvene: A Library for Understanding and Improving PyTorch Models via Interventions [79.72930339711478]
pyvene is an open-source library that supports customizable interventions on a range of different PyTorch modules.
We show how pyvene provides a unified framework for performing interventions on neural models and sharing the intervened-upon models with others.
arXiv Detail & Related papers (2024-03-12T16:46:54Z)
- Malicious Package Detection using Metadata Information [0.272760415353533]
We introduce a metadata-based malicious package detection model, MeMPtec.
MeMPtec extracts a set of features from package metadata information.
Our experiments indicate a significant reduction in both false positives and false negatives.
arXiv Detail & Related papers (2024-02-12T06:54:57Z)
- An Empirical Study of Malicious Code In PyPI Ecosystem [15.739368369031277]
PyPI provides a convenient and accessible package management platform to developers.
The rapid development of the PyPI ecosystem has led to a severe problem of malicious package propagation.
We conduct an empirical study to understand the characteristics and current state of the malicious code lifecycle in the PyPI ecosystem.
arXiv Detail & Related papers (2023-09-20T02:51:02Z)
- Malicious Package Detection in NPM and PyPI using a Single Model of Malicious Behavior Sequence [7.991922551051611]
Package registries NPM and PyPI have been flooded with malicious packages.
The effectiveness of existing malicious NPM and PyPI package detection approaches is hindered by two challenges.
We propose and implement Cerebro to detect malicious packages in NPM and PyPI.
arXiv Detail & Related papers (2023-09-06T00:58:59Z)
- PyPOTS: A Python Toolbox for Data Mining on Partially-Observed Time Series [0.0]
PyPOTS is an open-source Python library dedicated to data mining and analysis on partially-observed time series.
It provides easy access to diverse algorithms categorized into four tasks: imputation, classification, clustering, and forecasting.
arXiv Detail & Related papers (2023-05-30T07:57:05Z)
- DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a python software package for analysing and characterising high-dimensional data.
It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
arXiv Detail & Related papers (2022-05-04T08:41:59Z)
- Interactive Visualization of Protein RINs using NetworKit in the Cloud [57.780880387925954]
In this paper, we consider an example from protein dynamics, specifically residue interaction networks (RINs).
We use NetworKit to build a cloud-based environment that enables domain scientists to run their visualization and analysis on large compute servers.
To demonstrate the versatility of this approach, we use it to build a custom Jupyter-based widget for RIN visualization.
arXiv Detail & Related papers (2022-03-02T17:41:45Z)
- PyHHMM: A Python Library for Heterogeneous Hidden Markov Models [63.01207205641885]
PyHHMM is an object-oriented Python implementation of Heterogeneous-Hidden Markov Models (HHMMs).
PyHHMM emphasizes features not supported in similar available frameworks: a heterogeneous observation model, missing data inference, different model order selection criteria, and semi-supervised training.
PyHHMM relies on the numpy, scipy, scikit-learn, and seaborn Python packages, and is distributed under the Apache-2.0 License.
arXiv Detail & Related papers (2022-01-12T07:32:36Z)
- mvlearn: Multiview Machine Learning in Python [103.55817158943866]
mvlearn is a Python library which implements the leading multiview machine learning methods.
The package can be installed from Python Package Index (PyPI) and the conda package manager.
arXiv Detail & Related papers (2020-05-25T02:35:35Z)