A Machine Learning-Based Approach For Detecting Malicious PyPI Packages
- URL: http://arxiv.org/abs/2412.05259v1
- Date: Fri, 06 Dec 2024 18:49:06 GMT
- Title: A Machine Learning-Based Approach For Detecting Malicious PyPI Packages
- Authors: Haya Samaana, Diego Elias Costa, Emad Shihab, Ahmad Abdellatif
- Abstract summary: In modern software development, the use of external libraries and packages is increasingly prevalent.
This reliance on reusing code introduces serious risks for deployed software in the form of malicious packages.
We propose a data-driven approach that uses machine learning and static analysis to examine the package's metadata, code, files, and textual characteristics.
- Score: 4.311626046942916
- Abstract: Background. In modern software development, the use of external libraries and packages is increasingly prevalent, streamlining the software development process and enabling developers to deploy feature-rich systems with little coding. While this reliance on reusing code offers substantial benefits, it also introduces serious risks for deployed software in the form of malicious packages - harmful and vulnerable code disguised as useful libraries. Aims. Popular ecosystems, such as PyPI, receive thousands of new package contributions every week, and distinguishing safe contributions from harmful ones presents a significant challenge. There is a dire need for reliable methods to detect and address the presence of malicious packages in these environments. Method. To address these challenges, we propose a data-driven approach that uses machine learning and static analysis to examine the package's metadata, code, files, and textual characteristics to identify malicious packages. Results. In evaluations conducted within the PyPI ecosystem, we achieved an F1-measure of 0.94 for identifying malicious packages using a stacking ensemble classifier. Conclusions. This tool can be seamlessly integrated into package vetting pipelines and has the capability to flag entire packages, not just malicious function calls. This enhancement strengthens security measures and reduces the manual workload for developers and registry maintainers, thereby contributing to the overall integrity of the ecosystem.
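As a rough illustration of the static-analysis idea in the abstract, the sketch below parses Python source without executing it and counts a few simple code features, then computes an F1-measure for binary labels. The feature names, the `SUSPICIOUS_CALLS` set, and the helper functions are hypothetical choices made for this example and are not the paper's feature set; the stacking ensemble itself is not shown, only the kind of inputs and metric such a classifier might use.

```python
import ast

# Hypothetical list of call names treated as suspicious; not taken from the paper.
SUSPICIOUS_CALLS = {"eval", "exec", "compile", "system", "popen", "b64decode", "urlopen"}


def call_name(node: ast.Call) -> str:
    """Return the rightmost name of a call target, e.g. 'system' for os.system(...)."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return ""


def extract_features(source: str) -> dict:
    """Parse source (never executing it) and count a few simple code features."""
    tree = ast.parse(source)
    features = {"n_calls": 0, "n_suspicious_calls": 0, "n_imports": 0}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            features["n_calls"] += 1
            if call_name(node) in SUSPICIOUS_CALLS:
                features["n_suspicious_calls"] += 1
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            features["n_imports"] += 1
    return features


def f1_score(y_true, y_pred) -> float:
    """F1 = 2PR / (P + R) for the positive class (1 = malicious)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# A made-up snippet resembling an obfuscated install-time payload.
sample = "import os, base64\nexec(base64.b64decode('cHJpbnQoMSk='))\nos.system('id')\n"
feats = extract_features(sample)
print(feats)
```

In a fuller pipeline, dictionaries like `feats` would presumably be combined with metadata, file, and text features before being passed to a classifier; that combination is an assumption here, since the abstract does not spell out the details.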
Related papers
- The Impact of SBOM Generators on Vulnerability Assessment in Python: A Comparison and a Novel Approach [56.4040698609393]
Software Bill of Materials (SBOM) has been promoted as a tool to increase transparency and verifiability in software composition.
Current SBOM generation tools often suffer from inaccuracies in identifying components and dependencies.
We propose PIP-sbom, a novel pip-inspired solution that addresses their shortcomings.
arXiv Detail & Related papers (2024-09-10T10:12:37Z)
- An Empirical Study on Package-Level Deprecation in Python Ecosystem [6.0347124337922144]
Python, a widely adopted programming language, is renowned for its extensive and diverse third-party package ecosystem.
A significant number of OSS packages within the Python ecosystem are poorly maintained, leading to potential risks in functionality and security.
This paper investigates the current practices of announcing, receiving, and handling package-level deprecation in the Python ecosystem.
arXiv Detail & Related papers (2024-08-19T18:08:21Z)
- How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE).
We develop a novel method named RepoUnderstander that guides agents to comprehensively understand whole repositories.
To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
- A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems [13.610690659041417]
Malicious packages have less metadata content and utilize fewer static and dynamic functions than legitimate ones.
One dimension in fine-grained information (FGI) has sufficient distinguishable capability to detect malicious packages.
arXiv Detail & Related papers (2024-04-17T15:16:01Z)
- Malicious Package Detection using Metadata Information [0.272760415353533]
We introduce a metadata-based malicious package detection model, MeMPtec.
MeMPtec extracts a set of features from package metadata information.
Our experiments indicate a significant reduction in both false positives and false negatives.
arXiv Detail & Related papers (2024-02-12T06:54:57Z)
- An Empirical Study of Malicious Code In PyPI Ecosystem [15.739368369031277]
PyPI provides a convenient and accessible package management platform to developers.
The rapid development of the PyPI ecosystem has led to a severe problem of malicious package propagation.
We conduct an empirical study to understand the characteristics and current state of the malicious code lifecycle in the PyPI ecosystem.
arXiv Detail & Related papers (2023-09-20T02:51:02Z)
- VulLibGen: Generating Names of Vulnerability-Affected Packages via a Large Language Model [13.96251273677855]
VulLibGen is a method to directly generate affected packages.
It has an average accuracy of 0.806 for identifying vulnerable packages.
We have submitted 60 <vulnerability, affected package> pairs to GitHub Advisory.
arXiv Detail & Related papers (2023-08-09T02:02:46Z)
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
- Dos and Don'ts of Machine Learning in Computer Security [74.1816306998445]
Despite great potential, machine learning in security is prone to subtle pitfalls that undermine its performance.
We identify common pitfalls in the design, implementation, and evaluation of learning-based security systems.
We propose actionable recommendations to support researchers in avoiding or mitigating the pitfalls where possible.
arXiv Detail & Related papers (2020-10-19T13:09:31Z)
- SafePILCO: a software tool for safe and data-efficient policy synthesis [67.17251247987187]
SafePILCO is a software tool for safe and data-efficient policy search with reinforcement learning.
It extends the known PILCO algorithm, originally written in MATLAB, to support safe learning.
arXiv Detail & Related papers (2020-08-07T17:17:30Z)
- Autosploit: A Fully Automated Framework for Evaluating the Exploitability of Security Vulnerabilities [47.748732208602355]
Autosploit is an automated framework for evaluating the exploitability of vulnerabilities.
It automatically tests the exploits on different configurations of the environment.
It is able to identify the system properties that affect the ability to exploit a vulnerability in both noiseless and noisy environments.
arXiv Detail & Related papers (2020-06-30T18:49:18Z)