An Empirical Study of Malicious Code In PyPI Ecosystem
- URL: http://arxiv.org/abs/2309.11021v1
- Date: Wed, 20 Sep 2023 02:51:02 GMT
- Title: An Empirical Study of Malicious Code In PyPI Ecosystem
- Authors: Wenbo Guo, Zhengzi Xu, Chengwei Liu, Cheng Huang, Yong Fang, Yang Liu
- Abstract summary: PyPI provides a convenient and accessible package management platform to developers.
The rapid development of the PyPI ecosystem has led to a severe problem of malicious package propagation.
We conduct an empirical study to understand the characteristics and current state of the malicious code lifecycle in the PyPI ecosystem.
- Score: 15.739368369031277
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: PyPI provides a convenient and accessible package management platform to
developers, enabling them to quickly implement specific functions and improve
work efficiency. However, the rapid development of the PyPI ecosystem has led
to a severe problem of malicious package propagation. Malicious developers
disguise malicious packages as normal, posing a significant security risk to
end-users.
To this end, we conducted an empirical study to understand the
characteristics and current state of the malicious code lifecycle in the PyPI
ecosystem. We first built an automated data collection framework and collated a
multi-source malicious code dataset containing 4,669 malicious package files.
We preliminarily classified this malicious code into five categories based on
malicious behaviour characteristics. Our research found that over 50% of
malicious code exhibits multiple malicious behaviours, with information
stealing and command execution being particularly prevalent. In addition, we
observed several novel attack vectors and anti-detection techniques. Our
analysis revealed that 74.81% of all malicious packages successfully entered
end-user projects through source code installation, thereby increasing security
risks. A real-world investigation showed that many reported malicious packages
persist in PyPI mirror servers globally, with over 72% remaining for an
extended period after being discovered. Finally, we sketched a portrait of the
malicious code lifecycle in the PyPI ecosystem, effectively reflecting the
characteristics of malicious code at different stages. We also present some
suggested mitigations to improve the security of the Python open-source
ecosystem.
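The abstract reports that most malicious packages reached end-user projects through source code installation. A source install typically runs the package's `setup.py`, so any code placed there executes on the installing machine. The sketch below is a minimal illustration of that risk, not the authors' method or dataset tooling: it parses a `setup.py`-style string with Python's `ast` module, without running it, and lists calls commonly used for command execution. The watch list `SUSPICIOUS_CALLS`, the helper names, and the sample script are all hypothetical and chosen only for illustration.

```python
import ast

# Hypothetical watch list for illustration; not taken from the paper.
SUSPICIOUS_CALLS = {
    "os.system",
    "subprocess.run",
    "subprocess.call",
    "subprocess.Popen",
    "eval",
    "exec",
}


def dotted_name(node):
    """Return a dotted name such as 'os.system' for a call target, or None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def find_suspicious_calls(source):
    """List (line number, call name) pairs for watched calls in the source."""
    hits = []
    # ast.parse only builds a syntax tree; the scanned code is never executed.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_CALLS:
                hits.append((node.lineno, name))
    return sorted(hits)


# Hypothetical, harmless stand-in for an install script with a top-level command.
SAMPLE_SETUP_PY = '''
import os
from setuptools import setup

os.system("echo placeholder-for-an-install-time-command")
setup(name="example-package", version="0.0.1")
'''

print(find_suspicious_calls(SAMPLE_SETUP_PY))
```

A name-based check like this is easy to evade through aliasing, encoding, or dynamic imports, which is consistent with the anti-detection techniques the abstract mentions. It is a starting point for manual review rather than a detector.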
Related papers
- RedCode: Risky Code Execution and Generation Benchmark for Code Agents [50.81206098588923]
RedCode is a benchmark for risky code execution and generation.
RedCode-Exec provides challenging prompts that could lead to risky code execution.
RedCode-Gen provides 160 prompts with function signatures and docstrings as input to assess whether code agents will follow instructions.
arXiv Detail & Related papers (2024-11-12T13:30:06Z)
- Seeker: Enhancing Exception Handling in Code with LLM-based Multi-Agent Approach [54.03528377384397]
In real world software development, improper or missing exception handling can severely impact the robustness and reliability of code.
We explore the use of large language models (LLMs) to improve exception handling in code.
We propose Seeker, a multi agent framework inspired by expert developer strategies for exception handling.
arXiv Detail & Related papers (2024-10-09T14:45:45Z)
- Towards Robust Detection of Open Source Software Supply Chain Poisoning Attacks in Industry Environments [9.29518367616395]
We present OSCAR, a dynamic code poisoning detection pipeline for NPM and PyPI ecosystems.
OSCAR fully executes packages in a sandbox environment, employs fuzz testing on exported functions and classes, and implements aspect-based behavior monitoring.
We evaluate OSCAR against six existing tools using a comprehensive benchmark dataset of real-world malicious and benign packages.
arXiv Detail & Related papers (2024-09-14T08:01:43Z)
- The Impact of SBOM Generators on Vulnerability Assessment in Python: A Comparison and a Novel Approach [56.4040698609393]
Software Bill of Materials (SBOM) has been promoted as a tool to increase transparency and verifiability in software composition.
Current SBOM generation tools often suffer from inaccuracies in identifying components and dependencies.
We propose PIP-sbom, a novel pip-inspired solution that addresses their shortcomings.
arXiv Detail & Related papers (2024-09-10T10:12:37Z)
- An Empirical Study on Package-Level Deprecation in Python Ecosystem [6.0347124337922144]
Python, a widely adopted programming language, is renowned for its extensive and diverse third-party package ecosystem.
A significant number of OSS packages within the Python ecosystem are in poor maintenance, leading to potential risks in functionality and security.
This paper investigates the current practices of announcing, receiving, and handling package-level deprecation in the Python ecosystem.
arXiv Detail & Related papers (2024-08-19T18:08:21Z)
- Malicious Package Detection using Metadata Information [0.272760415353533]
We introduce a metadata-based malicious package detection model, MeMPtec.
MeMPtec extracts a set of features from package metadata information.
Our experiments indicate a significant reduction in both false positives and false negatives.
arXiv Detail & Related papers (2024-02-12T06:54:57Z)
- On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI [6.935278888313423]
Malicious users started to spread malware by publishing open-source packages containing malicious code.
Recent works apply machine learning techniques to detect malicious packages in the npm ecosystem.
We present a novel approach that involves a set of language-independent features and the training of models capable of detecting malicious packages in npm and PyPI.
arXiv Detail & Related papers (2023-10-14T12:32:51Z)
- Malicious Package Detection in NPM and PyPI using a Single Model of Malicious Behavior Sequence [7.991922551051611]
Package registries NPM and PyPI have been flooded with malicious packages.
The effectiveness of existing malicious NPM and PyPI package detection approaches is hindered by two challenges.
We propose and implement Cerebro to detect malicious packages in NPM and PyPI.
arXiv Detail & Related papers (2023-09-06T00:58:59Z)
- On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potential vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z)
- FAT Forensics: A Python Toolbox for Implementing and Deploying Fairness, Accountability and Transparency Algorithms in Predictive Systems [69.24490096929709]
We developed an open source Python package called FAT Forensics.
It can inspect important fairness, accountability and transparency aspects of predictive algorithms.
Our toolbox can evaluate all elements of a predictive pipeline.
arXiv Detail & Related papers (2022-09-08T13:25:02Z)
- VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.