Related papers: Many Hands Make Light Work: An LLM-based Multi-Agent System for Detecting Malicious PyPI Packages

Many Hands Make Light Work: An LLM-based Multi-Agent System for Detecting Malicious PyPI Packages

URL: http://arxiv.org/abs/2601.12148v3
Date: Sun, 25 Jan 2026 20:02:40 GMT
Title: Many Hands Make Light Work: An LLM-based Multi-Agent System for Detecting Malicious PyPI Packages
Authors: Muhammad Umar Zeshan, Motunrayo Ibiyo, Claudio Di Sipio, Phuong T. Nguyen, Davide Di Ruscio,
Abstract summary: Malicious code in open-source repositories such as PyPI poses a growing threat to software supply chains.<n>This paper presents LAMPS, a multi-agent system that employs collaborative language models to detect malicious PyPI packages.
Score: 3.7667883869699597
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Malicious code in open-source repositories such as PyPI poses a growing threat to software supply chains. Traditional rule-based tools often overlook the semantic patterns in source code that are crucial for identifying adversarial components. Large language models (LLMs) show promise for software analysis, yet their use in interpretable and modular security pipelines remains limited. This paper presents LAMPS, a multi-agent system that employs collaborative LLMs to detect malicious PyPI packages. The system consists of four role-specific agents for package retrieval, file extraction, classification, and verdict aggregation, coordinated through the CrewAI framework. A prototype combines a fine-tuned CodeBERT model for classification with LLaMA-3 agents for contextual reasoning. LAMPS has been evaluated on two complementary datasets: D1, a balanced collection of 6,000 setup.py files, and D2, a realistic multi-file dataset with 1,296 files and natural class imbalance. On D1, LAMPS achieves 97.7% accuracy, surpassing MPHunter--one of the state-of-the-art approaches. On D2, it reaches 99.5% accuracy and 99.5% balanced accuracy, outperforming RAG-based approaches and fine-tuned single-agent baselines. McNemar's test confirmed these improvements as highly significant. The results demonstrate the feasibility of distributed LLM reasoning for malicious code detection and highlight the benefits of modular multi-agent designs in software supply chain security.

Related papers

Mind the Gap: Evaluating LLMs for High-Level Malicious Package Detection vs. Fine-Grained Indicator Identification [1.1103813686369686]
Large Language Models (LLMs) have emerged as a promising tool for automated security tasks.<n>This paper presents a systematic evaluation of 13 LLMs for detecting malicious software packages.
arXiv Detail & Related papers (2026-02-18T09:36:46Z)
Every Step Counts: Decoding Trajectories as Authorship Fingerprints of dLLMs [63.82840470917859]
We show that the decoding mechanism of dLLMs can be used as a powerful tool for model attribution.<n>We propose a novel information extraction scheme called the Directed Decoding Map (DDM), which captures structural relationships between decoding steps and better reveals model-specific behaviors.
arXiv Detail & Related papers (2025-10-02T06:25:10Z)
Binary Diff Summarization using Large Language Models [17.877160310535942]
Large language models (LLMs) have been applied to binary analysis to augment traditional tools.<n>We propose a novel framework for binary diff summarization using LLMs.<n>We create a software supply chain security benchmark by injecting 3 different malware into 6 open-source projects.
arXiv Detail & Related papers (2025-09-28T16:47:24Z)
AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? [71.21547572568655]
AgenTracer-8B is a lightweight failure tracer trained with multi-granular reinforcement learning.<n>On the Who&When benchmark, AgenTracer-8B outperforms giant proprietary LLMs like Gemini-2.5-Pro and Claude-4-Sonnet by up to 18.18%.<n>AgenTracer-8B delivers actionable feedback to off-the-shelf multi-agent systems like MetaGPT and MaAS with 4.8-14.2% performance gains.
arXiv Detail & Related papers (2025-09-03T13:42:14Z)
MCP-Orchestrated Multi-Agent System for Automated Disinformation Detection [84.75972919995398]
This paper presents a multi-agent system that uses relation extraction to detect disinformation in news articles.<n>The proposed Agentic AI system combines four agents: (i) a machine learning agent (logistic regression), (ii) a Wikipedia knowledge check agent, and (iv) a web-scraped data analyzer.<n>Results demonstrate that the multi-agent ensemble achieves 95.3% accuracy with an F1 score of 0.964, significantly outperforming individual agents and traditional approaches.
arXiv Detail & Related papers (2025-08-13T19:14:48Z)
MalGuard: Towards Real-Time, Accurate, and Actionable Detection of Malicious Packages in PyPI Ecosystem [11.834078597426409]
Malicious package detection has become a critical task in ensuring the security and stability of the PyPI.<n>Existing detection approaches have focused on advancing model selection, evolving from traditional machine learning (ML) models to large language models (LLMs)<n>We propose a novel approach MalGuard based on graph centrality analysis and the LIME (Local Interpretable Model-agnostic Explanations) algorithm to detect malicious packages.
arXiv Detail & Related papers (2025-06-17T12:30:56Z)
Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification [6.008384763761687]
Large Language Models (LLMs) are integral software components in modern applications.<n>We presentGuard, a gradient-based fingerprinting framework for similarity detection and family classification.<n>Our approach extracts model-intrinsic behavioral signatures by analyzing responses to random input perturbations.<n>It supports the widely-adopted safetensors format and constructs high-dimensional fingerprints through statistical analysis of gradient features.
arXiv Detail & Related papers (2025-06-02T13:08:01Z)
LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets [8.166584296080805]
We investigate the utility of Large Language Models for detecting tangled code changes by leveraging both commit messages and method-level code diffs.<n>Our results demonstrate that combining commit messages with code diffs significantly enhances model performance.<n>Applying our approach to 49 open-source projects improves the distributional separability of code metrics between buggy and non-buggy methods.
arXiv Detail & Related papers (2025-05-13T06:26:13Z)
Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align Large Language models.<n>Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.<n>We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z)
Why Do Multi-Agent LLM Systems Fail? [87.90075668488434]
We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks.<n>We build the first Multi-Agent System Failure taxonomy (MAST)<n>We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent)
arXiv Detail & Related papers (2025-03-17T19:04:38Z)
SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution [56.9361004704428]
Large Language Models (LLMs) have demonstrated remarkable proficiency across a variety of complex tasks.<n>SWE-Fixer is a novel open-source framework designed to effectively and efficiently resolve GitHub issues.<n>We assess our approach on the SWE-Bench Lite and Verified benchmarks, achieving competitive performance among open-source models.
arXiv Detail & Related papers (2025-01-09T07:54:24Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing.<n>As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework.<n>This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.