Related papers: Mind the Gap: Evaluating LLMs for High-Level Malicious Package Detection vs. Fine-Grained Indicator Identification

Mind the Gap: Evaluating LLMs for High-Level Malicious Package Detection vs. Fine-Grained Indicator Identification

URL: http://arxiv.org/abs/2602.16304v1
Date: Wed, 18 Feb 2026 09:36:46 GMT
Title: Mind the Gap: Evaluating LLMs for High-Level Malicious Package Detection vs. Fine-Grained Indicator Identification
Authors: Ahmed Ryan, Ibrahim Khalil, Abdullah Al Jahid, Md Erfan, Akond Ashfaque Ur Rahman, Md Rayhanur Rahman,
Abstract summary: Large Language Models (LLMs) have emerged as a promising tool for automated security tasks.<n>This paper presents a systematic evaluation of 13 LLMs for detecting malicious software packages.
Score: 1.1103813686369686
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The prevalence of malicious packages in open-source repositories, such as PyPI, poses a critical threat to the software supply chain. While Large Language Models (LLMs) have emerged as a promising tool for automated security tasks, their effectiveness in detecting malicious packages and indicators remains underexplored. This paper presents a systematic evaluation of 13 LLMs for detecting malicious software packages. Using a curated dataset of 4,070 packages (3,700 benign and 370 malicious), we evaluate model performance across two tasks: binary classification (package detection) and multi-label classification (identification of specific malicious indicators). We further investigate the impact of prompting strategies, temperature settings, and model specifications on detection accuracy. We find a significant "granularity gap" in LLMs' capabilities. While GPT-4.1 achieves near-perfect performance in binary detection (F1 $\approx$ 0.99), performance degrades by approximately 41\% when the task shifts to identifying specific malicious indicators. We observe that general models are best for filtering out the majority of threats, while specialized coder models are better at detecting attacks that follow a strict, predictable code structure. Our correlation analysis indicates that parameter size and context width have negligible explanatory power regarding detection accuracy. We conclude that while LLMs are powerful detectors at the package level, they lack the semantic depth required for precise identification at the granular indicator level.

Related papers

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift [0.0]
We present a comprehensive analysis using a benchmark of 18 datasets spanning harmful requests, jailbreaks, indirect prompt injections, and extraction attacks.<n>We propose Leave-One-Dataset-Out (LODO) evaluation to measure true out-of-distribution generalization.
arXiv Detail & Related papers (2026-02-15T14:21:43Z)
Cutting the Gordian Knot: Detecting Malicious PyPI Packages via a Knowledge-Mining Framework [14.0015860172317]
Python Package Index (PyPI) has become a target for malicious actors.<n>Current detection tools generate false positive rates of 15-30%, incorrectly flagging one-third of legitimate packages as malicious.<n>We propose PyGuard, a knowledge-driven framework that converts detection failures into useful behavioral knowledge.
arXiv Detail & Related papers (2026-01-23T05:49:09Z)
Bridging Expert Reasoning and LLM Detection: A Knowledge-Driven Framework for Malicious Packages [10.858565849895314]
Open-source ecosystems such as NPM and PyPI are increasingly targeted by supply chain attacks.<n>We present IntelGuard, a retrieval-augmented generation (RAG) based framework that integrates expert analytical reasoning into automated malicious package detection.
arXiv Detail & Related papers (2026-01-23T05:31:12Z)
Many Hands Make Light Work: An LLM-based Multi-Agent System for Detecting Malicious PyPI Packages [3.7667883869699597]
Malicious code in open-source repositories such as PyPI poses a growing threat to software supply chains.<n>This paper presents LAMPS, a multi-agent system that employs collaborative language models to detect malicious PyPI packages.
arXiv Detail & Related papers (2026-01-17T19:43:22Z)
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation.<n>It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective.<n>On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
Towards Automated Error Discovery: A Study in Conversational AI [48.735443116662026]
We introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI.<n>We also propose SEEED (Soft Clustering Extended-Based Error Detection), as an encoder-based approach to its implementation.
arXiv Detail & Related papers (2025-09-13T14:53:22Z)
Backdoor Cleaning without External Guidance in MLLM Fine-tuning [76.82121084745785]
Believe Your Eyes (BYE) is a data filtering framework that leverages attention entropy patterns as self-supervised signals to identify and filter backdoor samples.<n>It achieves near-zero attack success rates while maintaining clean-task performance.
arXiv Detail & Related papers (2025-05-22T17:11:58Z)
Detecting Malicious Source Code in PyPI Packages with LLMs: Does RAG Come in Handy? [6.7341750484636975]
Malicious software packages in open-source ecosystems, such as PyPI, pose growing security risks.<n>In this work, we empirically evaluate the effectiveness of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), and few-shot learning for detecting malicious source code.
arXiv Detail & Related papers (2025-04-18T16:11:59Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models [95.09157454599605]
Large Language Models (LLMs) are becoming increasingly powerful, but they still exhibit significant but subtle weaknesses.<n>Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies.<n>We introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks.
arXiv Detail & Related papers (2024-06-24T15:16:45Z)
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks. This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs. We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.