Detecting and Characterizing Low and No Functionality Packages in the NPM Ecosystem
- URL: http://arxiv.org/abs/2510.04495v1
- Date: Mon, 06 Oct 2025 05:11:49 GMT
- Title: Detecting and Characterizing Low and No Functionality Packages in the NPM Ecosystem
- Authors: Napasorn Tevarut, Brittany Reid, Yutaro Kashiwa, Pattara Leelaprute, Arnon Rungsawang, Bundit Manaskasemsak, Hajimu Iida,
- Abstract summary: Trivial packages, small modules with low functionality, are common in the npm ecosystem.<n>This paper refines existing definitions and introduces data-only packages that contain no executable logic.<n>A rule-based static analysis method is developed to detect trivial and data-only packages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Trivial packages, small modules with low functionality, are common in the npm ecosystem and can pose security risks despite their simplicity. This paper refines existing definitions and introduce data-only packages that contain no executable logic. A rule-based static analysis method is developed to detect trivial and data-only packages and evaluate their prevalence and associated risks in the 2025 npm ecosystem. The analysis shows that 17.92% of packages are trivial, with vulnerability levels comparable to non-trivial ones, and data-only packages, though rare, also contain risks. The proposed detection tool achieves 94% accuracy (macro-F1 0.87), enabling effective large-scale analysis to reduce security exposure. This findings suggest that trivial and data-only packages warrant greater attention in dependency management to reduce potential technical debt and security exposure.
Related papers
- Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection.<n> Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting.<n>We propose a Markov-informed score calibration strategy that models two relationships of context detection scores that may aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z) - Rethinking Evaluation of Infrared Small Target Detection [105.59753496831739]
This paper introduces a hybrid-level metric incorporating pixel- and target-level performance, proposing a systematic error analysis method, and emphasizing the importance of cross-dataset evaluation.<n>An open-source toolkit has be released to facilitate standardized benchmarking.
arXiv Detail & Related papers (2025-09-21T02:45:07Z) - MalGuard: Towards Real-Time, Accurate, and Actionable Detection of Malicious Packages in PyPI Ecosystem [11.834078597426409]
Malicious package detection has become a critical task in ensuring the security and stability of the PyPI.<n>Existing detection approaches have focused on advancing model selection, evolving from traditional machine learning (ML) models to large language models (LLMs)<n>We propose a novel approach MalGuard based on graph centrality analysis and the LIME (Local Interpretable Model-agnostic Explanations) algorithm to detect malicious packages.
arXiv Detail & Related papers (2025-06-17T12:30:56Z) - Mono: Is Your "Clean" Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond [10.072175823846973]
Existing security patches often suffer from inaccurate labels, insufficient contextual information, and undecidable patches.<n>We present mono, a novel framework that simulates human experts' reasoning process to construct reliable vulnerability datasets.<n> mono can correct 31.0% of labeling errors, recover 89% of inter-procedural vulnerabilities, and reveals that 16.7% of CVEs contain undecidable patches.
arXiv Detail & Related papers (2025-06-04T07:43:04Z) - A Machine Learning-Based Approach For Detecting Malicious PyPI Packages [4.311626046942916]
In modern software development, the use of external libraries and packages is increasingly prevalent.<n>This reliance on reusing code introduces serious risks for deployed software in the form of malicious packages.<n>We propose a data-driven approach that uses machine learning and static analysis to examine the package's metadata, code, files, and textual characteristics.
arXiv Detail & Related papers (2024-12-06T18:49:06Z) - Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning.
One-sided feedback--where only positive examples are observed during training--is inherent in many practical problems.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z) - A Large-scale Fine-grained Analysis of Packages in Open-Source Software Ecosystems [13.610690659041417]
Malicious packages have less metadata content and utilize fewer static and dynamic functions than legitimate ones.
One dimension in fine-grained information (FGI) has sufficient distinguishable capability to detect malicious packages.
arXiv Detail & Related papers (2024-04-17T15:16:01Z) - Malicious Package Detection using Metadata Information [0.272760415353533]
We introduce a metadata-based malicious package detection model, MeMPtec.
MeMPtec extracts a set of features from package metadata information.
Our experiments indicate a significant reduction in both false positives and false negatives.
arXiv Detail & Related papers (2024-02-12T06:54:57Z) - Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset.
arXiv Detail & Related papers (2023-06-08T07:05:36Z) - Is Vertical Logistic Regression Privacy-Preserving? A Comprehensive
Privacy Analysis and Beyond [57.10914865054868]
We consider vertical logistic regression (VLR) trained with mini-batch descent gradient.
We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks.
arXiv Detail & Related papers (2022-07-19T05:47:30Z) - Differential privacy and robust statistics in high dimensions [49.50869296871643]
High-dimensional Propose-Test-Release (HPTR) builds upon three crucial components: the exponential mechanism, robust statistics, and the Propose-Test-Release mechanism.
We show that HPTR nearly achieves the optimal sample complexity under several scenarios studied in the literature.
arXiv Detail & Related papers (2021-11-12T06:36:40Z) - Conservative Policy Construction Using Variational Autoencoders for
Logged Data with Missing Values [77.99648230758491]
We consider the problem of constructing personalized policies using logged data when there are missing values in the attributes of features.
The goal is to recommend an action when $Xt$, a degraded version of $Xb$ with missing values, is observed.
In particular, we introduce the textitconservative strategy where the policy is designed to safely handle the uncertainty due to missingness.
arXiv Detail & Related papers (2021-09-08T16:09:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.