npm-follower: A Complete Dataset Tracking the NPM Ecosystem
- URL: http://arxiv.org/abs/2308.12545v1
- Date: Thu, 24 Aug 2023 04:05:49 GMT
- Title: npm-follower: A Complete Dataset Tracking the NPM Ecosystem
- Authors: Donald Pinckney, Federico Cassano, Arjun Guha, Jonathan Bell
- Abstract summary: npm-follower is a dataset and crawling architecture which archives metadata and code of all packages and versions as they are published.
The dataset currently includes over 35 million versions of packages, and grows at a rate of about 1 million versions per month.
- Score: 5.931961380320841
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Software developers typically rely upon a large network of dependencies to
build their applications. For instance, the NPM package repository contains
over 3 million packages and serves tens of billions of downloads weekly.
Understanding the structure and nature of packages, dependencies, and published
code requires datasets that provide researchers with easy access to metadata
and code of packages. However, prior work on NPM dataset construction typically
has two limitations: 1) only metadata is scraped, and 2) packages or versions
that are deleted from NPM cannot be scraped. Over 330,000 versions of packages
were deleted from NPM between July 2022 and May 2023. This data is critical for
researchers as it often pertains to important questions of security and
malware. We present npm-follower, a dataset and crawling architecture which
archives metadata and code of all packages and versions as they are published,
and is thus able to retain data which is later deleted. The dataset currently
includes over 35 million versions of packages, and grows at a rate of about 1
million versions per month. The dataset is designed to be easily used by
researchers answering questions involving either metadata or program analysis.
Both the code and dataset are available at https://dependencies.science.
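The archiving idea in the abstract (record every version when it is published, so that later deletions remain visible) can be illustrated with a minimal sketch. This is not the npm-follower implementation: the `Snapshot` shape, the function names, and the package name `example-pkg` are all hypothetical, and a real crawler would obtain snapshots from the registry rather than from in-memory dictionaries.

```python
from typing import Dict, Set

# Hypothetical shape: package name -> set of version strings seen in one crawl.
Snapshot = Dict[str, Set[str]]


def merge_into_archive(archive: Snapshot, snapshot: Snapshot) -> None:
    """Record every (package, version) seen in the snapshot; never remove anything."""
    for name, versions in snapshot.items():
        archive.setdefault(name, set()).update(versions)


def deleted_versions(archive: Snapshot, snapshot: Snapshot) -> Snapshot:
    """Return versions held in the archive that are absent from the latest snapshot."""
    missing: Snapshot = {}
    for name, versions in archive.items():
        gone = versions - snapshot.get(name, set())
        if gone:
            missing[name] = gone
    return missing


if __name__ == "__main__":
    archive: Snapshot = {}
    first_crawl = {"example-pkg": {"1.0.0", "1.1.0"}}
    second_crawl = {"example-pkg": {"1.1.0", "1.2.0"}}  # 1.0.0 has been deleted upstream
    merge_into_archive(archive, first_crawl)
    merge_into_archive(archive, second_crawl)
    print(deleted_versions(archive, second_crawl))
```

Because the archive only ever grows, a version that disappears from a later crawl is still retained and can be reported as deleted.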
Related papers
- MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens [113.9621845919304]
We release MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date.
MINT-1T comprises one trillion text tokens and 3.4 billion images, a 10x scale-up from existing open-source datasets.
Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS.
arXiv Detail & Related papers (2024-06-17T07:21:36Z)
- PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages [24.8919191161202]
Existing tools can only retrieve repository information for up to 70.5% of PyPI releases.
This paper proposes PyRadar, a novel framework that utilizes the metadata and source distribution to retrieve and validate the repository information for PyPI releases.
arXiv Detail & Related papers (2024-04-25T12:27:59Z)
- CAM: A Collection of Snapshots of GitHub Java Repositories Together with Metrics [0.0]
The latest archive of 2.2Gb that we published on the 2nd of March, 2024 includes 532K Java classes with 48 metrics for each class.
At least once a year, we execute the entire script, a process which requires a minimum of ten days on a very powerful server, to generate a new dataset.
arXiv Detail & Related papers (2024-03-13T12:52:57Z)
- DONAPI: Malicious NPM Packages Detector using Behavior Sequence Knowledge Mapping [28.852274185512236]
npm is the most extensive package manager, hosting more than 2 million third-party open-source packages.
In this paper, we synchronize a local package cache containing more than 3.4 million packages in near real-time to give us access to more package code details.
We propose the DONAPI, an automatic malicious npm packages detector that combines static and dynamic analysis.
arXiv Detail & Related papers (2024-03-13T08:38:21Z)
- The Stackage Repository: An Exploratory Study of its Evolution [0.0]
This paper conducts empirical research about the evolution of Stackage considering monad packages.
To the best of our knowledge, this is the first large-scale analysis of the evolution of the Stackage repository regarding packages used and monads.
arXiv Detail & Related papers (2023-10-16T23:42:47Z)
- On the Feasibility of Cross-Language Detection of Malicious Packages in npm and PyPI [6.935278888313423]
Malicious users started to spread malware by publishing open-source packages containing malicious code.
Recent works apply machine learning techniques to detect malicious packages in the npm ecosystem.
We present a novel approach that involves a set of language-independent features and the training of models capable of detecting malicious packages in npm and PyPI.
arXiv Detail & Related papers (2023-10-14T12:32:51Z)
- DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a python software package for analysing and characterising high-dimensional data.
It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
arXiv Detail & Related papers (2022-05-04T08:41:59Z)
- PyHHMM: A Python Library for Heterogeneous Hidden Markov Models [63.01207205641885]
PyHHMM is an object-oriented Python implementation of Heterogeneous-Hidden Markov Models (HHMMs).
PyHHMM emphasizes features not supported in similar available frameworks: a heterogeneous observation model, missing data inference, different model order selection criteria, and semi-supervised training.
PyHHMM relies on the numpy, scipy, scikit-learn, and seaborn Python packages, and is distributed under the Apache-2.0 License.
arXiv Detail & Related papers (2022-01-12T07:32:36Z)
- The Problem of Zombie Datasets: A Framework For Deprecating Datasets [55.878249096379804]
We examine the public afterlives of several prominent datasets, including ImageNet, 80 Million Tiny Images, MS-Celeb-1M, Duke MTMC, Brainwash, and HRT Transgender.
We propose a dataset deprecation framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocol, and publication checks.
arXiv Detail & Related papers (2021-10-18T20:13:51Z)
- Datasets: A Community Library for Natural Language Processing [55.48866401721244]
datasets is a community library for contemporary NLP.
The library includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects.
arXiv Detail & Related papers (2021-09-07T03:59:22Z)
- Sketch and Scale: Geo-distributed tSNE and UMAP [75.44887265789056]
Running machine learning analytics over geographically distributed datasets is a rapidly arising problem.
We introduce a novel framework: Sketch and Scale (SnS).
It leverages a Count Sketch data structure to compress the data on the edge nodes, aggregates the reduced size sketches on the master node, and runs vanilla tSNE or UMAP on the summary.
We show this technique to be fully parallel and to scale linearly in time and logarithmically in memory and communication, making it possible to analyze datasets with many millions, potentially billions, of data points spread across several data centers around the globe.
arXiv Detail & Related papers (2020-11-11T22:32:21Z)