Related papers: An Empirical Analysis of the R Package Ecosystem

URL: http://arxiv.org/abs/2102.09904v1
Date: Fri, 19 Feb 2021 12:55:18 GMT
Title: An Empirical Analysis of the R Package Ecosystem
Authors: Ethan Bommarito, Michael J Bommarito II
Abstract summary: We analyze more than 25,000 packages, 150,000 releases, and 15 million files across two decades. We find that the historical growth of the ecosystem has been robust under all measures.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In this research, we present a comprehensive, longitudinal empirical summary of the R package ecosystem, including not just CRAN, but also Bioconductor and GitHub. We analyze more than 25,000 packages, 150,000 releases, and 15 million files across two decades, providing comprehensive counts and trends for common metrics across packages, releases, authors, licenses, and other important metadata. We find that the historical growth of the ecosystem has been robust under all measures, with a compound annual growth rate of 29% for active packages, 28% for new releases, and 26% for active maintainers. As with many similar social systems, we find a number of highly right-skewed distributions with practical implications, including the distribution of releases per package, packages and releases per author or maintainer, package and maintainer dependency in-degree, and size per package and release. For example, the top five packages are imported by nearly 25% of all packages, and the top ten maintainers support packages that are imported by over half of all packages. We also highlight the dynamic nature of the ecosystem, recording both dramatic acceleration and notable deceleration in the growth of R. From a licensing perspective, we find a notable majority of packages are distributed under copyleft licensing or omit licensing information entirely. The data, methods, and calculations herein provide an anchor for public discourse and industry decisions related to R and CRAN, serving as a foundation for future research on the R software ecosystem and "data science" more broadly.
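The growth figures quoted in the abstract are compound annual growth rates (CAGR). As a minimal sketch of how such a rate is typically computed, the standard formula is (end / start)^(1 / years) - 1. The function names and the sample counts below are hypothetical illustrations and are not taken from the paper's data or code.

```python
import math


def cagr(start_value: float, end_value: float, years: float) -> float:
    """Compound annual growth rate between two observations.

    Standard formula: (end / start) ** (1 / years) - 1.
    """
    if start_value <= 0 or end_value <= 0 or years <= 0:
        raise ValueError("start_value, end_value, and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def doubling_time(rate: float) -> float:
    """Years needed to double at a constant annual growth rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return math.log(2.0) / math.log(1.0 + rate)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the arithmetic.
    rate = cagr(start_value=100, end_value=121, years=2)
    print(f"CAGR: {rate:.1%}")

    # At the 29% rate the abstract reports for active packages,
    # a count would double in a little under three years.
    print(f"Doubling time at 29%: {doubling_time(0.29):.2f} years")
```

Under this definition, a constant 29% annual rate implies a doubling time of roughly 2.7 years, which gives a sense of scale for the growth the abstract describes.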

Related papers

Why Authors and Maintainers Link (or Don't Link) Their PyPI Libraries to Code Repositories and Donation Platforms [83.16077040470975]
Metadata of libraries on the Python Package Index (PyPI) plays a critical role in supporting the transparency, trust, and sustainability of open-source libraries. This paper presents a large-scale empirical study combining two targeted surveys sent to 50,000 PyPI authors and maintainers. We analyze more than 1,400 responses using large language model (LLM)-based topic modeling to uncover key motivations and barriers related to linking repositories and donation platforms.
arXiv Detail & Related papers (2026-01-21T16:13:57Z)
Analyzing the Availability of E-Mail Addresses for PyPI Libraries [89.21869606965578]
81.6% of libraries include at least one valid e-mail address, with PyPI serving as the primary source. We identify over 698,000 invalid entries, primarily due to missing fields.
arXiv Detail & Related papers (2026-01-20T14:54:58Z)
RepoGenesis: Benchmarking End-to-End Microservice Generation from Readme to Repository [52.98970048197381]
RepoGenesis is the first multilingual benchmark for repository-level end-to-end web microservice generation. It consists of 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified. Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java.
arXiv Detail & Related papers (2026-01-20T13:19:20Z)
Insecure Ingredients? Exploring Dependency Update Patterns of Bundled JavaScript Packages on the Web [0.0]
We present Aletheia, a package-agnostic method which dissects JavaScript bundles to identify package versions. We crawl the Tranco top 100,000 domains to reveal that 5% - 20% of domains update their dependencies within 16 weeks.
arXiv Detail & Related papers (2025-12-17T13:43:32Z)
LLMs as Packagers of HPC Software [2.195636219953539]
Tools such as Spack automate dependency resolution and environment management, but their effectiveness relies on manually written build recipes. We introduce SpackIt, an end-to-end framework that combines repository analysis, retrieval of relevant examples, and iterative refinement through diagnostic feedback. Our results show that SpackIt increases installation success from 20% in a zero-shot setting to over 80% in its best configuration.
arXiv Detail & Related papers (2025-11-07T00:06:51Z)
Replication Packages in Software Engineering Secondary Studies: A Systematic Mapping [0.9421843976231371]
Systematic reviews (SRs) summarize state-of-the-art evidence in science, including software engineering (SE). We examined 528 secondary studies published between 2013 and 2023 to analyze the availability and reporting of replication packages.
arXiv Detail & Related papers (2025-04-17T05:11:39Z)
Cream of the Crop: Harvesting Rich, Scalable and Transferable Multi-Modal Data for Instruction Fine-Tuning [59.56171041796373]
We harvest multi-modal instructional data in a robust and efficient manner. We take interaction style as diversity indicator and use a multi-modal rich styler to identify data instruction patterns. Across 10+ experimental settings, validated by 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods.
arXiv Detail & Related papers (2025-03-17T17:11:22Z)
Rethinking Reuse in Dependency Supply Chains: Initial Analysis of NPM packages at the End of the Chain [2.4969046521751768]
This paper advocates for a shift in software development practices toward minimizing reliance on third-party packages. We find that these end-of-chain packages offer unique insights, as they play a key role in the ecosystem.
arXiv Detail & Related papers (2025-03-04T17:26:34Z)
Towards a Classification of Open-Source ML Models and Datasets for Software Engineering [52.257764273141184]
Open-source Pre-Trained Models (PTMs) and datasets provide extensive resources for various Machine Learning (ML) tasks. These resources lack a classification tailored to Software Engineering (SE) needs. We apply an SE-oriented classification to PTMs and datasets on a popular open-source ML repository, Hugging Face (HF), and analyze the evolution of PTMs over time.
arXiv Detail & Related papers (2024-11-14T18:52:05Z)
Measuring Software Innovation with Open Source Software Development Data [0.0]
This paper introduces a novel measure of software innovation based on open source software (OSS) development activity on GitHub. We examine the dependency growth and release complexity among approximately 200,000 unique releases from 28,000 unique packages over two years post-release. We conclude that major releases of OSS packages count as a unit of innovation complementary to scientific publications, patents, and standards.
arXiv Detail & Related papers (2024-11-07T19:11:32Z)
A First Look at Package-to-Group Mechanism: An Empirical Study of the Linux Distributions [20.491275902894273]
A package-to-group mechanism (P2G) is employed to enable unified installation, uninstallation, and updates of multiple packages at once. This paper takes Linux distributions as a case study and presents an empirical study focusing on its application trends, evolutionary patterns, group quality, and developer tendencies.
arXiv Detail & Related papers (2024-10-14T03:48:20Z)
A Systematic Approach to Evaluating Development Activity in Heterogeneous Package Management Systems for Overall System Health Assessment [0.0]
We develop a method to identify packages within a Linux distribution that show low development activity between versions of the OSS projects included in a release. We use regular expressions to extract the epoch and upstream project major, minor, and patch versions for more than 6000 packages in the Ubuntu distribution.
arXiv Detail & Related papers (2024-09-06T19:58:20Z)
How to Understand Whole Software Repository? [64.19431011897515]
An excellent understanding of the whole repository will be the critical path to Automatic Software Engineering (ASE). We develop a novel method named RepoUnderstander by guiding agents to comprehensively understand the whole repositories. To better utilize the repository-level knowledge, we guide the agents to summarize, analyze, and plan.
arXiv Detail & Related papers (2024-06-03T15:20:06Z)
The Stackage Repository: An Exploratory Study of its Evolution [0.0]
This paper conducts empirical research about the evolution of Stackage considering monad packages. To the best of our knowledge, this is the first large-scale analysis of the evolution of the Stackage repository regarding packages used and monads.
arXiv Detail & Related papers (2023-10-16T23:42:47Z)
Characterizing Deep Learning Package Supply Chains in PyPI: Domains, Clusters, and Disengagement [14.938727013935654]
Deep learning (DL) package supply chains are critical for DL frameworks to remain competitive. We analyze the domains, clusters, and disengagement of packages in two representative PyPI DL package SCs. Our study provides rich implications on the maintenance and dependency management practices of PyPI DL SCs.
arXiv Detail & Related papers (2023-06-28T15:34:52Z)
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos [106.06278332186106]
Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Numerous limitations exist within existing public MSMO datasets. We have meticulously curated the MMSum dataset.
arXiv Detail & Related papers (2023-06-07T07:43:11Z)
Promises and Perils of Mining Software Package Ecosystem Data [10.787686237395816]
Third-party packages have led to the emergence of large software package ecosystems with a maze of inter-dependencies. Understanding the infrastructure and dynamics of package ecosystems has given rise to approaches for better code reuse, automated updates, and the avoidance of vulnerabilities. In this chapter, we review promises and perils of mining the rich data related to software package ecosystems available to software engineering researchers.
arXiv Detail & Related papers (2023-05-29T03:09:48Z)
DADApy: Distance-based Analysis of DAta-manifolds in Python [51.37841707191944]
DADApy is a python software package for analysing and characterising high-dimensional data. It provides methods for estimating the intrinsic dimension and the probability density, for performing density-based clustering and for comparing different distance metrics.
arXiv Detail & Related papers (2022-05-04T08:41:59Z)
Scikit-dimension: a Python package for intrinsic dimension estimation [58.8599521537]
This technical note introduces scikit-dimension, an open-source Python package for intrinsic dimension estimation. The scikit-dimension package provides a uniform implementation of most of the known ID estimators based on the scikit-learn application programming interface. We briefly describe the package and demonstrate its use in a large-scale (more than 500 datasets) benchmarking of methods for ID estimation in real-life and synthetic data.
arXiv Detail & Related papers (2021-09-06T16:46:38Z)
CO-Search: COVID-19 Information Retrieval with Semantic Search, Question Answering, and Abstractive Summarization [53.67205506042232]
CO-Search is a retriever-ranker semantic search engine designed to handle complex queries over the COVID-19 literature. To account for the domain-specific and relatively limited dataset, we generate a bipartite graph of document paragraphs and citations. We evaluate our system on the data of the TREC-COVID information retrieval challenge.
arXiv Detail & Related papers (2020-06-17T01:32:48Z)

This list is automatically generated from the titles and abstracts of the papers on this site.