An Empirical Analysis of the R Package Ecosystem
- URL: http://arxiv.org/abs/2102.09904v1
- Date: Fri, 19 Feb 2021 12:55:18 GMT
- Title: An Empirical Analysis of the R Package Ecosystem
- Authors: Ethan Bommarito, Michael J Bommarito II
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this research, we present a comprehensive, longitudinal empirical summary
of the R package ecosystem, including not just CRAN, but also Bioconductor and
GitHub. We analyze more than 25,000 packages, 150,000 releases, and 15 million
files across two decades, providing comprehensive counts and trends for common
metrics across packages, releases, authors, licenses, and other important
metadata. We find that the historical growth of the ecosystem has been robust
under all measures, with a compound annual growth rate of 29% for active
packages, 28% for new releases, and 26% for active maintainers. As with many
similar social systems, we find a number of highly right-skewed distributions
with practical implications, including the distribution of releases per
package, packages and releases per author or maintainer, package and maintainer
dependency in-degree, and size per package and release. For example, the top
five packages are imported by nearly 25% of all packages, and the top ten
maintainers support packages that are imported by over half of all packages. We
also highlight the dynamic nature of the ecosystem, recording both dramatic
acceleration and notable deceleration in the growth of R. From a licensing
perspective, we find a notable majority of packages are distributed under
copyleft licensing or omit licensing information entirely. The data, methods,
and calculations herein provide an anchor for public discourse and industry
decisions related to R and CRAN, serving as a foundation for future research on
the R software ecosystem and "data science" more broadly.
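
As a rough illustration of two quantities quoted in the abstract, the sketch below computes a compound annual growth rate and the share of packages that import at least one of the top-k most-imported packages. It is not the authors' code. The function names `cagr` and `top_k_import_share`, the toy dependency map, and the "imports at least one of the top k" reading of the import statistic are all assumptions made for illustration. The toy data is invented and does not reproduce any figure from the paper.

```python
from collections import Counter


def cagr(start_value, end_value, years):
    """Compound annual growth rate between two counts `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


def top_k_import_share(imports, k):
    """Share of packages importing at least one of the k most-imported packages.

    `imports` maps each package name to the set of packages it imports.
    This is one possible reading of "imported by X% of all packages".
    """
    # In-degree: how many packages import each dependency.
    in_degree = Counter(dep for deps in imports.values() for dep in deps)
    top = {name for name, _ in in_degree.most_common(k)}
    importers = sum(1 for deps in imports.values() if top & set(deps))
    return importers / len(imports)


# Hypothetical toy ecosystem, invented purely for illustration.
toy_imports = {
    "a": {"core"},
    "b": {"core", "util"},
    "c": {"core"},
    "d": {"util"},
    "e": set(),
    "x": set(),
    "core": set(),
    "util": set(),
}

growth = cagr(100, 400, 2)                    # count quadruples over two years
share_top1 = top_k_import_share(toy_imports, 1)
share_top2 = top_k_import_share(toy_imports, 2)
```

In the toy data, `core` has the highest in-degree, so `share_top1` counts the three packages that import it out of eight in total. A heavily right-skewed in-degree distribution, as the abstract describes, would make this share large even for small k.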