Related papers: Fork Entropy: Assessing the Diversity of Open Source Software Projects' Forks

URL: http://arxiv.org/abs/2205.09931v2
Date: Tue, 19 Sep 2023 13:24:49 GMT
Title: Fork Entropy: Assessing the Diversity of Open Source Software Projects' Forks
Authors: Liang Wang, Zhiwen Zheng, Xiangchen Wu, Baihui Sang, Jierui Zhang, Xianping Tao
Abstract summary: We propose an approach to measure the diversity of an OSS project's forks (i.e., its fork population). We devise a novel fork entropy metric based on Rao's quadratic entropy to measure such diversity. With properties including symmetry, continuity, and monotonicity, the proposed fork entropy metric is effective in quantifying the diversity of a project's fork population.
Score: 5.731244417287598
License: http://creativecommons.org/licenses/by/4.0/
Abstract: On open source software (OSS) platforms such as GitHub, forking and accepting pull-requests is an important approach for OSS projects to receive contributions, especially from external contributors who cannot directly commit into the source repositories. Having a large number of forks is often considered an indicator of a project being popular. While extensive studies have been conducted to understand the reasons for forking, communications between forks, and the features and impacts of forks, there are few quantitative measures that can provide a simple yet informative way to gain insights about an OSS project's forks besides their count. Inspired by studies on biodiversity and OSS team diversity, in this paper, we propose an approach to measure the diversity of an OSS project's forks (i.e., its fork population). We devise a novel fork entropy metric based on Rao's quadratic entropy to measure such diversity according to the forks' modifications to project files. With properties including symmetry, continuity, and monotonicity, the proposed fork entropy metric is effective in quantifying the diversity of a project's fork population. To further examine the usefulness of the proposed metric, we conduct empirical studies with data retrieved from fifty projects on GitHub. We observe significant correlations between a project's fork entropy and different outcome variables including the project's external productivity measured by the number of external contributors' commits, acceptance rate of external contributors' pull-requests, and the number of reported bugs. We also observe significant interactions between fork entropy and other factors such as the number of forks. The results suggest that fork entropy effectively enriches our understanding of OSS projects' forks beyond the simple number of forks, and can potentially support further research and applications.
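Illustration: the abstract names Rao's quadratic entropy as the basis of the metric. In its general form it is Q = sum_i sum_j d_ij * p_i * p_j, where p_i is the weight of item i and d_ij is the dissimilarity between items i and j. The sketch below is only a rough illustration of that general formula, not the paper's definition of fork entropy. It assumes each fork is represented by the set of files it modified, that every fork carries equal weight, and that dissimilarity is the Jaccard distance between file sets. All three choices, and the names `jaccard_distance`, `rao_quadratic_entropy`, and `fork_diversity`, are hypothetical.

```python
from typing import Sequence, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Dissimilarity between two sets of modified files.

    Assumption: Jaccard distance is used here only for illustration;
    the paper may define dissimilarity differently.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rao_quadratic_entropy(weights: Sequence[float],
                          items: Sequence[Set[str]]) -> float:
    """General form of Rao's quadratic entropy: sum_i sum_j d_ij * p_i * p_j."""
    if len(weights) != len(items):
        raise ValueError("weights and items must have the same length")
    total = 0.0
    for i, item_i in enumerate(items):
        for j, item_j in enumerate(items):
            total += jaccard_distance(item_i, item_j) * weights[i] * weights[j]
    return total


def fork_diversity(forks: Sequence[Set[str]]) -> float:
    """Diversity of a fork population under the assumptions above.

    Assumption: every fork gets the same weight 1/n.
    """
    n = len(forks)
    if n == 0:
        return 0.0
    weights = [1.0 / n] * n
    return rao_quadratic_entropy(weights, forks)


if __name__ == "__main__":
    # Hypothetical fork population: each set lists files a fork modified.
    population = [{"core.py", "README.md"}, {"core.py"}, {"docs/index.md"}]
    print(fork_diversity(population))
```

Under these assumptions, forks that all modify the same files score 0, and two forks that touch disjoint files score 0.5. Reordering the forks does not change the value, which matches the symmetry property mentioned in the abstract.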

Related papers

Analyzing the Usage of Donation Platforms for PyPI Libraries [91.97201077607862]
This study analyzes the adoption of donation platforms in the PyPI ecosystem. GitHub Sponsors is the dominant platform, though many PyPI-listed links are outdated.
arXiv Detail & Related papers (2025-03-11T10:27:31Z)
Darkit: A User-Friendly Software Toolkit for Spiking Large Language Model [50.37090759139591]
Large language models (LLMs) have been widely applied in various practical applications, typically comprising billions of parameters. The human brain, employing bio-plausible spiking mechanisms, can accomplish the same tasks while significantly reducing energy consumption. We are releasing a software toolkit named DarwinKit (Darkit) to accelerate the adoption of brain-inspired large language models.
arXiv Detail & Related papers (2024-12-20T07:50:08Z)
The New Dynamics of Open Source: Relicensing, Forks, & Community Impact [0.0]
Vendors are relicensing popular open source projects to more restrictive licenses in the hopes of generating more revenue. This research compares organizational affiliation data from three case studies based on license changes that resulted in forks. Research indicates that the forks resulting from these relicensing events have more organizational diversity than the original projects.
arXiv Detail & Related papers (2024-11-07T14:21:45Z)
Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data [49.1574468325115]
ChatGPT is an AI tool that enhances software production efficiency. We estimate ChatGPT's effects on the number of git pushes, repositories, and unique developers per 100,000 people. These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low-quality code and privacy concerns.
arXiv Detail & Related papers (2024-06-16T19:11:15Z)
A Novel Approach for Automated Design Information Mining from Issue Logs [3.5665328754813768]
DRMiner is a novel method to automatically mine latent design rationales from developers' live discussions in open-source communities. We acquire issue logs from Cassandra, Flink, and Solr repositories in Jira, and then annotate and process them under a rigorous scheme. DRMiner achieves an F1 score of 65% for mining design rationales, outperforming all baselines with a 7% improvement over GPT-4.0.
arXiv Detail & Related papers (2024-05-30T02:20:04Z)
A Unified Causal View of Instruction Tuning [76.1000380429553]
We develop a meta Structural Causal Model (meta-SCM) to integrate different NLP tasks under a single causal structure of the data. The key idea is to learn task-required causal factors and use only those to make predictions for a given task.
arXiv Detail & Related papers (2024-02-09T07:12:56Z)
Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles [95.49699178874683]
We propose DiffDiv, an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs). We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features. We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z)
Towards a Structural Equation Model of Open Source Blockchain Software Health [0.0]
This work uses exploratory factor analysis to identify latent constructs that are representative of general public interest or popularity in software. We find that interest is a combination of stars, forks, and text mentions in the GitHub repository, while a second factor for robustness is composed of a criticality score. A structural model of software health is proposed such that general interest positively influences developer engagement, which, in turn, positively predicts software robustness.
arXiv Detail & Related papers (2023-10-31T08:47:41Z)
Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance. We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping. This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
arXiv Detail & Related papers (2022-09-27T12:05:59Z)
LAGOON: An Analysis Tool for Open Source Communities [7.3861897382622015]
LAGOON is an open source platform for understanding the ecosystems of Open Source Software (OSS) communities. LAGOON ingests artifacts from several common sources, including source code repositories, issue trackers, mailing lists, and content scraped from websites. A user interface is provided for visualization and exploration of an OSS project's complete sociotechnical graph.
arXiv Detail & Related papers (2022-01-26T18:52:11Z)
MetaKernel: Learning Variational Random Features with Limited Labels [120.90737681252594]
Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks. We propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel.
arXiv Detail & Related papers (2021-05-08T21:24:09Z)
Which contributions count? Analysis of attribution in open source [0.0]
We characterize contributor acknowledgment models in open source by analyzing thousands of projects. We find that community-generated systems of contribution acknowledgment make work like idea generation or bug finding more visible.
arXiv Detail & Related papers (2021-03-19T20:14:40Z)
KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.