Fork Entropy: Assessing the Diversity of Open Source Software Projects'
Forks
- URL: http://arxiv.org/abs/2205.09931v2
- Date: Tue, 19 Sep 2023 13:24:49 GMT
- Title: Fork Entropy: Assessing the Diversity of Open Source Software Projects'
Forks
- Authors: Liang Wang, Zhiwen Zheng, Xiangchen Wu, Baihui Sang, Jierui Zhang,
Xianping Tao
- Abstract summary: We propose an approach to measure the diversity of an OSS project's forks (i.e., its fork population).
We devise a novel fork entropy metric based on Rao's quadratic entropy to measure such diversity.
With properties including symmetry, continuity, and monotonicity, the proposed fork entropy metric is effective in quantifying the diversity of a project's fork population.
- Score: 5.731244417287598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: On open source software (OSS) platforms such as GitHub, forking and accepting
pull-requests is an important approach for OSS projects to receive
contributions, especially from external contributors who cannot directly commit
into the source repositories. Having a large number of forks is often
considered as an indicator of a project being popular. While extensive studies
have been conducted to understand the reasons of forking, communications
between forks, features and impacts of forks, there are few quantitative
measures that can provide a simple yet informative way to gain insights about
an OSS project's forks besides their count. Inspired by studies on biodiversity
and OSS team diversity, in this paper, we propose an approach to measure the
diversity of an OSS project's forks (i.e., its fork population). We devise a
novel fork entropy metric based on Rao's quadratic entropy to measure such
diversity according to the forks' modifications to project files. With
properties including symmetry, continuity, and monotonicity, the proposed fork
entropy metric is effective in quantifying the diversity of a project's fork
population. To further examine the usefulness of the proposed metric, we
conduct empirical studies with data retrieved from fifty projects on GitHub. We
observe significant correlations between a project's fork entropy and different
outcome variables including the project's external productivity measured by the
number of external contributors' commits, acceptance rate of external
contributors' pull-requests, and the number of reported bugs. We also observe
significant interactions between fork entropy and other factors such as the
number of forks. The results suggest that fork entropy effectively enriches our
understanding of OSS projects' forks beyond the simple number of forks, and can
potentially support further research and applications.
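As a rough illustration of the quantity the abstract builds on, Rao's quadratic entropy can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' definition of fork entropy: the abstract does not say how the dissimilarity between two forks is computed or how forks are weighted, so the Jaccard distance over sets of modified file paths, the uniform weights, and the function names jaccard_distance and rao_quadratic_entropy are all hypothetical choices made here for illustration.

```python
from itertools import combinations


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical.

    Assumption: this distance is an illustrative choice, not taken from the paper.
    """
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rao_quadratic_entropy(items, weights=None, distance=jaccard_distance) -> float:
    """Rao's quadratic entropy: Q = sum_i sum_j d(i, j) * p_i * p_j.

    items    -- one entry per fork (here: a frozenset of modified file paths)
    weights  -- relative abundances p_i summing to 1 (assumed uniform if omitted)
    distance -- symmetric dissimilarity with d(i, i) == 0
    """
    n = len(items)
    if n == 0:
        return 0.0
    if weights is None:
        weights = [1.0 / n] * n
    total = 0.0
    # d is symmetric and zero on the diagonal, so each unordered pair counts twice.
    for i, j in combinations(range(n), 2):
        total += 2.0 * distance(items[i], items[j]) * weights[i] * weights[j]
    return total


# Hypothetical fork population: each fork is the set of files it modified.
forks = [
    frozenset({"src/a.py", "src/b.py"}),
    frozenset({"src/b.py", "src/c.py"}),
    frozenset({"docs/d.md"}),
]
entropy = rao_quadratic_entropy(forks)
```

Under these assumptions, a population of forks that all touch the same files scores 0, and the score grows as forks modify increasingly disjoint parts of the project.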
Related papers
- The New Dynamics of Open Source: Relicensing, Forks, & Community Impact [0.0]
Vendors are relicensing popular open source projects to more restrictive licenses in the hopes of generating more revenue.
This research compares organizational affiliation data from three case studies based on license changes that resulted in forks.
Research indicates that the forks resulting from these relicensing events have more organizational diversity than the original projects.
arXiv Detail & Related papers (2024-11-07T14:21:45Z)
- Impact of the Availability of ChatGPT on Software Development: A Synthetic Difference in Differences Estimation using GitHub Data [49.1574468325115]
ChatGPT is an AI tool that enhances software production efficiency.
We estimate ChatGPT's effects on the number of git pushes, repositories, and unique developers per 100,000 people.
These results suggest that AI tools like ChatGPT can substantially boost developer productivity, though further analysis is needed to address potential downsides such as low quality code and privacy concerns.
arXiv Detail & Related papers (2024-06-16T19:11:15Z)
- A Novel Approach for Automated Design Information Mining from Issue Logs [3.5665328754813768]
DRMiner is a novel method to automatically mine latent design rationales from developers' live discussions in open-source communities.
We acquire issue logs from Cassandra, Flink, and Solr repositories in Jira, and then annotate and process them under a rigorous scheme.
DRMiner achieves an F1 score of 65% for mining design rationales, outperforming all baselines with a 7% improvement over GPT-4.0.
arXiv Detail & Related papers (2024-05-30T02:20:04Z)
- A Unified Causal View of Instruction Tuning [76.1000380429553]
We develop a meta Structural Causal Model (meta-SCM) to integrate different NLP tasks under a single causal structure of the data.
The key idea is to learn task-required causal factors and use only those to make predictions for a given task.
arXiv Detail & Related papers (2024-02-09T07:12:56Z)
- Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles [95.49699178874683]
We propose DiffDiv, an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs).
We show that DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features.
We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals.
arXiv Detail & Related papers (2023-11-23T15:47:33Z)
- Towards a Structural Equation Model of Open Source Blockchain Software Health [0.0]
This work uses exploratory factor analysis to identify latent constructs that are representative of general public interest or popularity in software.
We find that interest is a combination of stars, forks, and text mentions in the GitHub repository, while a second factor for robustness is composed of a criticality score.
A structural model of software health is proposed such that general interest positively influences developer engagement, which, in turn, positively predicts software robustness.
arXiv Detail & Related papers (2023-10-31T08:47:41Z)
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models [57.933500846742234]
Recent work recognizes that structured outliers are the critical bottleneck for quantization performance.
We propose an outlier suppression framework including two components: Gamma Migration and Token-Wise Clipping.
This framework effectively suppresses the outliers and can be used in a plug-and-play mode.
arXiv Detail & Related papers (2022-09-27T12:05:59Z)
- LAGOON: An Analysis Tool for Open Source Communities [7.3861897382622015]
LAGOON is an open source platform for understanding the ecosystems of Open Source Software (OSS) communities.
LAGOON ingests artifacts from several common sources, including source code repositories, issue trackers, mailing lists, and content scraped from websites.
A user interface is provided for visualization and exploration of an OSS project's complete sociotechnical graph.
arXiv Detail & Related papers (2022-01-26T18:52:11Z)
- MetaKernel: Learning Variational Random Features with Limited Labels [120.90737681252594]
Few-shot learning deals with the fundamental and challenging problem of learning from a few annotated samples, while being able to generalize well on new tasks.
We propose meta-learning kernels with random Fourier features for few-shot learning, which we call MetaKernel.
arXiv Detail & Related papers (2021-05-08T21:24:09Z)
- Which contributions count? Analysis of attribution in open source [0.0]
We characterize contributor acknowledgment models in open source by analyzing thousands of projects.
We find that community-generated systems of contribution acknowledgment make work like idea generation or bug finding more visible.
arXiv Detail & Related papers (2021-03-19T20:14:40Z)
- KILT: a Benchmark for Knowledge Intensive Language Tasks [102.33046195554886]
We present a benchmark for knowledge-intensive language tasks (KILT).
All tasks in KILT are grounded in the same snapshot of Wikipedia.
We find that a shared dense vector index coupled with a seq2seq model is a strong baseline.
arXiv Detail & Related papers (2020-09-04T15:32:19Z)