Revealing the value of Repository Centrality in lifespan prediction of Open Source Software Projects
- URL: http://arxiv.org/abs/2405.07508v1
- Date: Mon, 13 May 2024 07:07:54 GMT
- Title: Revealing the value of Repository Centrality in lifespan prediction of Open Source Software Projects
- Authors: Runzhi He, Hengzhi Ye, Minghui Zhou
- Abstract summary: We propose a novel metric from the user-repository network, and leverage the metric to fit project deprecation predictors.
We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning from 2011 to 2023.
Our study reveals a correlation between the HITS centrality metrics and the repository deprecation risk.
- Score: 5.438725298163702
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Open Source Software is the building block of modern software. However, the prevalence of project deprecation in the open source world weakens the integrity of downstream systems and the broader ecosystem. This calls for efforts to monitor and predict project deprecations, empowering stakeholders to take proactive measures. Challenge: Existing techniques mainly make predictions from static features captured at a single point in time, which limits their effectiveness. Goal: We propose a novel metric from the user-repository network, leverage the metric to fit project deprecation predictors, and demonstrate its real-life implications. Method: We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning from 2011 to 2023. We propose repository centrality, a family of HITS weights that captures shifts in the popularity of a repository in the repository-user star network. Building on this metric, we use advances in gradient boosting and deep learning to fit survival analysis models that predict a project's lifespan or its survival hazard. Results: Our study reveals a correlation between the HITS centrality metrics and the repository deprecation risk. A drop in the HITS weights of a repository indicates a decline in its centrality and prevalence, leading to an increase in its deprecation risk and a decrease in its expected lifespan. Our predictive models, powered by repository centrality and other repository features, achieve satisfactory accuracy on the test set, with repository centrality being the most significant feature of all. Implications: This research offers a novel perspective on understanding the effect of prevalence on the deprecation of OSS repositories. Our approach to predicting repository deprecation helps detect the health status of projects and take action in advance, fostering a more resilient OSS ecosystem.
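To make the idea of repository centrality concrete, the sketch below runs the standard HITS power iteration on a toy user-to-repository star graph and compares a repository's authority weight across two snapshots. It is an illustration under stated assumptions, not the authors' implementation: the function name `hits_star_network`, the snapshot data, and the choice of L2 normalisation are all hypothetical, and the paper's "family of HITS weights" may be defined differently.

```python
from collections import defaultdict
from math import sqrt


def hits_star_network(stars, iterations=50):
    """Run HITS power iteration on a bipartite user -> repository star graph.

    `stars` is an iterable of (user, repo) pairs, one per star.
    Returns (hub, authority): hub scores for users and authority scores
    for repositories, each L2-normalised.
    """
    edges = set(stars)  # a user can star a repository only once
    users = {u for u, _ in edges}
    repos = {r for _, r in edges}
    hub = {u: 1.0 for u in users}
    authority = {r: 1.0 for r in repos}

    for _ in range(iterations):
        # Authority update: a repository scores high if strong hubs star it.
        new_auth = defaultdict(float)
        for u, r in edges:
            new_auth[r] += hub[u]
        norm = sqrt(sum(v * v for v in new_auth.values())) or 1.0
        authority = {r: new_auth[r] / norm for r in repos}

        # Hub update: a user scores high if they star strong authorities.
        new_hub = defaultdict(float)
        for u, r in edges:
            new_hub[u] += authority[r]
        norm = sqrt(sum(v * v for v in new_hub.values())) or 1.0
        hub = {u: new_hub[u] / norm for u in users}

    return hub, authority


# Hypothetical snapshots of the star network at two points in time.
snapshot_early = [
    ("alice", "lib-a"), ("bob", "lib-a"), ("carol", "lib-a"),
    ("carol", "lib-b"),
]
# Later, attention shifts to lib-b while lib-a gains no new stars.
snapshot_late = snapshot_early + [
    ("bob", "lib-b"), ("dave", "lib-b"), ("erin", "lib-b"), ("frank", "lib-b"),
]

_, auth_early = hits_star_network(snapshot_early)
_, auth_late = hits_star_network(snapshot_late)

for repo in ("lib-a", "lib-b"):
    print(f"{repo}: {auth_early[repo]:.3f} -> {auth_late[repo]:.3f}")
```

In this toy example `lib-a` keeps all three of its stars, yet its authority weight falls because the rest of the network moves toward `lib-b`. A raw star count cannot show this kind of relative decline, which is the shift in popularity the abstract says repository centrality captures.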
Related papers
- QBI: Quantile-based Bias Initialization for Efficient Private Data Reconstruction in Federated Learning [0.5497663232622965]
Federated learning enables the training of machine learning models on distributed data without compromising user privacy.
Recent research has shown that the central entity can perfectly reconstruct private data from shared model updates.
arXiv Detail & Related papers (2024-06-26T20:19:32Z) - Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting, including multiple models, supporting several datasets.
We devise two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z) - EvCenterNet: Uncertainty Estimation for Object Detection using
Evidential Learning [26.535329379980094]
EvCenterNet is a novel uncertainty-aware 2D object detection framework.
We employ evidential learning to estimate both classification and regression uncertainties.
We train our model on the KITTI dataset and evaluate it on challenging out-of-distribution datasets.
arXiv Detail & Related papers (2023-03-06T11:07:11Z) - Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient
for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z) - Uncertainty Modeling for Out-of-Distribution Generalization [56.957731893992495]
We argue that the feature statistics can be properly manipulated to improve the generalization ability of deep learning models.
Common methods often consider the feature statistics as deterministic values measured from the learned features.
We improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.
arXiv Detail & Related papers (2022-02-08T16:09:12Z) - Evaluating Predictive Distributions: Does Bayesian Deep Learning Work? [45.290773422944866]
Posterior predictive distributions quantify uncertainties ignored by point estimates.
This paper introduces The Neural Testbed, which provides tools for the systematic evaluation of agents that generate such predictions.
arXiv Detail & Related papers (2021-10-09T18:54:02Z) - Learning to Predict Trustworthiness with Steep Slope Loss [69.40817968905495]
We study the problem of predicting trustworthiness on real-world large-scale datasets.
We observe that trustworthiness predictors trained with prior-art loss functions are prone to view both correct and incorrect predictions as trustworthy.
We propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other.
arXiv Detail & Related papers (2021-09-30T19:19:09Z) - Predicting the Number of Reported Bugs in a Software Repository [0.0]
We examine eight different time series forecasting models, including Long Short Term Memory Neural Networks (LSTM), auto-regressive integrated moving average (ARIMA), and Random Forest Regressor.
We analyze the quality of long-term prediction for each model based on different performance metrics.
The assessment is conducted on Mozilla, which is a large open-source software application.
arXiv Detail & Related papers (2021-04-24T19:06:35Z) - Moving from Cross-Project Defect Prediction to Heterogeneous Defect Prediction: A Partial Replication Study [0.0]
Earlier studies often used machine learning techniques to build, validate, and improve bug prediction models.
Knowledge from those models will not carry over to a target project if sufficient metrics have not been collected in the source projects.
We systematically integrated Heterogeneous Defect Prediction (HDP) by replicating and validating the obtained results.
Our results shed light on the infeasibility of many cases for the HDP algorithm due to its sensitivity to the parameter selection.
arXiv Detail & Related papers (2021-03-05T06:29:45Z) - UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines, improving on the best baseline by up to 19%.
arXiv Detail & Related papers (2020-10-22T02:28:11Z) - Learning Output Embeddings in Structured Prediction [73.99064151691597]
A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension.
A prediction in the original space is computed by solving a pre-image problem.
In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function into the new feature space.
arXiv Detail & Related papers (2020-07-29T09:32:53Z)