Related papers: Revealing the value of Repository Centrality in lifespan prediction of Open Source Software Projects

URL: http://arxiv.org/abs/2405.07508v1
Date: Mon, 13 May 2024 07:07:54 GMT
Title: Revealing the value of Repository Centrality in lifespan prediction of Open Source Software Projects
Authors: Runzhi He, Hengzhi Ye, Minghui Zhou
Abstract summary: We propose a novel metric from the user-repository network, and leverage the metric to fit project deprecation predictors. We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning from 2011 to 2023. Our study reveals a correlation between the HITS centrality metrics and the repository deprecation risk.
Score: 5.438725298163702
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Background: Open Source Software is the building block of modern software. However, the prevalence of project deprecation in the open source world weakens the integrity of downstream systems and the broader ecosystem. This calls for efforts to monitor and predict project deprecation, empowering stakeholders to take proactive measures. Challenge: Existing techniques mainly rely on static features captured at a single point in time to make predictions, which limits their effectiveness. Goal: We propose a novel metric from the user-repository network, and leverage the metric to fit project deprecation predictors and prove its real-life implications. Method: We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning from 2011 to 2023. We propose repository centrality, a family of HITS weights that captures shifts in the popularity of a repository in the repository-user star network. With this metric, we further use advances in gradient boosting and deep learning to fit survival analysis models that predict a project's lifespan or its survival hazard. Results: Our study reveals a correlation between the HITS centrality metrics and the repository deprecation risk. A drop in the HITS weights of a repository indicates a decline in its centrality and prevalence, leading to an increase in its deprecation risk and a decrease in its expected lifespan. Our predictive models powered by repository centrality and other repository features achieve satisfactory accuracy on the test set, with repository centrality being the most significant feature among all. Implications: This research offers a novel perspective on understanding the effect of prevalence on the deprecation of OSS repositories. Our approach to predicting repository deprecation helps detect the health status of projects and enables action in advance, fostering a more resilient OSS ecosystem.
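
To make the repository centrality idea in the abstract more concrete, here is a minimal sketch of HITS authority scores on a bipartite user-repository star network, computed for two time windows. It is an illustration only, not the authors' implementation: the function name `hits_authority`, the toy star lists, the fixed iteration count, and the L2 normalization are all assumptions, and the paper's actual family of HITS weights and its time windowing are not reproduced here.

```python
def hits_authority(stars, iterations=50):
    """Return {repo: authority score} for a list of (user, repo) star edges.

    Users act as hubs and repositories as authorities. Scores are
    L2-normalized after every update (an assumption of this sketch).
    """
    edges = list(set(stars))  # ignore duplicate stars
    users = {u for u, _ in edges}
    repos = {r for _, r in edges}
    hub = {u: 1.0 for u in users}
    auth = {r: 1.0 for r in repos}

    for _ in range(iterations):
        # Authority update: sum the hub scores of the users who starred the repo.
        new_auth = dict.fromkeys(repos, 0.0)
        for u, r in edges:
            new_auth[r] += hub[u]
        norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        auth = {r: v / norm for r, v in new_auth.items()}

        # Hub update: sum the authority scores of the repos the user starred.
        new_hub = dict.fromkeys(users, 0.0)
        for u, r in edges:
            new_hub[u] += auth[r]
        norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        hub = {u: v / norm for u, v in new_hub.items()}

    return auth


# Hypothetical toy data: stars observed in an earlier and a later window.
earlier = [("alice", "A"), ("bob", "A"), ("carol", "A"), ("carol", "B")]
later = earlier + [("dave", "B"), ("erin", "B"), ("frank", "B")]

before = hits_authority(earlier)
after = hits_authority(later)

# Repository A gains no new stars while B does, so A's authority score falls.
print(f"A: {before['A']:.3f} -> {after['A']:.3f}")
print(f"B: {before['B']:.3f} -> {after['B']:.3f}")
```

In this toy example, repository A's score drops between the two windows even though it loses no stars, because attention in the network shifts to B. A relative decline of this kind is what the abstract associates with higher deprecation risk; feeding such scores into a survival model is a separate step not shown here.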

Related papers

An Empirical Validation of Open Source Repository Stability Metrics [5.69361786082969]
We present the first empirical validation of the proposed Composite Stability Index (CSI) by experimenting with 100 highly ranked GitHub repositories. Our results suggest that (1) sampling weekly commit frequency pattern instead of daily is a more feasible measure of commit frequency stability across repositories. These findings both confirm the viability of a control-theoretic lens on open-source health and provide concrete, evidence-backed applications for real-world project monitoring tools.
arXiv Detail & Related papers (2025-08-02T13:14:10Z)
Predicting Maintenance Cessation of Open Source Software Repositories with An Integrated Feature Framework [14.346295005927347]
Maintenance risks of open source software (OSS) projects pose significant threats to the quality, security, and resilience of modern software supply chains. We introduce "maintenance cessation", grounded in explicit archival status and rigorous semantic analysis of project documentation. We propose an integrated, multi-perspective feature framework for predicting maintenance cessation, systematically combining user-centric features, maintainer-centric features and project evolution features.
arXiv Detail & Related papers (2025-07-29T10:45:24Z)
Forecasting the risk of software choices: A model to foretell security vulnerabilities from library dependencies and source code evolution [4.538870924201896]
We introduce a model capable of vulnerability forecasting at library level. Our model can estimate the probability that a software project faces a CVE disclosure in a future time window.
arXiv Detail & Related papers (2024-11-17T23:36:27Z)
QBI: Quantile-Based Bias Initialization for Efficient Private Data Reconstruction in Federated Learning [0.5497663232622965]
Federated learning enables the training of machine learning models on distributed data without compromising user privacy. Recent research has shown that the central entity can perfectly reconstruct private data from shared model updates.
arXiv Detail & Related papers (2024-06-26T20:19:32Z)
Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting, including multiple models, supporting several datasets. We devise two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z)
EvCenterNet: Uncertainty Estimation for Object Detection using Evidential Learning [26.535329379980094]
EvCenterNet is a novel uncertainty-aware 2D object detection framework. We employ evidential learning to estimate both classification and regression uncertainties. We train our model on the KITTI dataset and evaluate it on challenging out-of-distribution datasets.
arXiv Detail & Related papers (2023-03-06T11:07:11Z)
Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z)
Uncertainty Modeling for Out-of-Distribution Generalization [56.957731893992495]
We argue that the feature statistics can be properly manipulated to improve the generalization ability of deep learning models. Common methods often consider the feature statistics as deterministic values measured from the learned features. We improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.
arXiv Detail & Related papers (2022-02-08T16:09:12Z)
Learning to Predict Trustworthiness with Steep Slope Loss [69.40817968905495]
We study the problem of predicting trustworthiness on real-world large-scale datasets. We observe that the trustworthiness predictors trained with prior-art loss functions are prone to view both correct predictions and incorrect predictions to be trustworthy. We propose a novel steep slope loss to separate the features w.r.t. correct predictions from the ones w.r.t. incorrect predictions by two slide-like curves that oppose each other.
arXiv Detail & Related papers (2021-09-30T19:19:09Z)
Predicting the Number of Reported Bugs in a Software Repository [0.0]
We examine eight different time series forecasting models, including Long Short Term Memory Neural Networks (LSTM), auto-regressive integrated moving average (ARIMA), and Random Forest Regressor. We analyze the quality of long-term prediction for each model based on different performance metrics. The assessment is conducted on Mozilla, which is a large open-source software application.
arXiv Detail & Related papers (2021-04-24T19:06:35Z)
Moving from Cross-Project Defect Prediction to Heterogeneous Defect Prediction: A Partial Replication Study [0.0]
Earlier studies often used machine learning techniques to build, validate, and improve bug prediction models. Knowledge coming from those models will not be overlapping to a target project if no sufficient metrics have been collected in the source projects. We systematically integrated Heterogeneous Defect Prediction (HDP) by replicating and validating the obtained results. Our results shed light on the infeasibility of many cases for the HDP algorithm due to its sensitivity to the parameter selection.
arXiv Detail & Related papers (2021-03-05T06:29:45Z)
UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model. UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data. We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD). UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
Learning Output Embeddings in Structured Prediction [73.99064151691597]
A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension. A prediction in the original space is computed by solving a pre-image problem. In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function into the new feature space.
arXiv Detail & Related papers (2020-07-29T09:32:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.