Predicting Abandonment of Open Source Software Projects with An Integrated Feature Framework
- URL: http://arxiv.org/abs/2507.21678v2
- Date: Wed, 29 Oct 2025 15:15:46 GMT
- Title: Predicting Abandonment of Open Source Software Projects with An Integrated Feature Framework
- Authors: Yiming Xu, Runzhi He, Hengzhi Ye, Minghui Zhou, Huaimin Wang,
- Abstract summary: Open Source Software (OSS) is a cornerstone of contemporary software development, yet the increasing prevalence of OSS project abandonment threatens global supply chains.<n>In this work, we address these challenges by reliably detecting OSS project abandonment through a dual approach: explicit archival status and rigorous semantic analysis of project documentation or description.<n>Building on this foundation, we introduce an integrated, multi-perspective feature framework for abandonment prediction, capturing user-centric, maintainer-centric, and project evolution features.<n>Our work provides reproducible, scalable, and practitioner-oriented support for understanding and managing abandonment risk in large OSS ecosystems.
- Score: 15.50732865972961
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Open Source Software (OSS) is a cornerstone of contemporary software development, yet the increasing prevalence of OSS project abandonment threatens global software supply chains. Although previous research has explored abandonment prediction methods, these methods often demonstrate unsatisfactory predictive performance, further plagued by imprecise abandonment discrimination, limited interpretability, and a lack of large, generalizable datasets. In this work, we address these challenges by reliably detecting OSS project abandonment through a dual approach: explicit archival status and rigorous semantic analysis of project documentation or description. Leveraging a precise and scalable labeling pipeline, we curate a comprehensive longitudinal dataset of 115,466 GitHub repositories, encompassing 57,733 confirmed abandonment repositories, enriched with detailed, timeline-based behavioral features. Building on this foundation, we introduce an integrated, multi-perspective feature framework for abandonment prediction, capturing user-centric, maintainer-centric, and project evolution features. Survival analysis using an AFT model yields a high C-index of 0.846, substantially outperforming models confined to surface features. Further, feature ablation and SHAP analyses confirm both the predictive power and interpretability of our approach. We further demonstrate practical deployment of a GBSA classifier for package risk in openEuler. By unifying precise labeling, multi-perspective features, and interpretable modeling, our work provides reproducible, scalable, and practitioner-oriented support for understanding and managing abandonment risk in large OSS ecosystems. Our tool not only predicts abandonment but also enhances program comprehension by providing actionable insights into the health and sustainability of OSS projects.
Related papers
- From Passive Metric to Active Signal: The Evolving Role of Uncertainty Quantification in Large Language Models [77.04403907729738]
This survey charts the evolution of uncertainty from a passive diagnostic metric to an active control signal guiding real-time model behavior.<n>We demonstrate how uncertainty is leveraged as an active control signal across three frontiers.<n>This survey argues that mastering the new trend of uncertainty is essential for building the next generation of scalable, reliable, and trustworthy AI.
arXiv Detail & Related papers (2026-01-22T06:21:31Z) - Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs [50.075587392477935]
We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems.<n>Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
arXiv Detail & Related papers (2026-01-20T06:42:56Z) - Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey [59.3507264893654]
Issue resolution is a complex Software Engineering task integral to real-world development.<n> benchmarks like SWE-bench revealed this task as profoundly difficult for large language models.<n>This paper presents a systematic survey of this emerging domain.
arXiv Detail & Related papers (2026-01-15T18:55:03Z) - Bridging Semantics & Structure for Software Vulnerability Detection using Hybrid Network Models [0.0]
We capture control- and data-flow relations as complex interaction networks.<n>Our framework combines graph representations with light-weight (4B) local LLMs.<n>Our method achieves 93.57% accuracy-an 8.36% gain over Graph Attention Network-based embeddings.
arXiv Detail & Related papers (2025-10-11T19:32:00Z) - An Accurate and Efficient Vulnerability Propagation Analysis Framework [13.051314477680902]
We propose a novel approach to quantify the scope and evolution of vulnerability impacts in software supply chains.<n>We implement a prototype of our approach in the Java Maven ecosystem and evaluate it on 100 real-world vulnerabilities.
arXiv Detail & Related papers (2025-06-02T05:55:45Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - Reasoning with LLMs for Zero-Shot Vulnerability Detection [0.9208007322096533]
We present textbfVulnSage, a comprehensive evaluation framework and a curated dataset from diverse, large-scale open-source system software projects.<n>The framework supports multi-granular analysis across function, file, and inter-function levels.<n>It employs four diverse zero-shot prompt strategies: Baseline, Chain-of-context, Think, and Think & verify.
arXiv Detail & Related papers (2025-03-22T23:59:17Z) - SurvHive: a package to consistently access multiple survival-analysis packages [0.0]
SurvHive is a Python-based framework designed to unify survival analysis methods within a coherent and interface modeled on scikit-learn.<n>SurvHive integrates classical statistical models with cutting-edge deep learning approaches, including transformer-based architectures and parametric survival models.
arXiv Detail & Related papers (2025-02-04T11:02:40Z) - Forecasting the risk of software choices: A model to foretell security vulnerabilities from library dependencies and source code evolution [4.538870924201896]
We introduce a model capable of vulnerability forecasting at library level.
Our model can estimate the probability that a software project faces a CVE disclosure in a future time window.
arXiv Detail & Related papers (2024-11-17T23:36:27Z) - Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs)
We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z) - Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding [49.973156959947346]
Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos.
We introduce a robust network module that benefits from a two-stage cross-modal alignment task.
It integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training.
In response, we develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up.
arXiv Detail & Related papers (2024-08-29T05:32:03Z) - Revealing the value of Repository Centrality in lifespan prediction of Open Source Software Projects [5.438725298163702]
We propose a novel metric from the user-repository network, and leverage the metric to fit project deprecation predictors.
We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning from 2011 to 2023.
Our study reveals a correlation between the HITS centrality metrics and the repository deprecation risk.
arXiv Detail & Related papers (2024-05-13T07:07:54Z) - Profile of Vulnerability Remediations in Dependencies Using Graph
Analysis [40.35284812745255]
This research introduces graph analysis methods and a modified Graph Attention Convolutional Neural Network (GAT) model.
We analyze control flow graphs to profile breaking changes in applications occurring from dependency upgrades intended to remediate vulnerabilities.
Results demonstrate the effectiveness of the enhanced GAT model in offering nuanced insights into the relational dynamics of code vulnerabilities.
arXiv Detail & Related papers (2024-03-08T02:01:47Z) - Distributionally Robust Statistical Verification with Imprecise Neural Networks [3.9456691693452552]
A particularly challenging problem in AI safety is providing guarantees on the behavior of high-dimensional autonomous systems.<n>This paper proposes a novel approach based on uncertainty quantification using concepts from probabilities.<n>We show that our approach can provide useful and scalable guarantees for high-dimensional systems.
arXiv Detail & Related papers (2023-08-28T18:06:24Z) - IRJIT: A Simple, Online, Information Retrieval Approach for Just-In-Time Software Defect Prediction [10.084626547964389]
Just-in-Time software defect prediction (JIT-SDP) prevents the introduction of defects into the software by identifying them at commit check-in time.
Current software defect prediction approaches rely on manually crafted features such as change metrics and involve expensive to train machine learning or deep learning models.
We propose an approach called IRJIT that employs information retrieval on source code and labels new commits as buggy or clean based on their similarity to past buggy or clean commits.
arXiv Detail & Related papers (2022-10-05T17:54:53Z) - Recursively Feasible Probabilistic Safe Online Learning with Control Barrier Functions [60.26921219698514]
We introduce a model-uncertainty-aware reformulation of CBF-based safety-critical controllers.
We then present the pointwise feasibility conditions of the resulting safety controller.
We use these conditions to devise an event-triggered online data collection strategy.
arXiv Detail & Related papers (2022-08-23T05:02:09Z) - Is Vertical Logistic Regression Privacy-Preserving? A Comprehensive
Privacy Analysis and Beyond [57.10914865054868]
We consider vertical logistic regression (VLR) trained with mini-batch descent gradient.
We provide a comprehensive and rigorous privacy analysis of VLR in a class of open-source Federated Learning frameworks.
arXiv Detail & Related papers (2022-07-19T05:47:30Z) - Learning Output Embeddings in Structured Prediction [73.99064151691597]
A powerful and flexible approach to structured prediction consists in embedding the structured objects to be predicted into a feature space of possibly infinite dimension.
A prediction in the original space is computed by solving a pre-image problem.
In this work, we propose to jointly learn a finite approximation of the output embedding and the regression function into the new feature space.
arXiv Detail & Related papers (2020-07-29T09:32:53Z) - An Uncertainty-based Human-in-the-loop System for Industrial Tool Wear
Analysis [68.8204255655161]
We show that uncertainty measures based on Monte-Carlo dropout in the context of a human-in-the-loop system increase the system's transparency and performance.
A simulation study demonstrates that the uncertainty-based human-in-the-loop system increases performance for different levels of human involvement.
arXiv Detail & Related papers (2020-07-14T15:47:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.