Forecasting the Maintained Score from the OpenSSF Scorecard for GitHub Repositories linked to PyPI libraries
- URL: http://arxiv.org/abs/2601.18344v1
- Date: Mon, 26 Jan 2026 10:32:54 GMT
- Title: Forecasting the Maintained Score from the OpenSSF Scorecard for GitHub Repositories linked to PyPI libraries
- Authors: Alexandros Tsakpinis, Efe Berk Ergülec, Emil Schwenger, Alexander Pretschner
- Abstract summary: We study to what extent future maintenance activity, as captured by the OpenSSF Maintained score, can be forecasted. We analyze 3,220 GitHub repositories associated with the top 1% most central PyPI libraries by PageRank. Our results show that future maintenance activity can be predicted with meaningful accuracy.
- Score: 78.48200143057376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The OpenSSF Scorecard is widely used to assess the security posture of open-source software repositories, with the Maintained metric indicating recent development activity and helping identify potentially abandoned dependencies. However, this metric is inherently retrospective, reflecting only the past 90 days of activity and providing no insight into future maintenance, which limits its usefulness for proactive risk assessment. In this paper, we study to what extent future maintenance activity, as captured by the OpenSSF Maintained score, can be forecasted. We analyze 3,220 GitHub repositories associated with the top 1% most central PyPI libraries by PageRank and reconstruct historical Maintained scores over a three-year period. We formulate the task as multivariate time series forecasting and consider four target representations: raw scores, bucketed maintenance levels, numerical trend slopes, and categorical trend types. We compare a statistical model (VARMA), a machine learning model (Random Forest), and a deep learning model (LSTM) across training windows of 3-12 months and forecasting horizons of 1-6 months. Our results show that future maintenance activity can be predicted with meaningful accuracy, particularly for aggregated representations such as bucketed scores and trend types, achieving accuracies above 0.95 and 0.80, respectively. Simpler statistical and machine learning models perform on par with deep learning approaches, indicating that complex architectures are not required. These findings suggest that predictive modeling can effectively complement existing Scorecard metrics, enabling more proactive assessment of open-source maintenance risks.
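As a concrete illustration of the forecasting setup described in the abstract, the sketch below reconstructs a toy version of the bucketed-target variant: monthly Maintained scores are mapped to maintenance levels and a Random Forest predicts the future bucket from a lagged window. The bucket edges, the 6-month window, and the 1-month horizon are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of bucketed Maintained-score forecasting, assuming monthly
# scores in [0, 10]; bucket edges, window, and horizon are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def bucketize(scores, edges=(4, 8)):
    """Map raw 0-10 Maintained scores to 3 assumed maintenance levels."""
    return np.digitize(scores, edges)  # 0 = low, 1 = medium, 2 = high

def make_windows(series, window=6, horizon=1):
    """Turn one repository's score history into (lag-features, future-bucket) pairs."""
    X, y = [], []
    for t in range(len(series) - window - horizon + 1):
        X.append(series[t:t + window])
        y.append(bucketize([series[t + window + horizon - 1]])[0])
    return np.array(X), np.array(y)

# Toy stand-in for reconstructed monthly Maintained scores of many repos.
rng = np.random.default_rng(0)
histories = [np.clip(10 - 0.1 * np.arange(36) + rng.normal(0, 1, 36), 0, 10)
             for _ in range(50)]

pairs = [make_windows(h) for h in histories]
X = np.vstack([p[0] for p in pairs])
y = np.concatenate([p[1] for p in pairs])

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("train accuracy:", clf.score(X, y))
```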
Related papers
- Prescriptive Scaling Reveals the Evolution of Language Model Capabilities [22.14002750185524]
We estimate capability boundaries: high conditional quantiles of benchmark scores as a function of log pre-training FLOPs. We validate their temporal reliability by fitting on earlier model generations and evaluating on later releases. We introduce an efficient algorithm that recovers near-full data frontiers using roughly 20% of the evaluation budget.
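As a hedged sketch of the frontier-fitting idea (not the paper's exact estimator), the snippet below fits a high conditional quantile of a synthetic benchmark score as a function of log pre-training FLOPs using scikit-learn's quantile-loss gradient boosting; the 0.9 quantile level is an assumption.

```python
# Fit a high conditional quantile of score vs. log FLOPs (synthetic data).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
log_flops = rng.uniform(20, 26, 300).reshape(-1, 1)  # synthetic log10 FLOPs
score = 1 / (1 + np.exp(-(log_flops.ravel() - 23))) + rng.normal(0, 0.1, 300)

# The 0.9 conditional quantile serves as an estimate of the capability frontier.
frontier = GradientBoostingRegressor(loss="quantile", alpha=0.9)
frontier.fit(log_flops, score)

grid = np.linspace(20, 26, 7).reshape(-1, 1)
print(np.c_[grid, frontier.predict(grid)])
```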
arXiv Detail & Related papers (2026-02-17T03:13:51Z) - Scaling Open-Ended Reasoning to Predict the Future [56.672065928345525]
We train language models to make predictions on open-ended forecasting questions. To scale up training data, we synthesize novel forecasting questions from global events reported in daily news. We find that calibration improvements from forecasting training generalize across popular benchmarks.
arXiv Detail & Related papers (2025-12-31T18:59:51Z) - GIFT-Eval: A Benchmark For General Time Series Forecasting Model Evaluation [90.53485251837235]
Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training.
GIFT-Eval is a pioneering benchmark aimed at promoting evaluation across diverse datasets.
GIFT-Eval encompasses 23 datasets over 144,000 time series and 177 million data points.
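As context for how such forecasting benchmarks score models, the sketch below computes MASE (mean absolute scaled error), a standard scale-free point metric; whether GIFT-Eval uses exactly this metric is not stated in the summary above, so treat it as an illustrative choice.

```python
# MASE: forecast MAE scaled by the in-sample seasonal-naive MAE.
import numpy as np

def mase(y_true, y_pred, y_train, season=1):
    naive_mae = np.mean(np.abs(y_train[season:] - y_train[:-season]))
    return np.mean(np.abs(y_true - y_pred)) / naive_mae

y_train = np.array([10, 12, 13, 12, 15, 16, 18, 17], dtype=float)
y_true = np.array([19, 20, 21], dtype=float)
y_pred = np.array([18, 20, 22], dtype=float)
print(round(mase(y_true, y_pred, y_train), 3))
```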
arXiv Detail & Related papers (2024-10-14T11:29:38Z) - A Probabilistic Perspective on Unlearning and Alignment for Large Language Models [48.96686419141881]
We introduce the first formal probabilistic evaluation framework for Large Language Models (LLMs). Namely, we propose novel metrics with high-probability guarantees concerning the output distribution of a model. Our metrics are application-independent and allow practitioners to make more reliable estimates about model capabilities before deployment.
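A minimal sketch of the flavor of guarantee described above, assuming a Hoeffding-style bound (the paper's actual metrics may differ): estimate the probability that a model emits an undesired output and report a one-sided high-probability upper bound.

```python
# One-sided Hoeffding upper bound on an emission probability.
import math

def upper_bound(hits, n, delta=0.05):
    """With probability >= 1 - delta, the true emission probability
    lies below the empirical rate plus sqrt(ln(1/delta) / (2n))."""
    return hits / n + math.sqrt(math.log(1 / delta) / (2 * n))

# e.g. 3 undesired generations observed in 10,000 samples
print(upper_bound(3, 10_000))  # ~0.0125 at 95% confidence
```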
arXiv Detail & Related papers (2024-10-04T15:44:23Z) - Revealing the value of Repository Centrality in lifespan prediction of Open Source Software Projects [5.438725298163702]
We propose a novel metric derived from the user-repository network and leverage it to fit project deprecation predictors.
We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning from 2011 to 2023.
Our study reveals a correlation between the HITS centrality metrics and the repository deprecation risk.
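A hedged sketch of the centrality idea, assuming a networkx graph of user-to-repository interactions (the toy edges are invented for illustration): HITS authority scores then serve as the repository-centrality feature.

```python
# Run HITS on a user-repository interaction graph; repository "authority"
# reflects attention from well-connected users.
import networkx as nx

G = nx.DiGraph()
# users point to the repositories they star or contribute to
G.add_edges_from([
    ("alice", "repo/core"), ("alice", "repo/utils"),
    ("bob", "repo/core"), ("carol", "repo/core"), ("carol", "repo/legacy"),
])

hubs, authorities = nx.hits(G, normalized=True)
repos = {n: s for n, s in authorities.items() if n.startswith("repo/")}
# higher authority -> more central repository -> (per the paper) lower deprecation risk
print(sorted(repos.items(), key=lambda kv: -kv[1]))
```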
arXiv Detail & Related papers (2024-05-13T07:07:54Z) - A Meta-Learning Approach to Predicting Performance and Data Requirements [163.4412093478316]
We propose an approach to estimate the number of samples required for a model to reach a target performance.
We find that the power law, the de facto principle for estimating model performance, leads to large errors when using a small dataset.
We introduce a novel piecewise power law (PPL) that handles the two data regimes differently.
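The summary does not give the PPL's exact parameterization; the sketch below fits one plausible two-regime form, with a shared breakpoint n0 where the pieces meet continuously, to synthetic (dataset size, error) pairs.

```python
# Fit a generic two-regime power law with a continuous breakpoint n0.
import numpy as np
from scipy.optimize import curve_fit

def ppl(n, a, b1, b2, n0):
    """Error ~ a * n^-b1 below n0, continuing with slope -b2 above."""
    n = np.asarray(n, dtype=float)
    small = a * n ** -b1
    large = a * n0 ** (b2 - b1) * n ** -b2  # chosen so the pieces meet at n0
    return np.where(n < n0, small, large)

rng = np.random.default_rng(0)
sizes = np.logspace(2, 6, 40)
errors = ppl(sizes, 5.0, 0.1, 0.35, 1e4) * rng.normal(1, 0.03, sizes.size)

params, _ = curve_fit(ppl, sizes, errors, p0=[1.0, 0.2, 0.2, 1e3],
                      bounds=(1e-6, [100, 2, 2, 1e6]), maxfev=20000)
print(dict(zip(["a", "b1", "b2", "n0"], params)))
```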
arXiv Detail & Related papers (2023-03-02T21:48:22Z) - Firenze: Model Evaluation Using Weak Signals [5.723905680436377]
We introduce Firenze, a novel framework for comparative evaluation of machine learning models' performance.
We show that markers computed and combined over select subsets of samples, called regions of interest, can provide a robust estimate of real-world performance.
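The markers themselves are left abstract in the summary; as a hypothetical illustration only, the snippet below computes per-region mean-score deltas between two models and combines them with a simple sign vote (the regions and the aggregation rule are assumptions).

```python
# Compare two models over hand-picked "regions of interest" (illustrative).
import numpy as np

def roi_markers(scores_a, scores_b, rois):
    """Per-ROI mean-score difference between model A and model B."""
    return {name: scores_a[idx].mean() - scores_b[idx].mean()
            for name, idx in rois.items()}

rng = np.random.default_rng(0)
scores_a, scores_b = rng.uniform(0, 1, 1000), rng.uniform(0, 1, 1000)
rois = {"recent_samples": np.arange(0, 200),
        "high_prevalence": np.arange(200, 500),
        "low_confidence": np.arange(500, 650)}

markers = roi_markers(scores_a, scores_b, rois)
verdict = "A" if sum(d > 0 for d in markers.values()) > len(markers) / 2 else "B"
print(markers, "->", verdict)
```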
arXiv Detail & Related papers (2022-07-02T13:20:38Z) - Short-term forecasting COVID-19 cumulative confirmed cases: Perspectives for Brazil [3.0711362702464675]
The novel coronavirus (COVID-19) is an emerging disease that has infected millions of people since its first notification.
In this paper, autoregressive integrated moving average (ARIMA), cubist (CUBIST), random forest (RF), ridge regression (RIDGE), and stacking-ensemble learning are evaluated.
The developed models generate accurate forecasts, with errors of 0.87%-3.51%, 1.02%-5.63%, and 0.95%-6.90% for one-, three-, and six-day-ahead horizons, respectively.
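A minimal sketch of one of the evaluated baselines, assuming statsmodels' ARIMA with an illustrative (1, 2, 1) order rather than the paper's tuned configuration:

```python
# ARIMA forecasts of a toy cumulative-case curve at 1, 3, and 6 days ahead.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# toy cumulative-case curve: accelerating growth plus noise
cases = np.cumsum(np.cumsum(rng.poisson(5, 60)))

fit = ARIMA(cases, order=(1, 2, 1)).fit()
forecast = fit.forecast(steps=6)
for h in (1, 3, 6):
    print(f"{h}-day-ahead: {forecast[h - 1]:,.0f}")
```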
arXiv Detail & Related papers (2020-07-21T17:58:58Z) - Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, perform within 3% of fully supervised pre-trained language models.
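A hedged sketch of the core mechanism, assuming Monte Carlo dropout as the uncertainty estimate (the threshold and number of stochastic passes below are illustrative, not the paper's settings):

```python
# Keep only pseudo-labels on which the network is confident under MC dropout.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Dropout(0.3), nn.Linear(64, 4))

def mc_dropout_select(x, passes=10, max_var=0.01):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
    mean, var = probs.mean(0), probs.var(0)
    labels = mean.argmax(-1)
    # keep samples whose predicted-class probability has low variance
    keep = var.gather(-1, labels.unsqueeze(-1)).squeeze(-1) < max_var
    return x[keep], labels[keep]

x_unlabeled = torch.randn(128, 32)
x_kept, pseudo = mc_dropout_select(x_unlabeled)
print(f"kept {len(x_kept)} of {len(x_unlabeled)} unlabeled samples")
```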
arXiv Detail & Related papers (2020-06-27T08:13:58Z) - Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
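A rough sketch of the hindsight-modelling idea under stated assumptions (network sizes and losses are illustrative): learn features phi of the future trajectory that aid value prediction, while training a model to predict phi from the current state alone so the features remain usable at test time.

```python
# Hindsight features phi(future) aid the value head during training;
# a model learns to predict phi from the current state for test time.
import torch
import torch.nn as nn

state_dim, future_dim, phi_dim = 16, 32, 8

hindsight = nn.Linear(future_dim, phi_dim)    # phi(future), training only
model = nn.Linear(state_dim, phi_dim)         # phi_hat(state), usable at test time
value = nn.Linear(state_dim + phi_dim, 1)     # v(state, phi)

opt = torch.optim.Adam([*hindsight.parameters(), *model.parameters(),
                        *value.parameters()], lr=1e-3)

state = torch.randn(64, state_dim)
future = torch.randn(64, future_dim)          # summary of observations after t
returns = torch.randn(64, 1)

phi = hindsight(future)
v_hindsight = value(torch.cat([state, phi], -1))  # value aided by hindsight
loss = (((v_hindsight - returns) ** 2).mean()
        + ((model(state) - phi.detach()) ** 2).mean())  # make phi predictable
opt.zero_grad()
loss.backward()
opt.step()
# at test time: value(torch.cat([state, model(state)], -1))
```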
arXiv Detail & Related papers (2020-02-19T18:10:20Z)