Efficient Story Point Estimation With Comparative Learning
- URL: http://arxiv.org/abs/2507.14642v1
- Date: Sat, 19 Jul 2025 14:36:19 GMT
- Title: Efficient Story Point Estimation With Comparative Learning
- Authors: Monoshiz Mahbub Khan, Xioayin Xi, Andrew Meneely, Zhe Yu,
- Abstract summary: Story point estimation is an essential part of agile software development.<n>Traditionally, developers estimate story points collaboratively using planning poker or other manual techniques.<n>Machine learning can reduce this burden, but only with enough context from the historical decisions made by the project team.
- Score: 4.2396247450279185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Story point estimation is an essential part of agile software development. Story points are unitless, project-specific effort estimates that help developers plan their sprints. Traditionally, developers estimate story points collaboratively using planning poker or other manual techniques. While the initial calibrating of the estimates to each project is helpful, once a team has converged on a set of precedents, story point estimation can become tedious and labor-intensive. Machine learning can reduce this burden, but only with enough context from the historical decisions made by the project team. That is, state-of-the-art models, such as GPT2SP and FastText-SVM, only make accurate predictions (within-project) when trained on data from the same project. The goal of this work is to streamline story point estimation by evaluating a comparative learning-based framework for calibrating project-specific story point prediction models. Instead of assigning a specific story point value to every backlog item, developers are presented with pairs of items, and indicate which item requires more effort. Using these comparative judgments, a machine learning model is trained to predict the story point estimates. We empirically evaluated our technique using data with 23,313 manual estimates in 16 projects. The model learned from comparative judgments can achieve on average 0.34 Spearman's rank correlation coefficient between its predictions and the ground truth story points. This is similar to, if not better than, the performance of a regression model learned from the ground truth story points. Therefore, the proposed comparative learning approach is more efficient than state-of-the-art regression-based approaches according to the law of comparative judgments - providing comparative judgments yields a lower cognitive burden on humans than providing ratings or categorical labels.
Related papers
- How to Select Datapoints for Efficient Human Evaluation of NLG Models? [57.60407340254572]
We develop and analyze a suite of selectors to get the most informative datapoints for human evaluation.<n>We show that selectors based on variance in automated metric scores, diversity in model outputs, or Item Response Theory outperform random selection.<n>In particular, we introduce source-based estimators, which predict item usefulness for human evaluation just based on the source texts.
arXiv Detail & Related papers (2025-01-30T10:33:26Z) - SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems.
We develop SureMap that has high estimation accuracy for both multi-task and single-task disaggregated evaluations of blackbox models.
Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE)
arXiv Detail & Related papers (2024-11-14T17:53:35Z) - Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.<n>We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.<n>We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z) - Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z) - A Framework for Efficient Model Evaluation through Stratification, Sampling, and Estimation [17.351089059392674]
We propose a framework for model evaluation that includes stratification, sampling, and estimation components.
We show that stratification via k-means clustering based on accurate predictions of model performance yields efficient estimators.
We also find that model-assisted estimators, which leverage predictions of model accuracy on the unlabeled portion of the dataset, are generally more efficient than the traditional estimates.
arXiv Detail & Related papers (2024-06-11T14:49:04Z) - Team-related Features in Code Review Prediction Models [10.576931077314887]
We evaluate the prediction power of features related to code ownership, workload, and team relationship.
Our results show that, individually, features related to code ownership have the best prediction power.
We conclude that all proposed features together with lines of code can make the best predictions for both reviewer participation and amount of feedback.
arXiv Detail & Related papers (2023-12-11T09:30:09Z) - ASPEST: Bridging the Gap Between Active Learning and Selective
Prediction [56.001808843574395]
Selective prediction aims to learn a reliable model that abstains from making predictions when uncertain.
Active learning aims to lower the overall labeling effort, and hence human dependence, by querying the most informative examples.
In this work, we introduce a new learning paradigm, active selective prediction, which aims to query more informative samples from the shifted target domain.
arXiv Detail & Related papers (2023-04-07T23:51:07Z) - An Empirical Study on Data Leakage and Generalizability of Link
Prediction Models for Issues and Commits [7.061740334417124]
LinkFormer preserves and improves the accuracy of existing predictions.
Our findings support that to simulate real-world scenarios effectively, researchers must maintain the temporal flow of data.
arXiv Detail & Related papers (2022-11-01T10:54:26Z) - Heterogeneous Graph Neural Networks for Software Effort Estimation [2.652428960991066]
Current approaches for automatically estimating story points focus on applying pre-trained embedding models and deep learning for text regression.
We propose HeteroSP, a tool for estimating story points from textual input of Agile software project issues.
arXiv Detail & Related papers (2022-06-22T12:46:02Z) - Learning to be a Statistician: Learned Estimator for Number of Distinct
Values [54.629042119819744]
Estimating the number of distinct values (NDV) in a column is useful for many tasks in database systems.
In this work, we focus on how to derive accurate NDV estimations from random (online/offline) samples.
We propose to formulate the NDV estimation task in a supervised learning framework, and aim to learn a model as the estimator.
arXiv Detail & Related papers (2022-02-06T15:42:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.