Moderately Mighty: To What Extent Can Internal Software Metrics Predict App Popularity at Launch?
- URL: http://arxiv.org/abs/2507.02110v3
- Date: Thu, 11 Sep 2025 19:10:13 GMT
- Title: Moderately Mighty: To What Extent Can Internal Software Metrics Predict App Popularity at Launch?
- Authors: Md Nahidul Islam Opu, Fatima Islam Mouri, Rick Kazman, Yuanfang Cai, Shaiful Chowdhury
- Abstract summary: This study explores the extent to which internal software metrics, measurable from source code before deployment, can predict an app's popularity. We constructed a rigorously filtered dataset of 446 open-source Java-based Android apps available on both F-Droid and Google Play Store. We conclude that, although internal code metrics alone are insufficient for accurately predicting an app's future popularity, they do exhibit meaningful correlations with it.
- Score: 8.230804322052956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting a mobile app's popularity before its first release can provide developers with a strategic advantage in a competitive marketplace, yet it remains a challenging problem. This study explores the extent to which internal software metrics, measurable from source code before deployment, can predict an app's popularity (i.e., ratings and downloads per year) at inception. For our analysis, we constructed a rigorously filtered dataset of 446 open-source Java-based Android apps that are available on both F-Droid and Google Play Store. Using app source code from F-Droid, we extracted a wide array of internal metrics, including system-, class-, and method-level code metrics, code smells, and app metadata. Popularity-related information, including reviews and download counts, was collected from the Play Store. We evaluate regression and classification models across three feature sets: a minimal Size-only baseline, a domain-informed Handpicked set, and a Voting set derived via feature selection algorithms. Our results show that, for both app ratings and number of downloads, regression models perform poorly due to skewed rating distributions and a highly scattered range of download counts in our dataset. However, when the problem is reframed as binary classification (Popular vs. Unpopular), performance improves significantly: the best model, a Multilayer Perceptron, achieves an F1-score of 0.72. We conclude that, although internal code metrics alone are insufficient for accurately predicting an app's future popularity, they do exhibit meaningful correlations with it. Thus, our findings challenge prior studies that have entirely dismissed internal metrics as valid indicators of software quality. Instead, our results align with research suggesting that internal code metrics can be valuable when evaluated within the appropriate context: specifically, we found them useful for classification tasks.
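The reframing described in the abstract, from regressing on a heavily skewed download count to classifying apps as Popular vs. Unpopular with an MLP, can be sketched as below. The feature names, the median-split popularity threshold, and the synthetic data are illustrative assumptions, not the authors' exact pipeline.

```python
# Sketch: binary Popular/Unpopular classification from internal code
# metrics, assuming hypothetical features and a median-split threshold.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
n = 446  # size of the paper's filtered dataset
X = rng.normal(size=(n, 5))  # stand-ins for e.g. LOC, WMC, CBO, smell count
# Skewed, long-tailed download counts: the kind of target that hurts regression.
downloads = np.exp(X[:, 0] + rng.normal(scale=1.0, size=n))
y = (downloads > np.median(downloads)).astype(int)  # Popular vs Unpopular

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
)
clf.fit(X_tr, y_tr)
print(f"F1: {f1_score(y_te, clf.predict(X_te)):.2f}")
```

Thresholding the skewed target before modeling is what makes the task tractable here; the raw counts span orders of magnitude, so a squared-error regressor is dominated by the tail.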
Related papers
- From App Features to Explanation Needs: Analyzing Correlations and Predictive Potential [2.2139415366377375]
This study investigates whether explanation needs, classified from user reviews, can be predicted based on app properties. We analyzed a gold standard dataset of 4,495 app reviews enriched with metadata.
arXiv Detail & Related papers (2025-08-05T19:46:13Z) - MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools [54.63478102768333]
Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions. We propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools.
arXiv Detail & Related papers (2025-04-28T18:06:38Z) - Prioritizing App Reviews for Developer Responses on Google Play [1.5771347525430772]
Since 2013, Google Play has allowed developers to respond to user reviews. Only 13% to 18% of developers engage in this practice. We propose a method to prioritize reviews based on response priority.
arXiv Detail & Related papers (2025-02-03T16:56:08Z) - What should an AI assessor optimise for? [57.96463917842822]
An AI assessor is an external, ideally independent system that predicts an indicator, e.g., a loss value, of another AI system. Here we address the question: is it always optimal to train the assessor for the target metric? We experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings.
arXiv Detail & Related papers (2025-02-01T08:41:57Z) - Robust Collaborative Filtering to Popularity Distribution Shift [56.78171423428719]
We present a simple yet effective debiasing strategy, PopGo, which quantifies and reduces the interaction-wise popularity shortcut without assumptions on the test data.
On both ID and OOD test sets, PopGo achieves significant gains over the state-of-the-art debiasing strategies.
arXiv Detail & Related papers (2023-10-16T04:20:52Z) - Revisiting Android App Categorization [5.805764439228492]
We present a comprehensive evaluation of existing Android app categorization approaches using our new ground-truth dataset.
We propose two innovative approaches that effectively outperform the performance of existing methods in both description-based and APK-based methodologies.
arXiv Detail & Related papers (2023-10-11T08:25:34Z) - Proactive Prioritization of App Issues via Contrastive Learning [2.6763498831034043]
We propose a new framework, PPrior, that enables proactive prioritization of app issues through identifying prominent reviews.
PPrior employs a pre-trained T5 model and works in three phases.
Phase one adapts the pre-trained T5 model to the user review data in a self-supervised fashion.
In phase two, we leverage contrastive training to learn a generic and task-independent representation of user reviews.
arXiv Detail & Related papers (2023-03-12T06:23:10Z) - A novel evaluation methodology for supervised Feature Ranking algorithms [0.0]
This paper proposes a new evaluation methodology for Feature Rankers.
By making use of synthetic datasets, feature importance scores can be known beforehand, allowing more systematic evaluation.
To facilitate large-scale experimentation using the new methodology, a benchmarking framework was built in Python, called fseval.
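The evaluation idea behind fseval, scoring a feature ranker against ground truth known by construction on synthetic data, can be sketched as follows. The ranker choice (a random forest's impurity importances) and dataset parameters are illustrative assumptions; fseval's actual API is not used here.

```python
# Sketch: with synthetic data, the informative features are known in
# advance, so a ranker's output can be scored against ground truth.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# shuffle=False places the 3 informative features in the first 3 columns.
X, y = make_classification(n_samples=500, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
true_top = set(range(3))

# Any feature ranker works here; a random forest is one common choice.
rf = RandomForestClassifier(random_state=0).fit(X, y)
ranked = np.argsort(rf.feature_importances_)[::-1]

# Systematic score: overlap between the ranker's top-k and the known truth.
hits = len(true_top & set(ranked[:3]))
print(f"top-3 overlap with ground truth: {hits}/3")
```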
arXiv Detail & Related papers (2022-07-09T12:00:36Z) - Integrating Rankings into Quantized Scores in Peer Review [61.27794774537103]
In peer review, reviewers are usually asked to provide scores for the papers.
To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed.
There is no standard procedure for using this ranking information, and Area Chairs may use it in different ways.
We take a principled approach to integrate the ranking information into the scores.
arXiv Detail & Related papers (2022-04-05T19:39:13Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - AppQ: Warm-starting App Recommendation Based on View Graphs [37.37177133951606]
New apps often have few (or even no) user feedback, suffering from the classic cold-start problem.
Here, a fundamental requirement is the capability to accurately measure an app's quality based on its inborn features, rather than user-generated features.
We propose AppQ, a novel app quality grading and recommendation system that extracts inborn features of apps based on app source code.
arXiv Detail & Related papers (2021-09-08T17:40:48Z) - Evaluating Pre-Trained Models for User Feedback Analysis in Software Engineering: A Study on Classification of App-Reviews [2.66512000865131]
We study the accuracy and time efficiency of pre-trained neural language models (PTMs) for app review classification.
We set up different studies to evaluate PTMs in multiple settings.
In all cases, Micro and Macro Precision, Recall, and F1-scores are used.
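The micro vs. macro distinction used in that evaluation is worth making concrete: micro-averaging pools all decisions into one score, while macro-averaging computes per-class F1 and averages them, so rare classes weigh equally. The review labels below are illustrative, not from the study's dataset.

```python
# Sketch: micro vs. macro averaged scores on a small multiclass example.
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = ["bug", "bug", "feature", "praise", "praise", "praise"]
y_pred = ["bug", "feature", "feature", "praise", "praise", "bug"]

for avg in ("micro", "macro"):
    p = precision_score(y_true, y_pred, average=avg, zero_division=0)
    r = recall_score(y_true, y_pred, average=avg, zero_division=0)
    f = f1_score(y_true, y_pred, average=avg, zero_division=0)
    print(f"{avg}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

For single-label multiclass data, micro-F1 equals plain accuracy (4/6 here), while macro-F1 differs whenever class sizes or per-class performance are uneven.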
arXiv Detail & Related papers (2021-04-12T23:23:45Z) - Evaluating Large-Vocabulary Object Detectors: The Devil is in the Details [107.2722027807328]
We find that the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors.
We show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin.
We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation.
arXiv Detail & Related papers (2021-02-01T18:56:02Z) - PiRank: Learning To Rank via Differentiable Sorting [85.28916333414145]
We propose PiRank, a new class of differentiable surrogates for ranking.
We show that PiRank exactly recovers the desired metrics in the limit of zero temperature.
arXiv Detail & Related papers (2020-12-12T05:07:36Z) - Emerging App Issue Identification via Online Joint Sentiment-Topic Tracing [66.57888248681303]
We propose a novel emerging issue detection approach named MERIT.
Based on the AOBST model, we infer the topics negatively reflected in user reviews for one app version.
Experiments on popular apps from Google Play and Apple's App Store demonstrate the effectiveness of MERIT.
arXiv Detail & Related papers (2020-08-23T06:34:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.