Can Internal Software Metrics Predict App Popularity at Launch? Yeas! and Nays!
- URL: http://arxiv.org/abs/2507.02110v1
- Date: Wed, 02 Jul 2025 19:47:41 GMT
- Title: Can Internal Software Metrics Predict App Popularity at Launch? Yeas! and Nays!
- Authors: Md Nahidul Islam Opu, Fatima Islam Mouri, Rick Kazman, Yuanfang Cai, Shaiful Chowdhury,
- Abstract summary: Internal software metrics, measurable from source code before deployment, can predict an app's popularity.<n>This study uses a dataset of 446 open-source Android apps from F-Droid.
- Score: 10.416999845729054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Predicting mobile app popularity before release can provide developers with a strategic advantage in a competitive marketplace, yet it remains a challenging problem. This study explores whether internal software metrics, measurable from source code before deployment, can predict an app's popularity, defined by user ratings (calculated from user reviews) and DownloadsPerYear (yearly downloads). Using a dataset of 446 open-source Android apps from F-Droid, we extract a wide array of features, including system-, class-, and method-level code metrics, code smells, and app metadata. Additional information, such as user reviews, download counts, and uses-permission, was collected from the Google Play Store. We evaluate regression and classification models across three feature sets: a minimal Size-only baseline, a domain-informed Handpicked set, and a Voting set derived via feature selection algorithms. Regression models perform poorly due to skewed data, with low $R^2$ scores. However, when reframed as binary classification (Popular vs. Unpopular), results improve significantly. The best model, a Multilayer Perceptron using the Voting set, achieves F1-scores of 0.72. These results suggest that internal code metrics, although limited in their explanatory power, can serve as useful indicators of app popularity. This challenges earlier findings that dismissed internal metrics as predictors of software quality.
Related papers
- From App Features to Explanation Needs: Analyzing Correlations and Predictive Potential [2.2139415366377375]
This study investigates whether explanation needs, classified from user reviews, can be predicted based on app properties.<n>We analyzed a gold standard dataset of 4,495 app reviews enriched with metadata.
arXiv Detail & Related papers (2025-08-05T19:46:13Z) - MICE for CATs: Model-Internal Confidence Estimation for Calibrating Agents with Tools [54.63478102768333]
Well-calibrated model confidences can be used to weigh the risk versus reward of potential actions.<n>We propose a novel class of model-internal confidence estimators (MICE) to better assess confidence when calling tools.
arXiv Detail & Related papers (2025-04-28T18:06:38Z) - Prioritizing App Reviews for Developer Responses on Google Play [1.5771347525430772]
Since 2013, Google Play has allowed developers to respond to user reviews.<n>Only 13% to 18% of developers engage in this practice.<n>We propose a method to prioritize reviews based on response priority.
arXiv Detail & Related papers (2025-02-03T16:56:08Z) - What should an AI assessor optimise for? [57.96463917842822]
An AI assessor is an external, ideally indepen-dent system that predicts an indicator, e.g., a loss value, of another AI system.<n>Here we address the question: is it always optimal to train the assessor for the target metric?<n>We experimentally explore this question for, respectively, regression losses and classification scores with monotonic and non-monotonic mappings.
arXiv Detail & Related papers (2025-02-01T08:41:57Z) - Robust Collaborative Filtering to Popularity Distribution Shift [56.78171423428719]
We present a simple yet effective debiasing strategy, PopGo, which quantifies and reduces the interaction-wise popularity shortcut without assumptions on the test data.
On both ID and OOD test sets, PopGo achieves significant gains over the state-of-the-art debiasing strategies.
arXiv Detail & Related papers (2023-10-16T04:20:52Z) - Revisiting Android App Categorization [5.805764439228492]
We present a comprehensive evaluation of existing Android app categorization approaches using our new ground-truth dataset.
We propose two innovative approaches that effectively outperform the performance of existing methods in both description-based and APK-based methodologies.
arXiv Detail & Related papers (2023-10-11T08:25:34Z) - Proactive Prioritization of App Issues via Contrastive Learning [2.6763498831034043]
We propose a new framework, PPrior, that enables proactive prioritization of app issues through identifying prominent reviews.
PPrior employs a pre-trained T5 model and works in three phases.
Phase one adapts the pre-trained T5 model to the user reviews data in a self-supervised fashion.
Phase two, we leverage contrastive training to learn a generic and task-independent representation of user reviews.
arXiv Detail & Related papers (2023-03-12T06:23:10Z) - A novel evaluation methodology for supervised Feature Ranking algorithms [0.0]
This paper proposes a new evaluation methodology for Feature Rankers.
By making use of synthetic datasets, feature importance scores can be known beforehand, allowing more systematic evaluation.
To facilitate large-scale experimentation using the new methodology, a benchmarking framework was built in Python, called fseval.
arXiv Detail & Related papers (2022-07-09T12:00:36Z) - Integrating Rankings into Quantized Scores in Peer Review [61.27794774537103]
In peer review, reviewers are usually asked to provide scores for the papers.
To mitigate this issue, conferences have started to ask reviewers to additionally provide a ranking of the papers they have reviewed.
There are no standard procedure for using this ranking information and Area Chairs may use it in different ways.
We take a principled approach to integrate the ranking information into the scores.
arXiv Detail & Related papers (2022-04-05T19:39:13Z) - A Closer Look at Debiased Temporal Sentence Grounding in Videos:
Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflating evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z) - AppQ: Warm-starting App Recommendation Based on View Graphs [37.37177133951606]
New apps often have few (or even no) user feedback, suffering from the classic cold-start problem.
Here, a fundamental requirement is the capability to accurately measure an app's quality based on its inborn features, rather than user-generated features.
We propose AppQ, a novel app quality grading and recommendation system that extracts inborn features of apps based on app source code.
arXiv Detail & Related papers (2021-09-08T17:40:48Z) - Evaluating Pre-Trained Models for User Feedback Analysis in Software
Engineering: A Study on Classification of App-Reviews [2.66512000865131]
We study the accuracy and time efficiency of pre-trained neural language models (PTMs) for app review classification.
We set up different studies to evaluate PTMs in multiple settings.
In all cases, Micro and Macro Precision, Recall, and F1-scores will be used.
arXiv Detail & Related papers (2021-04-12T23:23:45Z) - Evaluating Large-Vocabulary Object Detectors: The Devil is in the
Details [107.2722027807328]
We find that the default implementation of AP is neither category independent, nor does it directly reward properly calibrated detectors.
We show that the default implementation produces a gameable metric, where a simple, nonsensical re-ranking policy can improve AP by a large margin.
We benchmark recent advances in large-vocabulary detection and find that many reported gains do not translate to improvements under our new per-class independent evaluation.
arXiv Detail & Related papers (2021-02-01T18:56:02Z) - PiRank: Learning To Rank via Differentiable Sorting [85.28916333414145]
We propose PiRank, a new class of differentiable surrogates for ranking.
We show that PiRank exactly recovers the desired metrics in the limit of zero temperature.
arXiv Detail & Related papers (2020-12-12T05:07:36Z) - Emerging App Issue Identification via Online Joint Sentiment-Topic
Tracing [66.57888248681303]
We propose a novel emerging issue detection approach named MERIT.
Based on the AOBST model, we infer the topics negatively reflected in user reviews for one app version.
Experiments on popular apps from Google Play and Apple's App Store demonstrate the effectiveness of MERIT.
arXiv Detail & Related papers (2020-08-23T06:34:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.