What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps
- URL: http://arxiv.org/abs/2506.10785v1
- Date: Thu, 12 Jun 2025 14:56:52 GMT
- Title: What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps
- Authors: Vinaik Chhetri, Krishna Upadhyay, A. B. Siddique, Umar Farooq
- Abstract summary: We present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps. We leverage a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review.
- Score: 2.352412885878654
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Artificial Intelligence (AI)-powered features have rapidly proliferated across mobile apps in various domains, including productivity, education, entertainment, and creativity. However, how users perceive, evaluate, and critique these AI features remains largely unexplored, primarily due to the overwhelming volume of user feedback. In this work, we present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps, leveraging a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. We develop and validate a multi-stage analysis pipeline that begins with a human-labeled benchmark and systematically evaluates large language models (LLMs) and prompting strategies. Each stage, including review classification, aspect-sentiment extraction, and clustering, is validated for accuracy and consistency. Our pipeline enables scalable, high-precision analysis of user feedback, extracting over one million aspect-sentiment pairs clustered into 18 positive and 15 negative user topics. Our analysis reveals that users consistently focus on a narrow set of themes: positive comments emphasize productivity, reliability, and personalized assistance, while negative feedback highlights technical failures (e.g., scanning and recognition), pricing concerns, and limitations in language support. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review. These fine-grained, co-occurring sentiments are often missed by traditional approaches that treat positive and negative feedback in isolation or rely on coarse-grained analysis. As a result, our approach provides a more faithful reflection of real-world user experiences with AI-powered apps. Category-aware analysis further uncovers both universal drivers of satisfaction and domain-specific frustrations.
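As a rough illustration, a pipeline of this shape (classify, extract aspect-sentiment pairs, then cluster) might be sketched as below; the `llm` helper, prompt wording, and JSON schema are hypothetical stand-ins, since the paper's exact prompts and model choices are not reproduced here:

```python
import json

def llm(prompt: str) -> str:
    """Placeholder for any chat-completion API call (hypothetical)."""
    raise NotImplementedError

def classify_review(review: str) -> bool:
    """Stage 1: keep only reviews that actually discuss an AI feature."""
    answer = llm(
        "Does this app review discuss an AI-powered feature? "
        f"Answer yes or no.\n\nReview: {review}"
    )
    return answer.strip().lower().startswith("yes")

def extract_aspect_sentiments(review: str) -> list[dict]:
    """Stage 2: pull out (aspect, sentiment) pairs; a single review may
    mix praise for one feature with frustration about another."""
    raw = llm(
        "List every (aspect, sentiment) pair in this review as JSON, "
        'e.g. [{"aspect": "scanning", "sentiment": "negative"}].\n\n'
        f"Review: {review}"
    )
    return json.loads(raw)

review = "The AI summaries save me hours, but handwriting scanning keeps failing."
if classify_review(review):
    pairs = extract_aspect_sentiments(review)
    # Stage 3 (not shown): embed and cluster the aspect phrases into topics.
```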
Related papers
- VeriMinder: Mitigating Analytical Vulnerabilities in NL2SQL [11.830097026198308]
Application systems using natural language interfaces to databases (NLIDBs) have democratized data analysis. This has also brought forth an urgent challenge to help users who might use these systems without a background in statistical analysis. We present VeriMinder, https://veriminder.ai, an interactive system for detecting and mitigating such analytical vulnerabilities.
arXiv Detail & Related papers (2025-07-23T19:48:12Z)
- Exploring Zero-Shot App Review Classification with ChatGPT: Challenges and Potential [1.1988955088595858]
This study explores the potential of zero-shot learning with ChatGPT for classifying app reviews into four categories: functional requirement, non-functional requirement, both, or neither. We evaluate ChatGPT's performance on a benchmark dataset of 1,880 manually annotated reviews from ten diverse apps spanning multiple domains.
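A minimal sketch of zero-shot review classification in this spirit, using the OpenAI Python client; the model name and prompt wording are illustrative assumptions rather than the study's exact setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

LABELS = ["functional requirement", "non-functional requirement", "both", "neither"]

def classify_zero_shot(review: str) -> str:
    """Ask the model to pick one of the four categories, with no training examples."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; the study used ChatGPT
        messages=[{
            "role": "user",
            "content": (
                "Classify this app review as exactly one of: "
                + ", ".join(LABELS) + ".\n\nReview: " + review
            ),
        }],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_zero_shot("The app crashes whenever I export a PDF."))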
arXiv Detail & Related papers (2025-05-07T19:39:04Z) - On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI (systems that autonomously plan and act) is becoming widespread, yet its success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly incorporates feedback extracted from different forms of critiques.
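A schematic sketch of the sample-evaluate-feedback loop the abstract describes; `generate`, `evaluate`, and `critique` are hypothetical stand-ins for the underlying model calls:

```python
def generate(task: str, feedback: str | None = None) -> str:
    """Hypothetical agent call; regenerates conditioned on prior feedback."""
    raise NotImplementedError

def evaluate(candidate: str) -> float:
    """Hypothetical verifier returning a quality score."""
    raise NotImplementedError

def critique(candidate: str) -> str:
    """Hypothetical critic producing natural-language feedback."""
    raise NotImplementedError

def iterative_decoding(task: str, rounds: int = 4, threshold: float = 0.9) -> str:
    """Repeatedly fold critic feedback back into generation, keeping the best draft."""
    feedback, best, best_score = None, "", float("-inf")
    for _ in range(rounds):
        candidate = generate(task, feedback)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:
            break
        feedback = critique(candidate)
    return best
```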
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- A Contrastive Framework with User, Item and Review Alignment for Recommendation [25.76462243743591]
We introduce a Review-centric Contrastive Alignment Framework for Recommendation (ReCAFR). ReCAFR incorporates reviews into the core learning process, ensuring alignment among user, item, and review representations. Specifically, we leverage two self-supervised contrastive strategies that exploit review-based augmentation to alleviate sparsity.
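One common way to realize such a review-based contrastive objective is an in-batch InfoNCE loss; the PyTorch sketch below is an illustration under that assumption, since the abstract gives only the high-level idea:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor: torch.Tensor, positive: torch.Tensor,
             temperature: float = 0.1) -> torch.Tensor:
    """In-batch contrastive loss: each anchor's positive is the matching row;
    every other row in the batch serves as a negative."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0))        # diagonal entries are positives
    return F.cross_entropy(logits, targets)

# e.g. align a user embedding with the embedding of a review that user wrote:
user_emb = torch.randn(32, 64)
review_emb = torch.randn(32, 64)
loss = info_nce(user_emb, review_emb)
```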
arXiv Detail & Related papers (2025-01-21T08:21:45Z)
- Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
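Conceptually, such an attack can be framed as searching for inputs that maximize the score gap between the victim evaluator and a proxy for human judgment; the loop below is a hypothetical sketch of that framing, not AdvEval's actual algorithm:

```python
def perturb(text: str) -> list[str]:
    """Hypothetical LLM-based rewriter producing candidate variants."""
    raise NotImplementedError

def victim_score(text: str) -> float:
    """Score from the evaluator under attack (higher = better quality)."""
    raise NotImplementedError

def reference_score(text: str) -> float:
    """Hypothetical proxy for human judgment, e.g. a strong LLM judge."""
    raise NotImplementedError

def adversarial_search(seed: str, rounds: int = 5) -> str:
    """Keep the variant on which victim and reference disagree most."""
    best = seed
    for _ in range(rounds):
        candidates = perturb(best)
        best = max(candidates,
                   key=lambda t: abs(victim_score(t) - reference_score(t)))
    return best
```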
arXiv Detail & Related papers (2024-05-23T14:48:15Z)
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- Human-in-the-loop Fairness: Integrating Stakeholder Feedback to Incorporate Fairness Perspectives in Responsible AI [4.0247545547103325]
Fairness is a growing concern for high-risk decision-making using Artificial Intelligence (AI).
There is no universally accepted fairness measure, fairness is context-dependent, and there might be conflicting perspectives on what is considered fair.
Our work follows an approach where stakeholders can give feedback on specific decision instances and their outcomes with respect to their fairness.
arXiv Detail & Related papers (2023-12-13T11:17:29Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Continually Improving Extractive QA via Human Feedback [59.49549491725224]
We study continually improving an extractive question answering (QA) system via human user feedback.
We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time.
arXiv Detail & Related papers (2023-05-21T14:35:32Z)
- SIFN: A Sentiment-aware Interactive Fusion Network for Review-based Item Recommendation [48.1799451277808]
We propose a Sentiment-aware Interactive Fusion Network (SIFN) for review-based item recommendation.
We first encode user/item reviews via BERT and propose a lightweight sentiment learner to extract semantic features of each review.
Then, we propose a sentiment prediction task that guides the sentiment learner to extract sentiment-aware features via explicit sentiment labels.
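A minimal sketch of the encode-then-predict step, using Hugging Face transformers for the BERT encoding; the `SentimentLearner` head below is an illustrative stand-in for the paper's component, not its actual architecture:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

class SentimentLearner(nn.Module):
    """Illustrative lightweight head, trained against explicit sentiment labels."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.head = nn.Linear(hidden, 2)  # positive / negative

    def forward(self, review_emb: torch.Tensor) -> torch.Tensor:
        return self.head(review_emb)

inputs = tokenizer("Great food but slow service.", return_tensors="pt")
with torch.no_grad():
    review_emb = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] vector
logits = SentimentLearner()(review_emb)  # supervised by sentiment labels
```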
arXiv Detail & Related papers (2021-08-18T08:04:38Z)
- Sentiment Analysis of Users' Reviews on COVID-19 Contact Tracing Apps with a Benchmark Dataset [6.592595861973966]
Contact tracing has been globally adopted in the fight to control the infection rate of COVID-19. Thanks to digital technologies, such as smartphones and wearable devices, contacts of COVID-19 patients can be easily traced and informed about their potential exposure to the virus.
Several interesting mobile applications have been developed. However, there are ever-growing concerns over the working mechanism and performance of these applications.
In this work, we propose a pipeline starting from manual annotation via a crowd-sourcing study and concluding with the development and training of AI models for automatic sentiment analysis of users' reviews.
arXiv Detail & Related papers (2021-03-01T18:43:10Z)
- SentiLSTM: A Deep Learning Approach for Sentiment Analysis of Restaurant Reviews [13.018530502810128]
This paper proposes a deep learning-based technique (i.e., BiLSTM) to classify restaurant reviews provided by clients into positive and negative polarities.
The results of the evaluation on the test dataset show that the BiLSTM technique produced the highest accuracy of 91.35%.
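A compact PyTorch sketch of a BiLSTM sentiment classifier of this kind; the hyperparameters are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    """Embed tokens, run a bidirectional LSTM, classify from the final states."""
    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 2)  # positive / negative

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        final = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # forward + backward states
        return self.fc(final)

model = BiLSTMClassifier(vocab_size=20_000)
logits = model(torch.randint(0, 20_000, (8, 40)))  # batch of 8 reviews, 40 tokens
```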
arXiv Detail & Related papers (2020-11-19T06:24:42Z)