Exploring Zero-Shot App Review Classification with ChatGPT: Challenges and Potential
- URL: http://arxiv.org/abs/2505.04759v1
- Date: Wed, 07 May 2025 19:39:04 GMT
- Title: Exploring Zero-Shot App Review Classification with ChatGPT: Challenges and Potential
- Authors: Mohit Chaudhary, Chirag Jain, Preethu Rose Anish
- Abstract summary: This study explores the potential of zero-shot learning with ChatGPT for classifying app reviews into four categories: functional requirement, non-functional requirement, both, or neither. We evaluate ChatGPT's performance on a benchmark dataset of 1,880 manually annotated reviews from ten diverse apps spanning multiple domains.
- Score: 1.1988955088595858
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: App reviews are a critical source of user feedback, offering valuable insights into an app's performance, features, usability, and overall user experience. Effectively analyzing these reviews is essential for guiding app development, prioritizing feature updates, and enhancing user satisfaction. Classifying reviews into functional and non-functional requirements plays a pivotal role in distinguishing feedback related to specific app features (functional requirements) from feedback concerning broader quality attributes, such as performance, usability, and reliability (non-functional requirements). Both categories are integral to informed development decisions. Traditional approaches to classifying app reviews are hindered by the need for large, domain-specific datasets, which are often costly and time-consuming to curate. This study explores the potential of zero-shot learning with ChatGPT for classifying app reviews into four categories: functional requirement, non-functional requirement, both, or neither. We evaluate ChatGPT's performance on a benchmark dataset of 1,880 manually annotated reviews from ten diverse apps spanning multiple domains. Our findings demonstrate that ChatGPT achieves a robust F1 score of 0.842 in review classification, despite certain challenges and limitations. Additionally, we examine how factors such as review readability and length impact classification accuracy and conduct a manual analysis to identify review categories more prone to misclassification.
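As a rough illustration of the zero-shot setup described in the abstract, the minimal Python sketch below prompts a chat model to assign one of the four categories to a single review. The prompt wording, the `gpt-3.5-turbo` model name, and the use of the OpenAI Python SDK are assumptions of this example, not details taken from the paper.

```python
# Minimal sketch of zero-shot app review classification with a chat model.
# Prompt text and model choice are illustrative assumptions, not the authors' setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LABELS = ["functional requirement", "non-functional requirement", "both", "neither"]

def classify_review(review: str, model: str = "gpt-3.5-turbo") -> str:
    """Ask the model to assign exactly one of the four categories to a review."""
    prompt = (
        "Classify the following app review into exactly one of these categories: "
        + ", ".join(LABELS) + ".\n"
        "Answer with the category name only.\n\n"
        f"Review: {review}"
    )
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for evaluation
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().lower()

print(classify_review("The app crashes every time I open the camera."))
```

In an evaluation along the lines of the paper, such a call would be run over all 1,880 annotated reviews and the returned labels compared against the manual annotations to compute the F1 score.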
Related papers
- What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps [2.352412885878654]
We present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps. We leverage a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review.
arXiv Detail & Related papers (2025-06-12T14:56:52Z) - An Empirical Study of Evaluating Long-form Question Answering [77.8023489322551]
We collect 5,236 factoid and non-factoid long-form answers generated by different large language models. We conduct a human evaluation on 2,079 of them, focusing on correctness and informativeness. We find that the style, length of the answers, and the category of questions can bias the automatic evaluation metrics.
arXiv Detail & Related papers (2025-04-25T15:14:25Z) - LazyReview A Dataset for Uncovering Lazy Thinking in NLP Peer Reviews [74.87393214734114]
This work introduces LazyReview, a dataset of peer-review sentences annotated with fine-grained lazy thinking categories. Large Language Models (LLMs) struggle to detect these instances in a zero-shot setting. Instruction-based fine-tuning on our dataset significantly boosts performance by 10-20 points.
arXiv Detail & Related papers (2025-04-15T10:07:33Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - How Effectively Do LLMs Extract Feature-Sentiment Pairs from App Reviews? [2.218667838700643]
This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and different variants of Llama-2 chat. For predicting positive and neutral sentiments, GPT-4 achieves f1-scores of 76% and 45% in the zero-shot setting.
arXiv Detail & Related papers (2024-09-11T10:21:13Z) - Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z) - Fairness Concerns in App Reviews: A Study on AI-based Mobile Apps [9.948068408730654]
This research aims to investigate fairness concerns raised in mobile app reviews.
Our research focuses on AI-based mobile app reviews as the chance of unfair behaviors and outcomes in AI-based apps may be higher than in non-AI-based apps.
arXiv Detail & Related papers (2024-01-16T03:43:33Z) - Evaluation of ChatGPT Feedback on ELL Writers' Coherence and Cohesion [0.7028778922533686]
ChatGPT has had a transformative effect on education, where students use it to help with homework assignments and teachers actively employ it in their teaching practices.
This study evaluated the quality of the feedback generated by ChatGPT regarding the coherence and cohesion of essays written by English Language Learner (ELL) students.
arXiv Detail & Related papers (2023-10-10T10:25:56Z) - Can GitHub Issues Help in App Review Classifications? [0.7366405857677226]
We propose a novel approach that assists in augmenting labeled datasets by utilizing information extracted from GitHub issues.
Our results demonstrate that using labeled issues for data augmentation can improve the F1-score by 6.3 points for bug reports and 7.2 points for feature requests.
arXiv Detail & Related papers (2023-08-27T22:01:24Z) - SIFN: A Sentiment-aware Interactive Fusion Network for Review-based Item Recommendation [48.1799451277808]
We propose a Sentiment-aware Interactive Fusion Network (SIFN) for review-based item recommendation.
We first encode user/item reviews via BERT and propose a lightweight sentiment learner to extract semantic features of each review.
Then, we propose a sentiment prediction task that guides the sentiment learner to extract sentiment-aware features via explicit sentiment labels.
arXiv Detail & Related papers (2021-08-18T08:04:38Z) - TOUR: Dynamic Topic and Sentiment Analysis of User Reviews for Assisting App Release [34.529117157417176]
TOUR is able to (i) detect and summarize emerging app issues over app versions, (ii) identify user sentiment towards app features, and (iii) prioritize important user reviews for facilitating developers' examination.
arXiv Detail & Related papers (2021-03-26T08:44:55Z) - Emerging App Issue Identification via Online Joint Sentiment-Topic Tracing [66.57888248681303]
We propose a novel emerging issue detection approach named MERIT.
Based on the AOBST model, we infer the topics negatively reflected in user reviews for one app version.
Experiments on popular apps from Google Play and Apple's App Store demonstrate the effectiveness of MERIT.
arXiv Detail & Related papers (2020-08-23T06:34:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.