Related papers: LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement

LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement

URL: http://arxiv.org/abs/2409.15724v1
Date: Tue, 24 Sep 2024 04:17:21 GMT
Title: LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement
Authors: Maram Assi, Safwat Hassan, Ying Zou,
Abstract summary: We propose a large language model (LLM)-based Competitive User Review Analysis for Feature Enhancement. LLM-Cure identifies and categorizes features within reviews by applying LLMs. When provided with a complaint in a user review, LLM-Cure curates highly rated (4 and 5 stars) reviews in competing apps related to the complaint.
Score: 0.7285835869818668
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The exponential growth of the mobile app market underscores the importance of constant innovation and rapid response to user demands. As user satisfaction is paramount to the success of a mobile application (app), developers typically rely on user reviews, which represent user feedback that includes ratings and comments to identify areas for improvement. However, the sheer volume of user reviews poses challenges in manual analysis, necessitating automated approaches. Existing automated approaches either analyze only the target apps reviews, neglecting the comparison of similar features to competitors or fail to provide suggestions for feature enhancement. To address these gaps, we propose a Large Language Model (LLM)-based Competitive User Review Analysis for Feature Enhancement) (LLM-Cure), an approach powered by LLMs to automatically generate suggestion s for mobile app feature improvements. More specifically, LLM-Cure identifies and categorizes features within reviews by applying LLMs. When provided with a complaint in a user review, LLM-Cure curates highly rated (4 and 5 stars) reviews in competing apps related to the complaint and proposes potential improvements tailored to the target application. We evaluate LLM-Cure on 1,056,739 reviews of 70 popular Android apps. Our evaluation demonstrates that LLM-Cure significantly outperforms the state-of-the-art approaches in assigning features to reviews by up to 13% in F1-score, up to 16% in recall and up to 11% in precision. Additionally, LLM-Cure demonstrates its capability to provide suggestions for resolving user complaints. We verify the suggestions using the release notes that reflect the changes of features in the target mobile app. LLM-Cure achieves a promising average of 73% of the implementation of the provided suggestions.

Related papers

LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions. To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline. Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose a new automated bias quantification framework-CALM- which quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
Exploring Requirements Elicitation from App Store User Reviews Using Large Language Models [0.0]
This research introduces an approach leveraging the power of Large Language Models to analyze user reviews for automated requirements elicitation. We fine-tuned three well-established LLMs BERT, DistilBERT, and GEMMA, on a dataset of app reviews labeled for usefulness. Our evaluation revealed BERT's superior performance, achieving an accuracy of 92.40% and an F1-score of 92.39%, demonstrating its effectiveness in accurately classifying useful reviews.
arXiv Detail & Related papers (2024-09-23T18:57:31Z)
A Fine-grained Sentiment Analysis of App Reviews using Large Language Models: An Evaluation Study [1.0787328610467801]
Large Language Models (LLMs) have shown impressive performance on several new tasks without updating the model's parameters. This study compares the performance of state-of-the-art LLMs, including GPT-4, ChatGPT, and LLama-2-chat variants, for extracting app features. Results indicate the best-performing GPT-4 model outperforms rule-based approaches by 23.6% in f1-score with zero-shot feature extraction.
arXiv Detail & Related papers (2024-09-11T10:21:13Z)
AI-Driven Review Systems: Evaluating LLMs in Scalable and Bias-Aware Academic Reviews [18.50142644126276]
We evaluate the alignment of automatic paper reviews with human reviews using an arena of human preferences by pairwise comparisons. We fine-tune an LLM to predict human preferences, predicting which reviews humans will prefer in a head-to-head battle between LLMs. We make the reviews of publicly available arXiv and open-access Nature journal papers available online, along with a free service which helps authors review and revise their research papers and improve their quality.
arXiv Detail & Related papers (2024-08-19T19:10:38Z)
Review-LLM: Harnessing Large Language Models for Personalized Review Generation [8.898103706804616]
Large Language Models (LLMs) have shown superior text modeling and generating ability. We propose Review-LLM that customizes LLMs for personalized review generation.
arXiv Detail & Related papers (2024-07-10T09:22:19Z)
Large Language Model as an Assignment Evaluator: Insights, Feedback, and Challenges in a 1000+ Student Course [49.296957552006226]
Using large language models (LLMs) for automatic evaluation has become an important evaluation method in NLP research. This report shares how we use GPT-4 as an automatic assignment evaluator in a university course with 1,028 students.
arXiv Detail & Related papers (2024-07-07T00:17:24Z)
Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions [77.66677127535222]
Auto-Arena is an innovative framework that automates the entire evaluation process using LLM-powered agents. In our experiments, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks.
arXiv Detail & Related papers (2024-05-30T17:19:19Z)
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
Self-Improving Customer Review Response Generation Based on LLMs [1.9274286238176854]
SCRABLE represents an adaptive customer review response automation that enhances itself with self-optimizing prompts. We introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains.
arXiv Detail & Related papers (2024-05-06T20:50:17Z)
Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
A Closer Look into Automatic Evaluation Using Large Language Models [75.49360351036773]
We discuss how details in the evaluation process change how well the ratings given by LLMs correlate with human ratings. We find that the auto Chain-of-Thought (CoT) used in G-Eval does not always make G-Eval more aligned with human ratings. We also show that forcing the LLM to output only a numeric rating, as in G-Eval, is suboptimal.
arXiv Detail & Related papers (2023-10-09T12:12:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.