Automating App Review Response Generation
- URL: http://arxiv.org/abs/2002.03552v1
- Date: Mon, 10 Feb 2020 05:23:38 GMT
- Title: Automating App Review Response Generation
- Authors: Cuiyun Gao, Jichuan Zeng, Xin Xia, David Lo, Michael R. Lyu, Irwin King
- Abstract summary: We propose a novel approach RRGen that automatically generates review responses by learning knowledge relations between reviews and their responses.
Experiments on 58 apps and 309,246 review-response pairs highlight that RRGen outperforms the baselines by at least 67.4% in terms of BLEU-4.
- Score: 67.58267006314415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous studies showed that replying to a user review usually has a positive
effect on the rating that is given by the user to the app. For example, Hassan
et al. found that responding to a review increases the chances of a user
updating their given rating by up to six times compared to not responding. To
alleviate the labor burden in replying to the bulk of user reviews, developers
usually adopt a template-based strategy where the templates can express
appreciation for using the app or mention the company email address for users
to follow up. However, reading a large number of user reviews every day is not
an easy task for developers. Thus, there is a need for more automation to help
developers respond to user reviews.
Addressing the aforementioned need, in this work we propose a novel approach
RRGen that automatically generates review responses by learning knowledge
relations between reviews and their responses. RRGen explicitly incorporates
review attributes, such as user rating and review length, and learns the
relations between reviews and corresponding responses in a supervised way from
the available training data. Experiments on 58 apps and 309,246 review-response
pairs highlight that RRGen outperforms the baselines by at least 67.4% in terms
of BLEU-4 (an accuracy measure that is widely used to evaluate dialogue
response generation systems). Qualitative analysis also confirms the
effectiveness of RRGen in generating relevant and accurate responses.
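As described above, RRGen is a supervised encoder-decoder that explicitly conditions response generation on review attributes such as the user rating and review length. Below is a minimal PyTorch sketch of one way such attribute conditioning can be wired in; the layer types, sizes, and fusion scheme are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of an attribute-conditioned encoder-decoder in the
# spirit of RRGen. Layer types, sizes, and the fusion scheme are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class AttrSeq2Seq(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256,
                 num_ratings=5, attr_dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        # Review attributes: the star rating (categorical) and the review
        # length (scalar) are mapped to small vectors and fused into the
        # decoder's initial state -- one plausible reading of "explicitly
        # incorporates review attributes".
        self.rating_emb = nn.Embedding(num_ratings, attr_dim)
        self.len_proj = nn.Linear(1, attr_dim)
        self.fuse = nn.Linear(hid_dim + 2 * attr_dim, hid_dim)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, review_ids, rating, review_len, response_ids):
        # review_ids: (B, T) token ids; rating: (B,) ints; review_len: (B,)
        # floats; response_ids: (B, T') teacher-forced decoder input.
        _, h = self.encoder(self.embed(review_ids))            # h: (1, B, H)
        attrs = torch.cat([self.rating_emb(rating),
                           self.len_proj(review_len.unsqueeze(-1))], dim=-1)
        h0 = torch.tanh(self.fuse(torch.cat([h[-1], attrs], dim=-1)))
        dec_out, _ = self.decoder(self.embed(response_ids), h0.unsqueeze(0))
        return self.out(dec_out)  # per-step vocabulary logits, (B, T', V)
```

Training would then be standard cross-entropy against the gold response tokens, i.e., supervised learning from review-response pairs as the abstract states. Since BLEU-4 is the headline metric, here is a generic way to compute it with NLTK's `corpus_bleu` (not necessarily the authors' scoring script):

```python
# Generic BLEU-4 scoring with NLTK's corpus_bleu; the example tokens
# are invented for illustration.
from nltk.translate.bleu_score import corpus_bleu

refs = [[["thank", "you", "for", "the", "feedback", "!"]]]  # reference list per example
hyps = [["thanks", "for", "the", "feedback", "!"]]
print(corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25)))  # BLEU-4
```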
Related papers
- Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations [85.81295563405433]
Language model users often issue under-specified queries, where the context in which a query was issued is not made explicit.
We present contextualized evaluations, a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation.
We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
arXiv Detail & Related papers (2024-11-11T18:58:38Z)
- Prompt Optimization with Human Feedback [69.95991134172282]
We study the problem of prompt optimization with human feedback (POHF).
We introduce an algorithm named automated POHF (APOHF).
The results demonstrate that APOHF can efficiently find a good prompt using only a small number of preference-feedback instances.
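As a rough illustration of the POHF setting, the sketch below selects among candidate prompts using only pairwise human preferences; APOHF itself uses a more principled selection strategy, so the uniform pair sampling and win-rate estimate here are stand-in assumptions.

```python
# Toy preference-feedback loop for prompt selection. APOHF uses a more
# principled strategy; this is illustrative only.
import random
from collections import defaultdict

def select_prompt(candidates, ask_preference, budget=20):
    wins, plays = defaultdict(int), defaultdict(int)
    for _ in range(budget):
        a, b = random.sample(candidates, 2)   # pick a pair of prompts
        winner = ask_preference(a, b)         # human says which did better
        wins[winner] += 1
        plays[a] += 1
        plays[b] += 1
    # Return the candidate with the best empirical win rate.
    return max(candidates, key=lambda p: wins[p] / max(plays[p], 1))
```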
arXiv Detail & Related papers (2024-05-27T16:49:29Z)
- Self-Improving Customer Review Response Generation Based on LLMs [1.9274286238176854]
SCRABLE is an adaptive customer-review response automation system that improves itself through self-optimizing prompts.
We introduce an automatic scoring mechanism that mimics the role of a human evaluator to assess the quality of responses generated in customer review domains.
arXiv Detail & Related papers (2024-05-06T20:50:17Z)
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of the turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- RefuteBench: Evaluating Refuting Instruction-Following for Large Language Models [17.782410287625645]
This paper proposes a benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing.
The evaluation aims to assess whether models can positively accept feedback in the form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation.
arXiv Detail & Related papers (2024-02-21T01:39:56Z)
- Proactive Prioritization of App Issues via Contrastive Learning [2.6763498831034043]
We propose a new framework, PPrior, that enables proactive prioritization of app issues through identifying prominent reviews.
PPrior employs a pre-trained T5 model and works in three phases.
Phase one adapts the pre-trained T5 model to the user-review data in a self-supervised fashion.
Phase two leverages contrastive training to learn a generic, task-independent representation of user reviews.
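The contrastive phase could be realized with a standard InfoNCE-style objective over pairs of review embeddings, as sketched below; the paper's exact loss and pairing scheme may differ.

```python
# InfoNCE-style contrastive loss over review embeddings, one standard
# way to realize "contrastive training"; PPrior's exact objective and
# pairing scheme may differ.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    # anchor, positive: (B, D) embeddings of two views of the same review;
    # for each row, the matching row in `positive` is the positive pair
    # and all other rows act as in-batch negatives.
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature                   # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)  # diagonal is positive
    return F.cross_entropy(logits, labels)
```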
arXiv Detail & Related papers (2023-03-12T06:23:10Z)
- Meaningful Answer Generation of E-Commerce Question-Answering [77.89755281215079]
In e-commerce portals, generating answers for product-related questions has become a crucial task.
In this paper, we propose a novel generative neural model, called the Meaningful Product Answer Generator (MPAG).
MPAG alleviates the safe answer problem by taking product reviews, product attributes, and a prototype answer into consideration.
arXiv Detail & Related papers (2020-11-14T14:05:30Z)
- E-commerce Query-based Generation based on User Review [1.484852576248587]
We propose a novel seq2seq-based text generation model to generate answers to a user's question based on reviews posted by previous users.
Given a user question and/or target sentiment polarity, we extract aspects of interest and generate an answer that summarizes previous relevant user reviews.
arXiv Detail & Related papers (2020-11-11T04:58:31Z)
- App-Aware Response Synthesis for User Reviews [7.466973484411213]
AARSynth is an app-aware response synthesis system.
It retrieves the top-K most relevant app reviews and the most relevant snippet from the app description.
A fused machine learning model integrates the seq2seq model with a machine reading comprehension model.
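The retrieval step described here can be approximated with simple TF-IDF similarity, as in the hedged sketch below; AARSynth's actual retriever and fused seq2seq/reading-comprehension model are more sophisticated.

```python
# Plausible stand-in for the top-K retrieval step: rank past reviews by
# TF-IDF cosine similarity to the incoming review. Not AARSynth's
# actual retriever.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_reviews(query, past_reviews, k=5):
    vec = TfidfVectorizer().fit(past_reviews + [query])
    sims = cosine_similarity(vec.transform([query]),
                             vec.transform(past_reviews))[0]
    return [past_reviews[i] for i in sims.argsort()[::-1][:k]]
```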
arXiv Detail & Related papers (2020-07-31T01:28:02Z)
- Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation [69.03658685761538]
Open-domain dialog system evaluation is one of the most important challenges in dialog research.
We propose an automatic evaluation model CMADE that automatically cleans self-reported user ratings as it trains on them.
Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.
arXiv Detail & Related papers (2020-05-21T15:14:49Z)