Zero-shot Bilingual App Reviews Mining with Large Language Models
- URL: http://arxiv.org/abs/2311.03058v1
- Date: Mon, 6 Nov 2023 12:36:46 GMT
- Title: Zero-shot Bilingual App Reviews Mining with Large Language Models
- Authors: Jialiang Wei, Anne-Lise Courbis, Thomas Lambolais, Binbin Xu, Pierre Louis Bernard, Gérard Dray
- Abstract summary: Mini-BAR is a tool that integrates large language models (LLMs) to perform zero-shot mining of user reviews in both English and French.
To evaluate the performance of Mini-BAR, we created a dataset containing 6,000 English and 6,000 French annotated user reviews.
- Score: 0.7340017786387767
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: App reviews from app stores are crucial for improving software requirements.
A large number of valuable reviews are continually being posted, describing
software problems and expected features. Effectively utilizing user reviews
necessitates the extraction of relevant information, as well as their
subsequent summarization. Due to the substantial volume of user reviews, manual
analysis is arduous. Various approaches based on natural language processing
(NLP) have been proposed for automatic user review mining. However, the
majority of them require a manually crafted dataset to train their models,
which limits their usage in real-world scenarios. In this work, we propose
Mini-BAR, a tool that integrates large language models (LLMs) to perform
zero-shot mining of user reviews in both English and French. Specifically,
Mini-BAR is designed to (i) classify the user reviews, (ii) cluster similar
reviews together, (iii) generate an abstractive summary for each cluster and
(iv) rank the user review clusters. To evaluate the performance of Mini-BAR, we
created a dataset containing 6,000 English and 6,000 French annotated user
reviews and conducted extensive experiments. Preliminary results demonstrate
the effectiveness and efficiency of Mini-BAR in requirement engineering by
analyzing bilingual app reviews. (Replication package containing the code,
dataset, and experiment setups on https://github.com/Jl-wei/mini-bar )
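The following is a minimal sketch, not the authors' implementation, of the first two steps the abstract describes (zero-shot classification of English/French reviews with an LLM, then clustering of similar reviews). It assumes an OpenAI-compatible chat API plus the sentence-transformers and scikit-learn libraries; the category labels, model names, and distance threshold are illustrative choices, not taken from the paper.

```python
# Hedged sketch of a zero-shot bilingual review-mining pipeline in the spirit of
# Mini-BAR: (i) classify each review with an LLM prompt, (ii) cluster similar
# reviews with multilingual sentence embeddings. All model names and labels
# below are assumptions for illustration, not the paper's actual configuration.
from openai import OpenAI
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

CATEGORIES = ["feature request", "bug report", "other"]  # illustrative labels

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def classify_review(review: str) -> str:
    """Zero-shot classification of a single English or French review."""
    prompt = (
        "Classify the following app review into exactly one category "
        f"({', '.join(CATEGORIES)}). Answer with the category name only.\n\n"
        f"Review: {review}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model; any chat-completion model works
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()


def cluster_reviews(reviews: list[str], distance_threshold: float = 1.0) -> list[int]:
    """Group semantically similar reviews using multilingual embeddings."""
    encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    embeddings = encoder.encode(reviews)
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    )
    return clustering.fit_predict(embeddings).tolist()


if __name__ == "__main__":
    reviews = [
        "The app crashes every time I open the camera.",
        "J'aimerais pouvoir exporter mes données en CSV.",  # French feature request
    ]
    labels = [classify_review(r) for r in reviews]
    clusters = cluster_reviews(reviews)
    print(list(zip(reviews, labels, clusters)))
```

The abstractive summarization and ranking steps could be layered on top by prompting the same LLM once per cluster and ordering clusters by size, but those steps are omitted here to keep the sketch short.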
Related papers
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Can GitHub Issues Help in App Review Classifications? [0.7366405857677226]
We propose a novel approach that assists in augmenting labeled datasets by utilizing information extracted from GitHub issues.
Our results demonstrate that using labeled issues for data augmentation can improve the F1-score by 6.3 points for bug reports and 7.2 points for feature requests.
arXiv Detail & Related papers (2023-08-27T22:01:24Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Evaluating the Effectiveness of Pre-trained Language Models in Predicting the Helpfulness of Online Product Reviews [0.21485350418225244]
We compare the use of RoBERTa and XLM-R language models to predict the helpfulness of online product reviews.
We employ the Amazon review dataset for our experiments.
arXiv Detail & Related papers (2023-02-19T18:22:59Z)
- Towards a Data-Driven Requirements Engineering Approach: Automatic Analysis of User Reviews [0.440401067183266]
We provide an automated analysis using CamemBERT, which is a state-of-the-art language model in French.
We created a multi-label classification dataset of 6000 user reviews from three applications in the Health & Fitness field.
The results are encouraging and suggest that reviews requesting new features can be identified automatically.
arXiv Detail & Related papers (2022-06-29T14:14:54Z)
- ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models [102.63817106363597]
We build ELEVATER, the first benchmark to compare and evaluate pre-trained language-augmented visual models.
It consists of 20 image classification datasets and 35 object detection datasets, each of which is augmented with external knowledge.
We will release our toolkit and evaluation platforms for the research community.
arXiv Detail & Related papers (2022-04-19T10:23:42Z)
- Learning Opinion Summarizers by Selecting Informative Reviews [81.47506952645564]
We collect a large dataset of summaries paired with user reviews for over 31,000 products, enabling supervised training.
The content of many reviews is not reflected in the human-written summaries, and, thus, the summarizer trained on random review subsets hallucinates.
We formulate the task as jointly learning to select informative subsets of reviews and summarizing the opinions expressed in these subsets.
arXiv Detail & Related papers (2021-09-09T15:01:43Z)
- Transfer Learning for Mining Feature Requests and Bug Reports from Tweets and App Store Reviews [4.446419663487345]
Existing approaches fail to detect feature requests and bug reports with high Recall and acceptable Precision.
We train both monolingual and multilingual BERT models and compare the performance with state-of-the-art methods.
arXiv Detail & Related papers (2021-08-02T06:51:13Z)
- Unsupervised Opinion Summarization with Noising and Denoising [85.49169453434554]
We create a synthetic dataset from a corpus of user reviews by sampling a review, pretending it is a summary, and generating noisy versions thereof.
At test time, the model accepts genuine reviews and generates a summary containing salient opinions, treating those that do not reach consensus as noise.
arXiv Detail & Related papers (2020-04-21T16:54:57Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.