Related papers: Towards Extracting Software Requirements from App Reviews using Seq2seq Framework

Towards Extracting Software Requirements from App Reviews using Seq2seq Framework

URL: http://arxiv.org/abs/2507.09039v2
Date: Sun, 20 Jul 2025 04:43:55 GMT
Title: Towards Extracting Software Requirements from App Reviews using Seq2seq Framework
Authors: Aakash Sorathiya, Gouri Ginde,
Abstract summary: We propose a Named Entity Recognition (NER) task based on the sequence-to-sequence (Seq2seq) generation approach.<n>With this aim, we propose a Seq2seq framework, incorporating a BiLSTM encoder and an LSTM decoder, enhanced with a self-attention mechanism, GloVe embeddings, and a CRF model.<n>We evaluate our framework on two datasets: a manually annotated set of 1,000 reviews and a crowdsourced set of 23,816 reviews.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Mobile app reviews are a large-scale data source for software improvements. A key task in this context is effectively extracting requirements from app reviews to analyze the users' needs and support the software's evolution. Recent studies show that existing methods fail at this task since app reviews usually contain informal language, grammatical and spelling errors, and a large amount of irrelevant information that might not have direct practical value for developers. To address this, we propose a novel reformulation of requirements extraction as a Named Entity Recognition (NER) task based on the sequence-to-sequence (Seq2seq) generation approach. With this aim, we propose a Seq2seq framework, incorporating a BiLSTM encoder and an LSTM decoder, enhanced with a self-attention mechanism, GloVe embeddings, and a CRF model. We evaluated our framework on two datasets: a manually annotated set of 1,000 reviews (Dataset 1) and a crowdsourced set of 23,816 reviews (Dataset 2). The quantitative evaluation of our framework showed that it outperformed existing state-of-the-art methods with an F1 score of 0.96 on Dataset 2, and achieved comparable performance on Dataset 1 with an F1 score of 0.47.

Related papers

Evaluating Large Language Models on Non-Code Software Engineering Tasks [4.381476817430934]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation.<n>We present the first comprehensive benchmark, which we name Software Engineering Language Understanding' (SELU)<n>SELU covers classification, regression, Named Entity Recognition (NER) and Masked Language Modeling (MLM) targets, with data drawn from diverse sources.
arXiv Detail & Related papers (2025-06-12T15:52:32Z)
CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming [56.17331530444765]
CPRet is a retrieval-oriented benchmark suite for competitive programming.<n>It covers four retrieval tasks: two code-centric (i.e., Text-to-Code and Code-to-Code) and two newly proposed problem-centric tasks (i.e., Problem-to-Duplicate and Simplified-to-Full)<n>Our contribution includes both high-quality training data and temporally separated test sets for reliable evaluation.
arXiv Detail & Related papers (2025-05-19T10:07:51Z)
Grounding Synthetic Data Evaluations of Language Models in Unsupervised Document Corpora [9.871701356351542]
Language Models (LMs) continue to advance, improving response quality and coherence.<n>A plethora of evaluation benchmarks have been constructed to assess model quality, response appropriateness, and reasoning capabilities.<n>We propose a methodology for automating the construction of fact-based synthetic data model evaluations grounded in document populations.
arXiv Detail & Related papers (2025-05-13T18:50:03Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
START: Self-taught Reasoner with Tools [51.38785489790888]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long Chain-of-thought (CoT) reasoning LLM.<n> START is capable of performing complex computations, self-checking, exploring diverse methods, and self-ging.<n>It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
arXiv Detail & Related papers (2025-03-06T17:11:51Z)
QuIM-RAG: Advancing Retrieval-Augmented Generation with Inverted Question Matching for Enhanced QA Performance [1.433758865948252]
This work presents a novel architecture for building Retrieval-Augmented Generation (RAG) systems.<n>RAG architecture is constructed to generate responses from the target document.<n>We introduce QuIM-RAG, a novel approach for the retrieval mechanism in our system.
arXiv Detail & Related papers (2025-01-06T01:07:59Z)
How to Evaluate Entity Resolution Systems: An Entity-Centric Framework with Application to Inventor Name Disambiguation [1.7812428873698403]
We propose an entity-centric data labeling methodology that integrates with a unified framework for monitoring summary statistics. These benchmark data sets can then be used for model training and a variety of evaluation tasks.
arXiv Detail & Related papers (2024-04-08T15:53:29Z)
T-FREX: A Transformer-based Feature Extraction Method from Mobile App Reviews [5.235401361674881]
We present T-FREX, a Transformer-based, fully automatic approach for mobile app review feature extraction. First, we collect a set of ground truth features from users in a real crowdsourced software recommendation platform. Then, we use this newly created dataset to fine-tune multiple LLMs on a named entity recognition task.
arXiv Detail & Related papers (2024-01-08T11:43:03Z)
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages. We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts. We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
Structured Multimodal Attentions for TextVQA [57.71060302874151]
We propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above. SMA first uses a structural graph representation to encode the object-object, object-text and text-text relationships appearing in the image, and then designs a multimodal graph attention network to reason over it. Our proposed model outperforms the SoTA models on TextVQA dataset and two tasks of ST-VQA dataset among all models except pre-training based TAP.
arXiv Detail & Related papers (2020-06-01T07:07:36Z)
Dynamic Refinement Network for Oriented and Densely Packed Object Detection [75.29088991850958]
We present a dynamic refinement network that consists of two novel components, i.e., a feature selection module (FSM) and a dynamic refinement head (DRH) Our FSM enables neurons to adjust receptive fields in accordance with the shapes and orientations of target objects, whereas the DRH empowers our model to refine the prediction dynamically in an object-aware manner. We perform quantitative evaluations on several publicly available benchmarks including DOTA, HRSC2016, SKU110K, and our own SKU110K-R dataset.
arXiv Detail & Related papers (2020-05-20T11:35:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.