Enhancement Report Approval Prediction: A Comparative Study of Large Language Models
- URL: http://arxiv.org/abs/2506.15098v1
- Date: Wed, 18 Jun 2025 03:08:04 GMT
- Title: Enhancement Report Approval Prediction: A Comparative Study of Large Language Models
- Authors: Haosheng Zuo, Feifei Niu, Chuanyi Li,
- Abstract summary: Enhancement reports (ERs) serve as a critical communication channel between users and developers, capturing valuable suggestions for software improvement.<n>To address this challenge, enhancement report approval prediction (ERAP) has emerged as a research focus.<n>Recent advancements in large language models (LLM) present new opportunities for enhancing prediction accuracy.
- Score: 10.243182983724585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enhancement reports (ERs) serve as a critical communication channel between users and developers, capturing valuable suggestions for software improvement. However, manually processing these reports is resource-intensive, leading to delays and potential loss of valuable insights. To address this challenge, enhancement report approval prediction (ERAP) has emerged as a research focus, leveraging machine learning techniques to automate decision-making. While traditional approaches have employed feature-based classifiers and deep learning models, recent advancements in large language models (LLM) present new opportunities for enhancing prediction accuracy. This study systematically evaluates 18 LLM variants (including BERT, RoBERTa, DeBERTa-v3, ELECTRA, and XLNet for encoder models; GPT-3.5-turbo, GPT-4o-mini, Llama 3.1 8B, Llama 3.1 8B Instruct and DeepSeek-V3 for decoder models) against traditional methods (CNN/LSTM-BERT/GloVe). Our experiments reveal two key insights: (1) Incorporating creator profiles increases unfine-tuned decoder-only models' overall accuracy by 10.8 percent though it may introduce bias; (2) LoRA fine-tuned Llama 3.1 8B Instruct further improve performance, reaching 79 percent accuracy and significantly enhancing recall for approved reports (76.1 percent vs. LSTM-GLOVE's 64.1 percent), outperforming traditional methods by 5 percent under strict chronological evaluation and effectively addressing class imbalance issues. These findings establish LLM as a superior solution for ERAP, demonstrating their potential to streamline software maintenance workflows and improve decision-making in real-world development environments. We also investigated and summarized the ER cases where the large models underperformed, providing valuable directions for future research.
Related papers
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better.<n>TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks.<n>We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z) - Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback [59.078756231841574]
Critique-GRPO is an online RL framework that integrates both natural language and numerical feedback for effective policy optimization.<n>We show Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks.
arXiv Detail & Related papers (2025-06-03T17:39:02Z) - Reliable Decision Support with LLMs: A Framework for Evaluating Consistency in Binary Text Classification Applications [0.7124971549479361]
This study introduces a framework for evaluating consistency in large language model (LLM) binary text classification.<n>We determine sample size requirements, develop metrics for invalid responses, and evaluate intra- and inter-rater reliability.
arXiv Detail & Related papers (2025-05-20T21:12:58Z) - R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step.<n>Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy.<n>We propose Reasoning-Driven Process Reward Modeling (R-PRM)<n>R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z) - RECSIP: REpeated Clustering of Scores Improving the Precision [0.0]
We introduce the framework REpeated Clustering of Scores Improving the Precision (RECSIP)<n>RECSIP focuses on improving the precision of Large Language Models (LLMs) by asking multiple models in parallel, scoring and clustering their responses to ensure a higher reliability on the response.<n>The evaluation of our reference implementation recsip on the benchmark MMLU-Pro using the models GPT-4o, Claude and Gemini shows an overall increase of 5.8 per cent points compared to the best used model.
arXiv Detail & Related papers (2025-03-15T12:36:32Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.<n>These findings highlight the reliance on recall over rigorous logical inference.<n>This paper introduces a novel benchmark, termed as Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation [58.546205554954454]
We propose Enhancing Alignment in MLLMs via Critical Observation (EACO)<n>EACO aligns MLLMs by self-generated preference data using only 5k images economically.<n>EACO reduces the overall hallucinations by 65.6% on HallusionBench and improves the reasoning ability by 21.8% on MME-Cognition.
arXiv Detail & Related papers (2024-12-06T09:59:47Z) - Can Large Language Model Predict Employee Attrition? [0.0]
This study compares the predictive accuracy and interpretability of a fine-tuned GPT-3.5 model against traditional machine learning (ML)
Our findings show that the fine-tuned GPT-3.5 model outperforms traditional methods with a precision of 0.91, recall of 0.94, and an F1-score of 0.92, while the best traditional model, SVM, achieved an F1-score of 0.82, with Random Forest and XGBoost reaching 0.80.
arXiv Detail & Related papers (2024-11-02T19:50:39Z) - A Comprehensive Evaluation of Large Language Models on Mental Illnesses [0.8458496687170665]
GPT-4 and Llama 3 exhibited superior performance in binary disorder detection, with accuracies reaching up to 85% on certain datasets.
prompt engineering played a crucial role in enhancing model performance.
Despite promising results, our analysis identified several challenges, including variability in performance across datasets and the need for careful prompt engineering.
arXiv Detail & Related papers (2024-09-24T02:58:52Z) - Self-Taught Evaluators [77.92610887220594]
We present an approach that aims to im-proves without human annotations, using synthetic training data only.
Our Self-Taught Evaluator can improve a strong LLM from 75.4 to 88.3 on RewardBench.
arXiv Detail & Related papers (2024-08-05T17:57:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.