Towards Automated Classification of Code Review Feedback to Support
Analytics
- URL: http://arxiv.org/abs/2307.03852v1
- Date: Fri, 7 Jul 2023 21:53:20 GMT
- Title: Towards Automated Classification of Code Review Feedback to Support
Analytics
- Authors: Asif Kamal Turzo and Fahim Faysal and Ovi Poddar and Jaydeb Sarker and
Anindya Iqbal and Amiangshu Bosu
- Abstract summary: This study aims to develop an automated code review comment classification system.
We trained and evaluated supervised learning-based DNN models leveraging code context, comment text, and a set of code metrics.
Our approach outperforms Fregnan et al.'s approach by achieving 18.7% higher accuracy.
- Score: 4.423428708304586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: As improving code review (CR) effectiveness is a priority for
many software development organizations, projects have deployed CR analytics
platforms to identify potential improvement areas. The number of issues
identified, which is a crucial metric to measure CR effectiveness, can be
misleading if all issues are placed in the same bin. Therefore, a finer-grained
classification of issues identified during CRs can provide actionable insights
to improve CR effectiveness. Although a recent work by Fregnan et al. proposed
automated models to classify CR-induced changes, we have noticed two potential
improvement areas -- i) classifying comments that do not induce changes and ii)
using deep neural networks (DNN) in conjunction with code context to improve
performance. Aims: This study aims to develop an automated CR comment
classifier that leverages DNN models to achieve a more reliable performance
than Fregnan et al. Method: Using a manually labeled dataset of 1,828 CR
comments, we trained and evaluated supervised learning-based DNN models
leveraging code context, comment text, and a set of code metrics to classify CR
comments into one of the five high-level categories proposed by Turzo and Bosu.
Results: Based on our 10-fold cross-validation evaluations of multiple
combinations of tokenization approaches, we found that a model using CodeBERT
achieved the best accuracy of 59.3%. Our approach outperforms Fregnan et al.'s
approach, achieving 18.7% higher accuracy. Conclusion: Besides facilitating
improved CR analytics, our proposed model can be useful for developers in
prioritizing code review feedback and selecting reviewers.
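For readers who want a concrete starting point, the sketch below shows a minimal version of the text branch of such a classifier: fine-tuning CodeBERT on each review comment paired with its code context and evaluating with 10-fold cross-validation. It omits the code-metric features described in the paper, and the file name, column names, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): fine-tune CodeBERT to classify
# code review comments into five high-level categories, assuming a CSV with
# hypothetical columns 'code_context', 'comment_text', and integer 'label'.
import pandas as pd
import torch
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

NUM_LABELS = 5  # the five high-level categories proposed by Turzo and Bosu
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

class CRCommentDataset(torch.utils.data.Dataset):
    """Encodes each comment together with its code context as a sentence pair."""
    def __init__(self, contexts, comments, labels):
        self.enc = tokenizer(list(contexts), list(comments),
                             truncation=True, padding=True, max_length=512)
        self.labels = list(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

df = pd.read_csv("cr_comments.csv")  # hypothetical path to the labeled data
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_accuracies = []

for fold, (train_idx, test_idx) in enumerate(skf.split(df, df["label"])):
    train_ds = CRCommentDataset(df.iloc[train_idx]["code_context"],
                                df.iloc[train_idx]["comment_text"],
                                df.iloc[train_idx]["label"])
    test_ds = CRCommentDataset(df.iloc[test_idx]["code_context"],
                               df.iloc[test_idx]["comment_text"],
                               df.iloc[test_idx]["label"])

    model = AutoModelForSequenceClassification.from_pretrained(
        "microsoft/codebert-base", num_labels=NUM_LABELS)
    args = TrainingArguments(output_dir=f"fold_{fold}", num_train_epochs=3,
                             per_device_train_batch_size=16,
                             logging_steps=50, report_to="none")
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()

    preds = trainer.predict(test_ds).predictions.argmax(axis=-1)
    fold_accuracies.append(accuracy_score(df.iloc[test_idx]["label"], preds))

print("Mean 10-fold accuracy:", sum(fold_accuracies) / len(fold_accuracies))
```

Feeding the code context and the comment to the tokenizer as a sentence pair lets the model attend to both inputs jointly, which mirrors the paper's motivation for combining code context with comment text.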
Related papers
- Harnessing Large Language Models for Curated Code Reviews [2.5944208050492183]
In code review, generating structured and relevant comments is crucial for identifying code issues and facilitating accurate code changes.
Existing code review datasets are often noisy and unrefined, posing limitations to the learning potential of AI models.
We propose a curation pipeline designed to enhance the quality of the largest publicly available code review dataset.
arXiv Detail & Related papers (2025-02-05T18:15:09Z) - RealCritic: Towards Effectiveness-Driven Evaluation of Language Model Critiques [59.861013614500024]
We introduce a new benchmark designed to assess the critique capabilities of Large Language Models (LLMs).
Unlike existing benchmarks, which typically function in an open-loop fashion, our approach employs a closed-loop methodology that evaluates the quality of corrections generated from critiques.
arXiv Detail & Related papers (2025-01-24T13:48:10Z) - Hold On! Is My Feedback Useful? Evaluating the Usefulness of Code Review Comments [0.0]
This paper investigates the usefulness of Code Review Comments (CR comments) through textual feature-based and featureless approaches.
Our models outperform the baseline by achieving state-of-the-art performance.
Our analyses portray the similarities and differences of domains, projects, datasets, models, and features for predicting the usefulness of CR comments.
arXiv Detail & Related papers (2025-01-12T07:22:13Z) - Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities.
It self-improves by training on synthetic data, generated by a contrastive-based self-critic.
It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks [68.49251303172674]
State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness.
Existing methods harness the strengths of chain-of-thought and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness.
We introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning.
arXiv Detail & Related papers (2024-10-02T11:26:02Z) - Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z) - Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward
Comprehensive Benchmarks [60.82579717007963]
We introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of Graph Contrastive Learning (GCL) methods.
arXiv Detail & Related papers (2024-02-24T01:47:56Z) - Reassessing Java Code Readability Models with a Human-Centered Approach [3.798885293742468]
This research assesses existing Java Code Readability (CR) models for Large Language Model (LLM) adjustments.
We identify 12 key code aspects influencing CR that were assessed by 390 programmers when labeling 120 AI-generated snippets.
Our findings indicate that when AI generates concise and executable code, it is often considered readable by CR models and developers.
arXiv Detail & Related papers (2024-01-26T15:18:22Z) - What Makes a Code Review Useful to OpenDev Developers? An Empirical
Investigation [4.061135251278187]
Even a minor improvement in the effectiveness of Code Reviews can incur significant savings for a software development organization.
This study aims to develop a finer-grained understanding of what makes a code review comment useful to OSS developers.
arXiv Detail & Related papers (2023-02-22T22:48:27Z) - Adversarial Feature Augmentation and Normalization for Visual
Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - CRACT: Cascaded Regression-Align-Classification for Robust Visual
Tracking [97.84109669027225]
We introduce an improved proposal refinement module, Cascaded Regression-Align-Classification (CRAC).
CRAC yields new state-of-the-art performance on many benchmarks.
In experiments on seven benchmarks including OTB-2015, UAV123, NfS, VOT-2018, TrackingNet, GOT-10k and LaSOT, our CRACT exhibits very promising results in comparison with state-of-the-art competitors.
arXiv Detail & Related papers (2020-11-25T02:18:33Z)