Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification
- URL: http://arxiv.org/abs/2508.09832v1
- Date: Wed, 13 Aug 2025 14:07:05 GMT
- Title: Exploring the Potential of Large Language Models in Fine-Grained Review Comment Classification
- Authors: Linh Nguyen, Chunhua Liu, Hong Yi Lin, Patanamon Thongtanunam,
- Abstract summary: Large Language Models (LLMs) can classify 17 categories of code review comments. LLMs achieve better accuracy in classifying the five most useful categories. These results suggest that LLMs could offer a scalable solution for code review analytics.
- Score: 4.61232919707345
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code review is a crucial practice in software development. Because modern code review is lightweight, reviewers identify a wide variety of issues, some of which are trivial. Research has investigated automated approaches to classify review comments in order to gauge the effectiveness of code reviews. However, previous studies have primarily relied on supervised machine learning, which requires extensive manual annotation to train models effectively. To address this limitation, we explore the potential of using Large Language Models (LLMs) to classify code review comments. We assess the performance of LLMs in classifying 17 categories of code review comments. Our results show that LLMs can classify code review comments, outperforming the state-of-the-art approach based on a trained deep learning model. In particular, LLMs achieve better accuracy in classifying the five most useful categories, with which the state-of-the-art approach struggles due to few training examples. Rather than depending on a specific small training data distribution, LLMs provide balanced performance across high- and low-frequency categories. These results suggest that LLMs could offer a scalable solution for code review analytics and improve the effectiveness of the code review process.
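To make the prompt-based classification setup concrete, here is a minimal sketch of how an LLM could be asked to label a single review comment. It is an illustration, not the paper's pipeline: the category names, the model choice, and the use of the OpenAI chat-completions client are assumptions introduced here, since the abstract does not enumerate the 17-category taxonomy or the models evaluated.

```python
# Minimal sketch: zero-shot classification of a code review comment with an LLM.
# The categories below are illustrative placeholders (the paper uses a
# 17-category taxonomy not listed in the abstract); the model name and the
# OpenAI backend are assumptions, not the authors' actual setup.
from openai import OpenAI

CATEGORIES = [
    "functional defect", "refactoring", "documentation", "code style",
    # ... remaining categories of the taxonomy would be listed here
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def classify_review_comment(comment: str, model: str = "gpt-4o-mini") -> str:
    """Ask the model to map one review comment to exactly one category."""
    prompt = (
        "Classify the following code review comment into exactly one of "
        f"these categories: {', '.join(CATEGORIES)}.\n\n"
        f"Comment: {comment}\n\n"
        "Respond with the category name only."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output helps when scoring against labels
    )
    return response.choices[0].message.content.strip()


# Example usage:
# print(classify_review_comment("This loop re-reads the config file on every iteration; cache it."))
```

A low temperature and a constrained answer format make it straightforward to compare predictions against manually labeled ground truth, which is how per-category classification accuracy would typically be measured.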
Related papers
- Can LLM Annotations Replace User Clicks for Learning to Rank? [112.2254432364736]
Large-scale supervised data is essential for training modern ranking models, but obtaining high-quality human annotations is costly. Click data has been widely used as a low-cost alternative, and with recent advances in large language models (LLMs), LLM-based relevance annotation has emerged as another promising alternative. Experiments on both a public dataset, TianGong-ST, and an industrial dataset, Baidu-Click, show that click-supervised models perform better on high-frequency queries. We explore two training strategies -- data scheduling and frequency-aware multi-objective learning -- that integrate both supervision signals.
arXiv Detail & Related papers (2025-11-10T02:26:14Z)
- What Types of Code Review Comments Do Developers Most Frequently Resolve? [10.277847378685161]
Large language model (LLM)-powered code review automation tools have been introduced to generate code review comments. This paper investigates the types of review comments written by humans and LLMs, and the types of generated comments that are most frequently resolved by developers.
arXiv Detail & Related papers (2025-10-06T23:32:26Z)
- CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks [63.562924932512765]
Large Language Models (LLMs) have advanced the state-of-the-art in various coding tasks. LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models.
arXiv Detail & Related papers (2025-07-14T17:56:29Z)
- Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z)
- Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation [14.521056434373213]
The use of large language models as evaluators has expanded to code evaluation tasks. This raises a critical, unresolved question: can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation.
arXiv Detail & Related papers (2025-05-22T04:49:33Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect code and analyzed the root causes of common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
- AI-powered Code Review with LLMs: Early Results [10.37036924997437]
We present a novel approach to improving software quality and efficiency through a Large Language Model (LLM)-based model.
Our proposed LLM-based AI agent model is trained on large code repositories.
It aims to detect code smells, identify potential bugs, provide suggestions for improvement, and optimize the code.
arXiv Detail & Related papers (2024-04-29T08:27:50Z)
- Automating Patch Set Generation from Code Review Comments Using Large Language Models [2.045040820541428]
We provide code contexts to five popular Large Language Models (LLMs).
We obtain the suggested code changes (patch sets) derived from real-world code review comments.
The performance of each model is meticulously assessed by comparing its generated patch sets against historical human-generated patch sets.
arXiv Detail & Related papers (2024-04-10T02:46:08Z)
- Reasoning Runtime Behavior of a Program with LLM: How Far Are We? [25.451857140926943]
Large language models for code (i.e., code LLMs) have shown strong code understanding and generation capabilities.
Code reasoning is one of the most essential abilities of code LLMs.
We propose a framework, namely REval, for evaluating code reasoning abilities and consistency of code LLMs with program execution.
arXiv Detail & Related papers (2024-03-25T05:37:16Z)
- InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models [56.723509505549536]
InfiBench is, to our knowledge, the first large-scale free-form question-answering (QA) benchmark for code.
It comprises 234 carefully selected high-quality Stack Overflow questions spanning 15 programming languages.
We conduct a systematic evaluation of over 100 recent code LLMs on InfiBench, leading to a series of novel and insightful findings.
arXiv Detail & Related papers (2024-03-11T02:06:30Z)
- Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator to discern instruction-following outputs.
We discover that different evaluators exhibit distinct performance on LLMBar, and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
- LLMRec: Benchmarking Large Language Models on Recommendation Task [54.48899723591296]
The application of Large Language Models (LLMs) in the recommendation domain has not been thoroughly investigated.
We benchmark several popular off-the-shelf LLMs on five recommendation tasks, including rating prediction, sequential recommendation, direct recommendation, explanation generation, and review summarization.
The benchmark results indicate that LLMs displayed only moderate proficiency in accuracy-based tasks such as sequential and direct recommendation.
arXiv Detail & Related papers (2023-08-23T16:32:54Z)