Issue-Oriented Agent-Based Framework for Automated Review Comment Generation
- URL: http://arxiv.org/abs/2511.00517v1
- Date: Sat, 01 Nov 2025 11:44:11 GMT
- Title: Issue-Oriented Agent-Based Framework for Automated Review Comment Generation
- Authors: Shuochuan Li, Dong Wang, Patanamon Thongtanunam, Zan Wang, Jiuqiao Yu, Junjie Chen,
- Abstract summary: RevAgent is a novel agent-based issue-oriented framework for code review comments. It decomposes the task into three stages: Generation, Discrimination, and Training. It significantly outperforms state-of-the-art PLM- and LLM-based baselines.
- Score: 15.04868140672973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code review (CR) is a crucial practice for ensuring software quality. Various automated review comment generation techniques have been proposed to streamline this labor-intensive process. However, existing approaches rely heavily on a single model to identify the various issues within the code, limiting the model's ability to handle the diverse, issue-specific nature of code changes and leading to non-informative comments, especially in complex scenarios such as bug fixes. To address these limitations, we propose RevAgent, a novel agent-based issue-oriented framework that decomposes the task into three stages: (1) a Generation Stage, where five category-specific commentator agents analyze code changes from distinct issue perspectives and generate candidate comments; (2) a Discrimination Stage, where a critic agent selects the most appropriate issue-comment pair; and (3) a Training Stage, where all agents are fine-tuned on curated, category-specific data to enhance task specialization. Evaluation results show that RevAgent significantly outperforms state-of-the-art PLM- and LLM-based baselines, with improvements of 12.90%, 10.87%, 6.32%, and 8.57% on BLEU, ROUGE-L, METEOR, and SBERT, respectively. It also achieves higher accuracy in issue-category identification, particularly in challenging scenarios. Human evaluations further validate the practicality of RevAgent in generating accurate, readable, and context-aware review comments. Moreover, RevAgent delivers a favorable trade-off between performance and efficiency.
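The two inference-time stages described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ask` helper stands in for an LLM call, the five issue categories are placeholders (the paper's actual category set is not given here), and the critic is a trivial stub.

```python
# Hypothetical issue categories; RevAgent uses five category-specific agents,
# but the exact labels here are illustrative.
ISSUE_CATEGORIES = ["bug fix", "refactoring", "logic", "style", "documentation"]

def ask(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string in this sketch."""
    return f"draft comment for: {prompt.splitlines()[0]}"

def generation_stage(diff: str) -> list[tuple[str, str]]:
    """Stage 1: each category-specific commentator agent drafts one candidate."""
    candidates = []
    for category in ISSUE_CATEGORIES:
        prompt = f"As a reviewer focused on {category} issues, comment on:\n{diff}"
        candidates.append((category, ask(prompt)))
    return candidates

def discrimination_stage(candidates: list[tuple[str, str]]) -> tuple[str, str]:
    """Stage 2: a critic agent selects the best issue-comment pair.
    The real critic is a fine-tuned LLM; this stub keeps the first candidate."""
    return candidates[0]

diff = "- if x = 0:\n+ if x == 0:"
candidates = generation_stage(diff)
category, comment = discrimination_stage(candidates)
```

The third stage (fine-tuning each agent on curated, category-specific data) happens offline and is not shown.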
Related papers
- From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning [12.024430772980502]
We introduce an agent-centric benchmarking paradigm for evaluating large language models. A teacher agent generates candidate problems, and an orchestrator agent rigorously verifies their validity and guards against adversarial attacks. If a student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants.
arXiv Detail & Related papers (2026-02-27T06:54:32Z)
- VeRO: An Evaluation Harness for Agents to Optimize Agents [5.227525836910522]
We introduce VERO (Versioning, Rewards, and Observations), which provides a reproducible evaluation harness with versioned agent snapshots, budget-controlled evaluation, and structured execution traces. We conduct an empirical study comparing target agents and analyzing which modifications reliably improve target agent performance.
arXiv Detail & Related papers (2026-02-25T23:40:22Z)
- The Necessity of a Unified Framework for LLM-Based Agent Evaluation [46.631678638677386]
General-purpose agents have seen fundamental advancements. Evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation.
arXiv Detail & Related papers (2026-02-03T08:18:37Z)
- Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation [47.85891728056131]
PRDBench is a novel benchmark comprising 50 real-world Python projects across 20 domains, each with structured Product Requirement Document (PRD) requirements, comprehensive evaluation criteria, and reference implementations. We employ an Agent-as-a-Judge paradigm to score agent outputs, enabling the evaluation of various test types beyond unit tests.
arXiv Detail & Related papers (2025-10-28T12:26:45Z)
- Agent4FaceForgery: Multi-Agent LLM Framework for Realistic Face Forgery Detection [108.5042835056188]
This work introduces Agent4FaceForgery to address two fundamental problems: how to capture the diverse intents and iterative processes of human forgery creation, and how to model the complex, often adversarial, text-image interactions that accompany forgeries on social media.
arXiv Detail & Related papers (2025-09-16T01:05:01Z)
- VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
- Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling. It comprises four distinct small-scale agents with clearly defined roles and effective collaboration. It shows superior performance at a smaller parameter scale without sacrificing ability on general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
- SemAgent: A Semantics Aware Program Repair Agent [14.80363334219173]
SemAgent is a novel workflow-based procedure that leverages issue, code, and execution semantics to generate patches that are complete. We achieve this through a novel pipeline that (a) leverages execution semantics to retrieve relevant context, (b) comprehends issue semantics via generalized abstraction, and (c) isolates code semantics within the context of this abstraction. Our evaluations show that our methodology achieves a solve rate of 44.66% on the SWEBench-Lite benchmark, beating all other workflow-based approaches, with an absolute improvement of 7.66% over our baseline.
arXiv Detail & Related papers (2025-06-19T23:27:58Z)
- On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows [71.92083784393418]
Agentic AI systems (those that autonomously plan and act) are becoming widespread, yet their success rate on complex tasks remains low. Inference-time alignment relies on three components: sampling, evaluation, and feedback. We introduce Iterative Agent Decoding (IAD), a procedure that repeatedly incorporates feedback extracted from different forms of critiques.
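The sampling/evaluation/feedback loop named in this summary can be sketched roughly as below; the `generate` and `evaluate` stubs stand in for LLM calls, and the loop structure is a generic illustration, not IAD's actual algorithm.

```python
def generate(task: str, feedback: str) -> str:
    """Stand-in for the LLM generator; revises its output once it gets feedback."""
    return task + (" (revised)" if feedback else " (first draft)")

def evaluate(output: str) -> tuple[float, str]:
    """Stand-in for the critic: returns a score plus natural-language feedback."""
    if "(revised)" in output:
        return 1.0, ""
    return 0.5, "tighten the answer"

def iterative_decode(task: str, max_rounds: int = 3) -> str:
    """Sample, evaluate, feed the critique back, until the critic is satisfied."""
    feedback, output = "", ""
    for _ in range(max_rounds):
        output = generate(task, feedback)   # sampling
        score, feedback = evaluate(output)  # evaluation
        if score >= 1.0:                    # stop once the critique is empty
            break
    return output

result = iterative_decode("summarize the diff")
```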
arXiv Detail & Related papers (2025-04-02T17:40:47Z)
- MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation [34.66546005629471]
Large Language Models (LLMs) are essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents.
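The collaborative filter-and-score idea can be sketched as follows; the term-overlap "judges" and batch-relative threshold here are crude stand-ins for MAIN-RAG's LLM agents and are purely illustrative.

```python
from statistics import mean

def agent_scores(doc: str, query: str, n_agents: int = 3) -> list[float]:
    """Stand-in for n LLM judges; each 'agent' here uses crude term overlap
    (real agents would score independently and disagree)."""
    terms = set(query.lower().split())
    overlap = len(terms & set(doc.lower().split())) / max(len(terms), 1)
    return [overlap] * n_agents

def filter_documents(docs: list[str], query: str) -> list[str]:
    """Keep documents whose mean judge score clears a batch-relative threshold."""
    scored = [(doc, mean(agent_scores(doc, query))) for doc in docs]
    threshold = mean(s for _, s in scored)  # adaptive: relative to this batch
    return [doc for doc, s in scored if s >= threshold]

docs = ["python list sort stable", "banana bread recipe", "sorting in python"]
kept = filter_documents(docs, "python sorting")
```

Only the documents that survive filtering would then be passed to the generator as grounding context.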
arXiv Detail & Related papers (2024-12-31T08:07:26Z)
- AEGIS: An Agent-based Framework for General Bug Reproduction from Issue Descriptions [10.686849324750556]
AEGIS (Automated gEneral buG reproductIon Scripts generation framework) is the first agent-based framework for this task. AEGIS improves the relative resolved rate of Agentless by 12.5%.
arXiv Detail & Related papers (2024-11-27T03:16:47Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-form text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.