Related papers: CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency

CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency

URL: http://arxiv.org/abs/2506.20558v1
Date: Wed, 25 Jun 2025 15:56:07 GMT
Title: CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency
Authors: Renyi Zhong, Yintong Huo, Wenwei Gu, Jinxi Kuang, Zhihan Jiang, Guangba Yu, Yichen Li, David Lo, Michael R. Lyu,
Abstract summary: Code-comment inconsistency (CCI) can negatively affect software development, testing, and maintenance.<n>We present an innovative end-to-end framework, CCIBench, designed to improve code quality by identifying and rectifying CCIs.
Score: 33.30328162446649
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Comments within code serve as a crucial foundation for software documentation, facilitating developers to communicate and understand the code effectively. However, code-comment inconsistency (CCI) can negatively affect software development, testing, and maintenance. Recent efforts to mitigate this issue have emerged, but existing studies often suffer from inaccurate datasets and inadequate solutions, weakening their practical effectiveness. In this study, we first conduct a quantitative analysis of existing datasets, revealing a substantial portion of sampled data are mislabeled. To address these data limitations, we introduce CCIBench, a refined dataset comprising high-quality data, to support the training and evaluation of method-level CCI methods. Furthermore, we present an innovative end-to-end LLM-based framework, CCISolver, designed to improve code quality by identifying and rectifying CCIs. Comprehensive evaluations demonstrate CCISolver's superior performance. For detection, it establishes a new state-of-the-art with an F1-score of 89.54%. In fixing task, it achieves a remarkable 18.84% relative improvement in GLEU score over the strongest baseline. This superiority is confirmed by human evaluation, where CCISolver's fixing success rate of 0.6533 significantly surpasses existing methods. Critically, in a practical end-to-end setting, CCISolver's innovative architecture is approximately 36% faster for inference than the baseline model, underscoring its scalability and real-world applicability.

Related papers

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.55869466207234]
Existing evaluation of Large Language Models (LLMs) on static benchmarks is vulnerable to data contamination and leaderboard overfitting.<n>We introduce LLMEval-3, a framework for dynamic evaluation of LLMs.<n>LLEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
arXiv Detail & Related papers (2025-08-07T14:46:30Z)
Refactoring $\ eq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis [54.361900378970134]
Just-in-time defect prediction (JIT-DP) aims to predict the likelihood of code changes resulting in software defects at an early stage.<n>Prior research has largely ignored code during both the evaluation and methodology phases, despite its prevalence.<n>We propose Code chAnge Tactics (CAT) analysis to categorize code and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%.
arXiv Detail & Related papers (2025-07-25T23:29:25Z)
LLMxCPG: Context-Aware Vulnerability Detection Through Code Property Graph-Guided Large Language Models [2.891351178680099]
This paper presents a novel framework integrating Code Property Graphs (CPG) with Large Language Models (LLM) for robust vulnerability detection.<n>Our approach's ability to provide a more concise and accurate representation of code snippets enables the analysis of larger code segments.<n> Empirical evaluation demonstrates LLMxCPG's effectiveness across verified datasets, achieving 15-40% improvements in F1-score over state-of-the-art baselines.
arXiv Detail & Related papers (2025-07-22T13:36:33Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks.<n>However, improvement is plateauing due to the exhaustion of readily available high-quality data.<n>We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
Enabling Scalable Oversight via Self-Evolving Critic [59.861013614500024]
SCRIT (Self-evolving CRITic) is a framework that enables genuine self-evolution of critique abilities.<n>It self-improves by training on synthetic data, generated by a contrastive-based self-critic.<n>It achieves up to a 10.3% improvement on critique-correction and error identification benchmarks.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding [7.855485779946983]
Calibrating language models (LMs) align their generation confidence with the actual likelihood of answer correctness. We propose an activation-based calibration method, ActCab, which trains a linear layer on top of the LM's last-layer activations. We also propose CoDec, a confidence-guided decoding strategy, to elicit truthful answers with high confidence from LMs.
arXiv Detail & Related papers (2024-06-19T05:33:34Z)
OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework [21.87740178652843]
Causal discovery offers a promising approach to improve transparency and reliability. We propose a flexible evaluation framework with metrics for evaluating differences in causal structures and causal effects. We introduce the Open Causal Discovery Benchmark (OCDB), based on real data, to promote fair comparisons and drive optimization of algorithms.
arXiv Detail & Related papers (2024-06-07T03:09:22Z)
Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection. It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z)
Overcoming Pitfalls in Graph Contrastive Learning Evaluation: Toward Comprehensive Benchmarks [60.82579717007963]
We introduce an enhanced evaluation framework designed to more accurately gauge the effectiveness, consistency, and overall capability of Graph Contrastive Learning (GCL) methods.
arXiv Detail & Related papers (2024-02-24T01:47:56Z)
A Self-Refinement Strategy for Noise Reduction in Grammatical Error Correction [54.569707226277735]
Existing approaches for grammatical error correction (GEC) rely on supervised learning with manually created GEC datasets. There is a non-negligible amount of "noise" where errors were inappropriately edited or left uncorrected. We propose a self-refinement method where the key idea is to denoise these datasets by leveraging the prediction consistency of existing models.
arXiv Detail & Related papers (2020-10-07T04:45:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.