Related papers: LLMCup: Ranking-Enhanced Comment Updating with LLMs

LLMCup: Ranking-Enhanced Comment Updating with LLMs

URL: http://arxiv.org/abs/2507.08671v1
Date: Fri, 11 Jul 2025 15:11:27 GMT
Title: LLMCup: Ranking-Enhanced Comment Updating with LLMs
Authors: Hua Ge, Juan Zhai, Minxue Pan, Fusen He, Ziyue Tan,
Abstract summary: Large language models (LLMs) have shown impressive performance in software engineering tasks such as comment generation, code synthesis, and program repair.<n>We propose a novel comment updating framework, LLMCup, which first uses multiple prompt strategies to provide diverse candidate updated comments via an LLM, and then employs a ranking model, CupRank, to select the best candidate as final updated comment.
Score: 8.12420131928042
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While comments are essential for enhancing code readability and maintainability in modern software projects, developers are often motivated to update code but not comments, leading to outdated or inconsistent documentation that hinders future understanding and maintenance. Recent approaches such as CUP and HebCup have attempted automatic comment updating using neural sequence-to-sequence models and heuristic rules, respectively. However, these methods can miss or misinterpret crucial information during comment updating, resulting in inaccurate comments, and they often struggle with complex update scenarios. Given these challenges, a promising direction lies in leveraging large language models (LLMs), which have shown impressive performance in software engineering tasks such as comment generation, code synthesis, and program repair. This suggests their strong potential to capture the logic behind code modifications - an ability that is crucial for the task of comment updating. Nevertheless, selecting an appropriate prompt strategy for an LLM on each update case remains challenging. To address this, we propose a novel comment updating framework, LLMCup, which first uses multiple prompt strategies to provide diverse candidate updated comments via an LLM, and then employs a ranking model, CupRank, to select the best candidate as final updated comment. Experimental results demonstrate the effectiveness of LLMCup, with improvements over state-of-the-art baselines (CUP and HebCup) by 49.0%-116.9% in Accuracy, 10.8%-20% in BLEU-4, 4.6% in METEOR, 0.9%-1.9% in F1, and 2.1%-3.4% in SentenceBert similarity. Furthermore, a user study shows that comments updated by LLMCup sometimes surpass human-written updates, highlighting the importance of incorporating human evaluation in comment quality assessment.

Related papers

Readability-Robust Code Summarization via Meta Curriculum Learning [53.44612630063336]
In the real world, code is often poorly structured or obfuscated, significantly degrading model performance.<n>We propose RoFTCodeSum, a novel fine-tuning method that enhances the robustness of code summarization against poorly readable code.
arXiv Detail & Related papers (2026-01-09T02:38:24Z)
Larger Is Not Always Better: Leveraging Structured Code Diffs for Comment Inconsistency Detection [3.0208923532626444]
Comment inconsistency arises when developers modify code but neglect to update the corresponding comments.<n>Recent approaches to code-comment inconsistency (CCI) detection leverage Large Language Models (LLMs)<n>We propose a Just-In-Time CCI detection approach built upon the CodeT5+ backbone.
arXiv Detail & Related papers (2025-12-22T21:17:31Z)
LAURA: Enhancing Code Review Generation with Context-Enriched Retrieval-Augmented LLM [17.54065758880181]
This paper proposes an LLM-based review knowledge-augmented, context-aware framework for code review generation, named LAURA.<n>The framework integrates review retrieval, context augmentation, and systematic guidance to enhance the performance of ChatGPT-4o and DeepSeek v3 in generating code review comments.
arXiv Detail & Related papers (2025-12-01T07:10:23Z)
Function-to-Style Guidance of LLMs for Code Translation [59.487054943812836]
We propose F2STrans, a function-to-style guiding paradigm designed to improve the performance of large language models in code translation.<n>Our approach comprises two key stages: (1) Functional learning, which optimize translation correctness using high-quality source-target code pairs.<n>We introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations.
arXiv Detail & Related papers (2025-07-15T08:25:02Z)
Turning the Tide: Repository-based Code Reflection [52.13709676656648]
We introduce LiveRepoReflection, a benchmark for evaluating code understanding and generation in multi-file repository contexts.<n>1,888 rigorously filtered test cases across $6$ programming languages to ensure diversity, correctness, and high difficulty.<n>We also create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources.
arXiv Detail & Related papers (2025-07-14T02:36:27Z)
Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks [18.4734091090676]
We investigate whether replacing human-written comments with large language models improves pre-training datasets.<n>Results show that LLM-generated comments are more semantically consistent with code than human-written ones.
arXiv Detail & Related papers (2025-04-28T03:16:34Z)
ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning.<n>This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation.<n>Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z)
Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation [5.6001617185032595]
Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks. We fine-tune open-source Large language models (LLM) in parameter-efficient, quantized low-rank fashion on consumer-grade hardware to improve review comment generation.
arXiv Detail & Related papers (2024-11-15T12:01:38Z)
AI-powered Code Review with LLMs: Early Results [10.37036924997437]
We present a novel approach to improving software quality and efficiency through a Large Language Model (LLM)-based model. Our proposed LLM-based AI agent model is trained on large code repositories. It aims to detect code smells, identify potential bugs, provide suggestions for improvement, and optimize the code.
arXiv Detail & Related papers (2024-04-29T08:27:50Z)
Automating Patch Set Generation from Code Review Comments Using Large Language Models [2.045040820541428]
We provide code contexts to five popular Large Language Models (LLMs) We obtain the suggested code-changes (patch sets) derived from real-world code-review comments. The performance of each model is meticulously assessed by comparing their generated patch sets against the historical data of human-generated patch-sets.
arXiv Detail & Related papers (2024-04-10T02:46:08Z)
Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z)
Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy [68.31760483418901]
Large Language Models(LLMs) struggle with providing current information due to the outdated pre-training data. Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in generalizability of new information. We identify the core challenge behind these drawbacks: the LM-logical discrepancy featuring the difference between language modeling probabilities and logical probabilities.
arXiv Detail & Related papers (2023-05-29T19:48:37Z)
Automatically Recommend Code Updates: Are We There Yet? [14.997510035210842]
We present the first evaluation of state-of-the-art CodeLMs for automatically recommending code updates. Our results reveal that while CodeLMs perform well in settings that ignore temporal information, they struggle in more realistic time-wise scenarios. Our findings highlight the significant gap between the perceived and actual effectiveness of CodeLMs for real-world code update recommendation.
arXiv Detail & Related papers (2022-09-15T05:07:25Z)
CoSQA: 20,000+ Web Queries for Code Search and Question Answering [63.92224685262063]
CoSQA dataset includes 20,604 labels for pairs of natural language queries and codes. We introduce a contrastive learning method dubbed CoCLR to enhance query-code matching. We show that evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1%.
arXiv Detail & Related papers (2021-05-27T15:37:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.