Contextual Code Retrieval for Commit Message Generation: A Preliminary Study
- URL: http://arxiv.org/abs/2507.17690v1
- Date: Wed, 23 Jul 2025 16:54:57 GMT
- Title: Contextual Code Retrieval for Commit Message Generation: A Preliminary Study
- Authors: Bo Xiong, Linghao Zhang, Chong Wang, Peng Liang,
- Abstract summary: A commit message describes the main code changes in a commit and plays a crucial role in software maintenance.<n>Existing commit message generation approaches typically frame it as a direct mapping which inputs a code diff and produces a brief descriptive sentence as output.<n>We argue that relying solely on the code diff is insufficient, as raw code diff fails to capture the full context needed for generating high-quality commit messages.
- Score: 18.46986692375691
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A commit message describes the main code changes in a commit and plays a crucial role in software maintenance. Existing commit message generation (CMG) approaches typically frame it as a direct mapping which inputs a code diff and produces a brief descriptive sentence as output. However, we argue that relying solely on the code diff is insufficient, as raw code diff fails to capture the full context needed for generating high-quality and informative commit messages. In this paper, we propose a contextual code retrieval-based method called C3Gen to enhance CMG by retrieving commit-relevant code snippets from the repository and incorporating them into the model input to provide richer contextual information at the repository scope. In the experiments, we evaluated the effectiveness of C3Gen across various models using four objective and three subjective metrics. Meanwhile, we design and conduct a human evaluation to investigate how C3Gen-generated commit messages are perceived by human developers. The results show that by incorporating contextual code into the input, C3Gen enables models to effectively leverage additional information to generate more comprehensive and informative commit messages with greater practical value in real-world development scenarios. Further analysis underscores concerns about the reliability of similaritybased metrics and provides empirical insights for CMG.
Related papers
- Is Compression Really Linear with Code Intelligence? [60.123628177110206]
textitFormat Annealing is a lightweight, transparent training methodology designed to assess the intrinsic capabilities of pre-trained models equitably.<n>Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and bits-per-character (BPC)<n>Our work provides a more nuanced understanding of compression's role in developing code intelligence and contributes a robust evaluation framework in the code domain.
arXiv Detail & Related papers (2025-05-16T16:59:14Z) - Contextualized Data-Wrangling Code Generation in Computational Notebooks [131.26365849822932]
We propose an automated approach, CoCoMine, to mine data-wrangling code generation examples with clear multi-modal contextual dependency.
We construct CoCoNote, a dataset containing 58,221 examples for Contextualized Data-wrangling Code generation in Notebooks.
Experiment results demonstrate the significance of incorporating data context in data-wrangling code generation.
arXiv Detail & Related papers (2024-09-20T14:49:51Z) - CodeRAG-Bench: Can Retrieval Augment Code Generation? [78.37076502395699]
We conduct a systematic, large-scale analysis of code generation using retrieval-augmented generation.<n>We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks.<n>We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources.
arXiv Detail & Related papers (2024-06-20T16:59:52Z) - COMET: Generating Commit Messages using Delta Graph Context
Representation [2.5899040911480182]
Commit messages explain code changes in a commit and facilitate collaboration among developers.
We propose Comet, a novel approach that captures context of code changes using a graph-based representation.
Tests show Comet outperforms state-of-the-art techniques in terms of bleu-norm and meteor metrics.
arXiv Detail & Related papers (2024-02-02T19:01:52Z) - Commit Messages in the Age of Large Language Models [0.9217021281095906]
We evaluate the performance of OpenAI's ChatGPT for generating commit messages based on code changes.
We compare the results obtained with ChatGPT to previous automatic commit message generation methods that have been trained specifically on commit data.
arXiv Detail & Related papers (2024-01-31T06:47:12Z) - From Commit Message Generation to History-Aware Commit Message
Completion [49.175498083165884]
We argue that if we could shift the focus from commit message generation to commit message completion, we could significantly improve the quality and the personal nature of the resulting commit messages.
Since the existing datasets lack historical data, we collect and share a novel dataset called CommitChronicle, containing 10.7M commits across 20 programming languages.
Our results show that in some contexts, commit message completion shows better results than generation, and that while in general GPT-3.5-turbo performs worse, it shows potential for long and detailed messages.
arXiv Detail & Related papers (2023-08-15T09:10:49Z) - Delving into Commit-Issue Correlation to Enhance Commit Message
Generation Models [13.605167159285374]
Commit message generation is a challenging task in automated software engineering.
tool is a novel paradigm that can introduce the correlation between commits and issues into the training phase of models.
The results show that compared with the original models, the performance of tool-enhanced models is significantly improved.
arXiv Detail & Related papers (2023-07-31T20:35:00Z) - ReCode: Robustness Evaluation of Code Generation Models [90.10436771217243]
We propose ReCode, a comprehensive robustness evaluation benchmark for code generation models.
We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format.
With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt.
arXiv Detail & Related papers (2022-12-20T14:11:31Z) - ECMG: Exemplar-based Commit Message Generation [45.54414179533286]
Commit messages concisely describe the content of code diffs (i.e., code changes) and the intent behind them.
The information retrieval-based methods reuse the commit messages of similar code diffs, while the neural-based methods learn the semantic connection between code diffs and commit messages.
We propose a novel exemplar-based neural commit message generation model, which treats the similar commit message as an exemplar and leverages it to guide the neural network model to generate an accurate commit message.
arXiv Detail & Related papers (2022-03-05T10:55:15Z) - CoreGen: Contextualized Code Representation Learning for Commit Message
Generation [39.383390029545865]
We propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen)
Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score.
arXiv Detail & Related papers (2020-07-14T09:43:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.