On the Evaluation of Commit Message Generation Models: An Experimental Study
- URL: http://arxiv.org/abs/2107.05373v2
- Date: Tue, 13 Jul 2021 02:04:53 GMT
- Title: On the Evaluation of Commit Message Generation Models: An Experimental Study
- Authors: Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Hongyu Zhang, Dongmei Zhang, Wenqiang Zhang
- Abstract summary: Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance.
Various approaches utilizing generation or retrieval techniques have been proposed to automatically generate commit messages.
This paper conducts a systematic and in-depth analysis of the state-of-the-art models and datasets.
- Score: 33.19314967188712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commit messages are natural language descriptions of code changes, which are
important for program understanding and maintenance. However, writing commit
messages manually is time-consuming and laborious, especially when the code is
updated frequently. Various approaches utilizing generation or retrieval
techniques have been proposed to automatically generate commit messages. To
achieve a better understanding of how the existing approaches perform in
solving this problem, this paper conducts a systematic and in-depth analysis of
the state-of-the-art models and datasets. We find that: (1) Different variants
of the BLEU metric are used in previous works, which affects the evaluation and
understanding of existing methods. (2) Most existing datasets are crawled only
from Java repositories while repositories in other programming languages are
not sufficiently explored. (3) Dataset splitting strategies can influence the
performance of existing models by a large margin. Some models show better
performance when the datasets are split by commit, while other models perform
better when the datasets are split by timestamp or by project. Based on our
findings, we conduct a human evaluation and find the BLEU metric that best
correlates with the human scores for the task. We also collect a large-scale,
information-rich, and multi-language commit message dataset MCMD and evaluate
existing models on this dataset. Furthermore, we conduct extensive experiments
under different dataset splitting strategies and suggest suitable models for
different scenarios. Based on the experimental results and findings, we
provide feasible suggestions for comprehensively evaluating commit message
generation models and discuss possible future research directions. We believe
this work can help practitioners and researchers better evaluate and select
models for automatic commit message generation.
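Finding (1) above is that different BLEU variants (e.g., with or without smoothing, sentence- vs. corpus-level) can paint different pictures of the same model. The sketch below is not the paper's exact metric implementations, only a minimal NLTK-based illustration of how such implementation choices change the score for a single hypothetical commit message pair:

```python
# Minimal sketch (assumes nltk is installed); the reference/hypothesis pair and
# the specific smoothing choice are illustrative, not taken from the paper.
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu

reference = "fix null pointer exception in user login handler".split()
hypothesis = "fix null pointer exception in login".split()

smooth = SmoothingFunction()

# Sentence-level BLEU-4 without smoothing: for very short messages this can
# collapse toward zero whenever a higher-order n-gram has no match.
b_plain = sentence_bleu([reference], hypothesis)

# Sentence-level BLEU-4 with one of NLTK's smoothing methods.
b_smooth = sentence_bleu([reference], hypothesis,
                         smoothing_function=smooth.method4)

# Corpus-level BLEU aggregates n-gram counts over the whole test set before
# computing precision, so it behaves differently again on real data.
b_corpus = corpus_bleu([[reference]], [hypothesis])

print(f"sentence BLEU (unsmoothed): {b_plain:.4f}")
print(f"sentence BLEU (smoothed):   {b_smooth:.4f}")
print(f"corpus BLEU:                {b_corpus:.4f}")
```

With only one sentence, the corpus-level and unsmoothed sentence-level scores coincide; over a full test set of short commit messages the variants can diverge noticeably, which is why the paper checks which variant correlates best with human judgments.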
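Finding (3) concerns dataset splitting. The following is a minimal sketch, assuming commits are simple records with hypothetical "project" and "timestamp" fields (not the paper's released tooling), of the three strategies the paper contrasts:

```python
# Hypothetical commit records: {"project": ..., "timestamp": ..., "diff": ..., "message": ...}
import random


def split_by_commit(commits, train_ratio=0.8, seed=0):
    """Random split at the commit level: commits from the same project and
    time period can land in both train and test."""
    shuffled = commits[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def split_by_timestamp(commits, train_ratio=0.8):
    """Chronological split: train on older commits, test on newer ones, so no
    information from the future leaks into training."""
    ordered = sorted(commits, key=lambda c: c["timestamp"])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def split_by_project(commits, train_ratio=0.8, seed=0):
    """Project-level split: whole repositories are held out, testing whether a
    model generalizes to projects it has never seen."""
    projects = sorted({c["project"] for c in commits})
    random.Random(seed).shuffle(projects)
    train_projects = set(projects[: int(len(projects) * train_ratio)])
    train = [c for c in commits if c["project"] in train_projects]
    test = [c for c in commits if c["project"] not in train_projects]
    return train, test
```

Which strategy is appropriate depends on the deployment scenario: split-by-project is the strictest test of generalization to unseen repositories, while split-by-timestamp mimics using a model trained on past commits to describe future ones.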
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the massive pool of existing ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists [41.94295877935867]
We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science.
We demonstrate that our proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs.
arXiv Detail & Related papers (2024-10-30T17:59:01Z)
- EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
arXiv Detail & Related papers (2024-10-03T05:43:24Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- CommitBench: A Benchmark for Commit Message Generation [22.03783968903916]
We show that existing datasets exhibit various problems, such as poor quality in how commits are selected.
We compile a new large-scale dataset, CommitBench, adopting best practices for dataset creation.
We use CommitBench to compare existing models and show that a Transformer model pretrained on source code outperforms other approaches.
arXiv Detail & Related papers (2024-03-08T09:56:45Z)
- TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models [20.09470051458651]
We introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries.
Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature.
arXiv Detail & Related papers (2023-05-18T17:58:35Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
- Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.