On the Evaluation of Commit Message Generation Models: An Experimental
Study
- URL: http://arxiv.org/abs/2107.05373v2
- Date: Tue, 13 Jul 2021 02:04:53 GMT
- Title: On the Evaluation of Commit Message Generation Models: An Experimental
Study
- Authors: Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Hongyu Zhang, Dongmei
Zhang, Wenqiang Zhang
- Abstract summary: Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance.
Various approaches utilizing generation or retrieval techniques have been proposed to automatically generate commit messages.
This paper conducts a systematic and in-depth analysis of the state-of-the-art models and datasets.
- Score: 33.19314967188712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commit messages are natural language descriptions of code changes, which are
important for program understanding and maintenance. However, writing commit
messages manually is time-consuming and laborious, especially when the code is
updated frequently. Various approaches utilizing generation or retrieval
techniques have been proposed to automatically generate commit messages. To
achieve a better understanding of how the existing approaches perform in
solving this problem, this paper conducts a systematic and in-depth analysis of
the state-of-the-art models and datasets. We find that: (1) Different variants
of the BLEU metric are used in previous works, which affects the evaluation and
understanding of existing methods. (2) Most existing datasets are crawled only
from Java repositories while repositories in other programming languages are
not sufficiently explored. (3) Dataset splitting strategies can influence the
performance of existing models by a large margin. Some models show better
performance when the datasets are split by commit, while other models perform
better when the datasets are split by timestamp or by project. Based on our
findings, we conduct a human evaluation and identify the BLEU variant that best
correlates with human judgments for this task. We also collect MCMD, a
large-scale, information-rich, multi-language commit message dataset, and
evaluate existing models on it. Furthermore, we conduct extensive experiments
under different dataset splitting strategies and suggest suitable models
for different scenarios. Based on the experimental results and findings, we
provide feasible suggestions for comprehensively evaluating commit message
generation models and discuss possible future research directions. We believe
this work can help practitioners and researchers better evaluate and select
models for automatic commit message generation.
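To make two of the findings above concrete (sensitivity to the BLEU variant and to the dataset splitting strategy), the following is a minimal Python sketch, not the paper's actual evaluation code. The field names ('repo', 'timestamp', 'message'), the split ratio, and the particular NLTK smoothing methods are illustrative assumptions.

# Minimal sketch: BLEU-variant sensitivity and dataset splitting strategies.
# Assumes commit records are dicts with 'repo', 'timestamp', and 'message' keys
# (illustrative field names, not from the paper).
import random
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_variants(reference: str, hypothesis: str) -> dict:
    """Score one hypothesis against one reference under different BLEU smoothings."""
    ref, hyp = [reference.split()], hypothesis.split()
    sm = SmoothingFunction()
    return {
        "no_smoothing": sentence_bleu(ref, hyp, smoothing_function=sm.method0),
        "add_one": sentence_bleu(ref, hyp, smoothing_function=sm.method1),
        "nist_geometric": sentence_bleu(ref, hyp, smoothing_function=sm.method3),
    }

def split_dataset(commits: list, strategy: str, test_ratio: float = 0.2, seed: int = 0):
    """Split commit records into (train, test) by commit, timestamp, or project."""
    commits = list(commits)
    if strategy == "by_commit":
        # Random split over individual commits.
        random.Random(seed).shuffle(commits)
        cut = int(len(commits) * (1 - test_ratio))
        return commits[:cut], commits[cut:]
    if strategy == "by_timestamp":
        # Train on older commits, test on newer ones.
        commits.sort(key=lambda c: c["timestamp"])
        cut = int(len(commits) * (1 - test_ratio))
        return commits[:cut], commits[cut:]
    if strategy == "by_project":
        # Hold out entire repositories for testing.
        repos = sorted({c["repo"] for c in commits})
        random.Random(seed).shuffle(repos)
        held_out = set(repos[: max(1, int(len(repos) * test_ratio))])
        train = [c for c in commits if c["repo"] not in held_out]
        test = [c for c in commits if c["repo"] in held_out]
        return train, test
    raise ValueError(f"unknown strategy: {strategy}")

# Example usage: train, test = split_dataset(commits, "by_project")

The same set of generated messages can receive noticeably different scores depending on which variant bleu_variants reports, and the same model can rank differently depending on which split_dataset strategy produced the test set, which is why the paper argues for reporting these choices explicitly.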
Related papers
- Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists [41.94295877935867]
We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science.
We demonstrate that our proposed benchmark, FeatEng, can cheaply and efficiently assess the broad capabilities of LLMs.
arXiv Detail & Related papers (2024-10-30T17:59:01Z) - EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
arXiv Detail & Related papers (2024-10-03T05:43:24Z) - DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z) - CommitBench: A Benchmark for Commit Message Generation [22.03783968903916]
We show that existing datasets exhibit various problems, such as poor quality of commit selection.
We compile a new large-scale dataset, CommitBench, adopting best practices for dataset creation.
We use CommitBench to compare existing models and show that a Transformer model pretrained on source code outperforms the other approaches.
arXiv Detail & Related papers (2024-03-08T09:56:45Z) - TrueTeacher: Learning Factual Consistency Evaluation with Large Language
Models [20.09470051458651]
We introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries.
Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature.
arXiv Detail & Related papers (2023-05-18T17:58:35Z) - DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix which is deterministic and easier for model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z) - GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
GEMv2, the Generation, Evaluation, and Metrics Benchmark, introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z) - Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z) - When Can Models Learn From Explanations? A Formal Framework for
Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)