On the Evaluation of Commit Message Generation Models: An Experimental Study
- URL: http://arxiv.org/abs/2107.05373v2
- Date: Tue, 13 Jul 2021 02:04:53 GMT
- Title: On the Evaluation of Commit Message Generation Models: An Experimental Study
- Authors: Wei Tao, Yanlin Wang, Ensheng Shi, Lun Du, Hongyu Zhang, Dongmei Zhang, Wenqiang Zhang
- Abstract summary: Commit messages are natural language descriptions of code changes, which are important for program understanding and maintenance.
Various approaches utilizing generation or retrieval techniques have been proposed to automatically generate commit messages.
This paper conducts a systematic and in-depth analysis of the state-of-the-art models and datasets.
- Score: 33.19314967188712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Commit messages are natural language descriptions of code changes, which are
important for program understanding and maintenance. However, writing commit
messages manually is time-consuming and laborious, especially when the code is
updated frequently. Various approaches utilizing generation or retrieval
techniques have been proposed to automatically generate commit messages. To
achieve a better understanding of how the existing approaches perform in
solving this problem, this paper conducts a systematic and in-depth analysis of
the state-of-the-art models and datasets. We find that: (1) Different variants
of the BLEU metric are used in previous works, which affects the evaluation and
understanding of existing methods. (2) Most existing datasets are crawled only
from Java repositories while repositories in other programming languages are
not sufficiently explored. (3) Dataset splitting strategies can influence the
performance of existing models by a large margin. Some models show better
performance when the datasets are split by commit, while other models perform
better when the datasets are split by timestamp or by project. Based on our
findings, we conduct a human evaluation and find the BLEU metric that best
correlates with the human scores for the task. We also collect a large-scale,
information-rich, and multi-language commit message dataset MCMD and evaluate
existing models on this dataset. Furthermore, we conduct extensive experiments
under different dataset splitting strategies and suggest suitable models for
different scenarios. Based on the experimental results and findings, we
provide feasible suggestions for comprehensively evaluating commit message
generation models and discuss possible future research directions. We believe
this work can help practitioners and researchers better evaluate and select
models for automatic commit message generation.
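Finding (1) above is that different BLEU variants (e.g., with or without smoothing, sentence- vs. corpus-level) can paint different pictures of the same model. The sketch below is not the paper's exact metric implementations, only a minimal NLTK-based illustration of how such implementation choices change the score for a single hypothetical commit message pair:

```python
# Minimal sketch (assumes nltk is installed); the reference/hypothesis pair and
# the specific smoothing choice are illustrative, not taken from the paper.
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu

reference = "fix null pointer exception in user login handler".split()
hypothesis = "fix null pointer exception in login".split()

smooth = SmoothingFunction()

# Sentence-level BLEU-4 without smoothing: for very short messages this can
# collapse toward zero whenever a higher-order n-gram has no match.
b_plain = sentence_bleu([reference], hypothesis)

# Sentence-level BLEU-4 with one of NLTK's smoothing methods.
b_smooth = sentence_bleu([reference], hypothesis,
                         smoothing_function=smooth.method4)

# Corpus-level BLEU aggregates n-gram counts over the whole test set before
# computing precision, so it behaves differently again on real data.
b_corpus = corpus_bleu([[reference]], [hypothesis])

print(f"sentence BLEU (unsmoothed): {b_plain:.4f}")
print(f"sentence BLEU (smoothed):   {b_smooth:.4f}")
print(f"corpus BLEU:                {b_corpus:.4f}")
```

With only one sentence, the corpus-level and unsmoothed sentence-level scores coincide; over a full test set of short commit messages the variants can diverge noticeably, which is why the paper checks which variant correlates best with human judgments.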
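Finding (3) concerns dataset splitting. The following is a minimal sketch, assuming commits are simple records with hypothetical "project" and "timestamp" fields (not the paper's released tooling), of the three strategies the paper contrasts:

```python
# Hypothetical commit records: {"project": ..., "timestamp": ..., "diff": ..., "message": ...}
import random


def split_by_commit(commits, train_ratio=0.8, seed=0):
    """Random split at the commit level: commits from the same project and
    time period can land in both train and test."""
    shuffled = commits[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def split_by_timestamp(commits, train_ratio=0.8):
    """Chronological split: train on older commits, test on newer ones, so no
    information from the future leaks into training."""
    ordered = sorted(commits, key=lambda c: c["timestamp"])
    cut = int(len(ordered) * train_ratio)
    return ordered[:cut], ordered[cut:]


def split_by_project(commits, train_ratio=0.8, seed=0):
    """Project-level split: whole repositories are held out, testing whether a
    model generalizes to projects it has never seen."""
    projects = sorted({c["project"] for c in commits})
    random.Random(seed).shuffle(projects)
    train_projects = set(projects[: int(len(projects) * train_ratio)])
    train = [c for c in commits if c["project"] in train_projects]
    test = [c for c in commits if c["project"] not in train_projects]
    return train, test
```

Which strategy is appropriate depends on the deployment scenario: split-by-project is the strictest test of generalization to unseen repositories, while split-by-timestamp mimics using a model trained on past commits to describe future ones.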
Related papers
- P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
Large language models (LLMs) showcase varied multilingual capabilities across tasks like translation, code generation, and reasoning.
Previous assessments often limited their scope to fundamental natural language processing (NLP) or isolated capability-specific tasks.
We present a pipeline for selecting available and reasonable benchmarks from the massive pool of existing ones, addressing the oversight in previous work regarding the utility of these benchmarks.
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets.
arXiv Detail & Related papers (2024-11-14T01:29:36Z)
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists [41.94295877935867]
We present a benchmark for large language models designed to tackle one of the most knowledge-intensive tasks in data science.
We demonstrate that our proposed FeatEng benchmark can cheaply and efficiently assess the broad capabilities of LLMs.
arXiv Detail & Related papers (2024-10-30T17:59:01Z)
- EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
arXiv Detail & Related papers (2024-10-03T05:43:24Z)
- DiscoveryBench: Towards Data-Driven Discovery with Large Language Models [50.36636396660163]
We present DiscoveryBench, the first comprehensive benchmark that formalizes the multi-step process of data-driven discovery.
Our benchmark contains 264 tasks collected across 6 diverse domains, such as sociology and engineering.
Our benchmark, thus, illustrates the challenges in autonomous data-driven discovery and serves as a valuable resource for the community to make progress.
arXiv Detail & Related papers (2024-07-01T18:58:22Z)
- CommitBench: A Benchmark for Commit Message Generation [22.03783968903916]
We show that existing datasets exhibit various problems, such as poor quality in how commits are selected.
We compile a new large-scale dataset, CommitBench, adopting best practices for dataset creation.
We use CommitBench to compare existing models and show that a Transformer model pretrained on source code outperforms other approaches.
arXiv Detail & Related papers (2024-03-08T09:56:45Z)
- TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models [20.09470051458651]
We introduce TrueTeacher, a method for generating synthetic data by annotating diverse model-generated summaries.
Unlike prior work, TrueTeacher does not rely on human-written summaries, and is multilingual by nature.
arXiv Detail & Related papers (2023-05-18T17:58:35Z)
- DORE: Document Ordered Relation Extraction based on Generative Framework [56.537386636819626]
This paper investigates the root cause of the underwhelming performance of the existing generative DocRE models.
We propose to generate a symbolic and ordered sequence from the relation matrix, which is deterministic and easier for the model to learn.
Experimental results on four datasets show that our proposed method can improve the performance of the generative DocRE models.
arXiv Detail & Related papers (2022-10-28T11:18:10Z)
- Self-augmented Data Selection for Few-shot Dialogue Generation [18.794770678708637]
We adopt the self-training framework to deal with the few-shot MR-to-Text generation problem.
We propose a novel data selection strategy to select the data that our generation model is most uncertain about.
arXiv Detail & Related papers (2022-05-19T16:25:50Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.