Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings
- URL: http://arxiv.org/abs/2410.12046v1
- Date: Tue, 15 Oct 2024 20:32:07 GMT
- Title: Towards Realistic Evaluation of Commit Message Generation by Matching Online and Offline Settings
- Authors: Petr Tsvetkov, Aleksandra Eliseeva, Danny Dig, Alexander Bezzubov, Yaroslav Golubev, Timofey Bryksin, Yaroslav Zharov,
- Abstract summary: Commit message generation is a crucial task in software engineering that is challenging to evaluate correctly.
We use an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments.
Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation.
- Score: 77.20838441870151
- License:
- Abstract: Commit message generation (CMG) is a crucial task in software engineering that is challenging to evaluate correctly. When a CMG system is integrated into the IDEs and other products at JetBrains, we perform online evaluation based on user acceptance of the generated messages. However, performing online experiments with every change to a CMG system is troublesome, as each iteration affects users and requires time to collect enough statistics. On the other hand, offline evaluation, a prevalent approach in the research literature, facilitates fast experiments but employs automatic metrics that are not guaranteed to represent the preferences of real users. In this work, we describe a novel way we employed to deal with this problem at JetBrains, by leveraging an online metric - the number of edits users introduce before committing the generated messages to the VCS - to select metrics for offline experiments. To support this new type of evaluation, we develop a novel markup collection tool mimicking the real workflow with a CMG system, collect a dataset with 57 pairs consisting of commit messages generated by GPT-4 and their counterparts edited by human experts, and design and verify a way to synthetically extend such a dataset. Then, we use the final dataset of 656 pairs to study how the widely used similarity metrics correlate with the online metric reflecting the real users' experience. Our results indicate that edit distance exhibits the highest correlation, whereas commonly used similarity metrics such as BLEU and METEOR demonstrate low correlation. This contradicts the previous studies on similarity metrics for CMG, suggesting that user interactions with a CMG system in real-world settings differ significantly from the responses by human labelers operating within controlled research environments. We release all the code and the dataset for researchers: https://jb.gg/cmg-evaluation.
Related papers
- Improved Diversity-Promoting Collaborative Metric Learning for Recommendation [127.08043409083687]
Collaborative Metric Learning (CML) has recently emerged as a popular method in recommendation systems.
This paper focuses on a challenging scenario where a user has multiple categories of interests.
We propose a novel method called textitDiversity-Promoting Collaborative Metric Learning (DPCML)
arXiv Detail & Related papers (2024-09-02T07:44:48Z) - RAG-Enhanced Commit Message Generation [8.858678357308726]
Commit Message Generation has become a research hotspot.
It is time-consuming to write commit messages manually.
This paper proposes REACT, a REtrieval-Augmented framework for CommiT message generation.
arXiv Detail & Related papers (2024-06-08T16:24:24Z) - Using Large Language Models for Commit Message Generation: A Preliminary
Study [5.5784148764236114]
Large language models (LLMs) can be used to generate commit messages automatically and effectively.
In 78% of the 366 samples, the commit messages generated by LLMs were evaluated by humans as the best.
arXiv Detail & Related papers (2024-01-11T14:06:39Z) - Evaluation Metrics of Language Generation Models for Synthetic Traffic
Generation Tasks [22.629816738693254]
We show that common NLG metrics, like BLEU, are not suitable for evaluating Synthetic Traffic Generation (STG)
We propose and evaluate several metrics designed to compare the generated traffic to the distribution of real user texts.
arXiv Detail & Related papers (2023-11-21T11:26:26Z) - On Generative Agents in Recommendation [58.42840923200071]
Agent4Rec is a user simulator in recommendation based on Large Language Models.
Each agent interacts with personalized recommender models in a page-by-page manner.
arXiv Detail & Related papers (2023-10-16T06:41:16Z) - Pointwise Mutual Information Based Metric and Decoding Strategy for
Faithful Generation in Document Grounded Dialogs [17.691689809414843]
Existing metrics measure the degree of similarity between the generated response and the document's content.
We propose a new metric that utilizes (Conditional) Point-wise Mutual Information (PMI) between the generated response and the source document.
PMI quantifies the extent to which the document influences the generated response.
We build upon this idea to create a new decoding technique that incorporates PMI into the response generation process to predict more faithful responses.
arXiv Detail & Related papers (2023-05-20T13:34:34Z) - T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics [94.69907794006826]
We present a framework that combines the best of both worlds, using both supervised and unsupervised signals from whatever data we have available.
We operationalize this idea by training T5Score, a metric that uses these training signals with mT5 as the backbone.
T5Score achieves the best performance on all datasets against existing top-scoring metrics at the segment level.
arXiv Detail & Related papers (2022-12-12T06:29:04Z) - TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z) - Mining Implicit Entity Preference from User-Item Interaction Data for
Knowledge Graph Completion via Adversarial Learning [82.46332224556257]
We propose a novel adversarial learning approach by leveraging user interaction data for the Knowledge Graph Completion task.
Our generator is isolated from user interaction data, and serves to improve the performance of the discriminator.
To discover implicit entity preference of users, we design an elaborate collaborative learning algorithms based on graph neural networks.
arXiv Detail & Related papers (2020-03-28T05:47:33Z) - AliExpress Learning-To-Rank: Maximizing Online Model Performance without
Going Online [60.887637616379926]
This paper proposes an evaluator-generator framework for learning-to-rank.
It consists of an evaluator that generalizes to evaluate recommendations involving the context, and a generator that maximizes the evaluator score by reinforcement learning.
Our method achieves a significant improvement in terms of Conversion Rate (CR) over the industrial-level fine-tuned model in online A/B tests.
arXiv Detail & Related papers (2020-03-25T10:27:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.