Evaluating Code Summarization Techniques: A New Metric and an Empirical
Characterization
- URL: http://arxiv.org/abs/2312.15475v1
- Date: Sun, 24 Dec 2023 13:12:39 GMT
- Title: Evaluating Code Summarization Techniques: A New Metric and an Empirical
Characterization
- Authors: Antonio Mastropaolo, Matteo Ciniselli, Massimiliano Di Penta, Gabriele
Bavota
- Abstract summary: We investigate the complementarity of different types of metrics in capturing the quality of a generated summary.
We present a new metric, based on contrastive learning, that captures the extent to which a generated summary aligns with the semantics of the documented code.
- Score: 16.127739014966487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several code summarization techniques have been proposed in the literature to
automatically document a code snippet or a function. Ideally, software
developers should be involved in assessing the quality of the generated
summaries. However, in most cases, researchers rely on automatic evaluation
metrics such as BLEU, ROUGE, and METEOR. These metrics are all based on the
same assumption: The higher the textual similarity between the generated
summary and a reference summary written by developers, the higher its quality.
However, there are two reasons for which this assumption falls short: (i)
reference summaries, e.g., code comments collected by mining software
repositories, may be of low quality or even outdated; (ii) generated summaries,
while using a different wording than a reference one, could be semantically
equivalent to it, thus still being suitable to document the code snippet. In
this paper, we perform a thorough empirical investigation on the
complementarity of different types of metrics in capturing the quality of a
generated summary. Also, we propose to address the limitations of existing
metrics by considering a new dimension, capturing the extent to which the
generated summary aligns with the semantics of the documented code snippet,
independently from the reference summary. To this end, we present a new metric
based on contrastive learning to capture said aspect. We empirically show that
the inclusion of this novel dimension enables a more effective representation
of developers' evaluations regarding the quality of automatically generated
summaries.
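As a rough illustration of the reference-free dimension described above, the sketch below scores a generated summary by the cosine similarity between an embedding of the code snippet and an embedding of the summary. The generic bi-encoder checkpoint is an off-the-shelf stand-in chosen for illustration, not the contrastive model actually trained in the paper.
```python
# Minimal sketch of a reference-free "semantic alignment" score between a code
# snippet and a generated summary. The generic SentenceTransformer checkpoint
# below is an illustrative stand-in, not the authors' contrastive model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf encoder

def alignment_score(code_snippet: str, generated_summary: str) -> float:
    """Cosine similarity between code and summary embeddings, in [-1, 1]."""
    code_emb = model.encode(code_snippet, convert_to_tensor=True)
    summ_emb = model.encode(generated_summary, convert_to_tensor=True)
    return util.cos_sim(code_emb, summ_emb).item()

if __name__ == "__main__":
    code = "def add(a, b):\n    return a + b"
    summary = "Returns the sum of the two input values."
    print(f"alignment = {alignment_score(code, summary):.3f}")
```
Such a score can be reported alongside reference-based metrics like BLEU or ROUGE, which is the kind of complementarity the paper investigates.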
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
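A minimal sketch of the sentence-level comparison, assuming an off-the-shelf Sentence-BERT encoder and a mean-of-best-matches aggregation (the exact SBERTScore formulation may differ):
```python
# Sketch of sentence-level similarity between a summary and its source document.
# Each summary sentence is scored by its best-matching source sentence; the
# aggregation (mean of per-sentence maxima) is an assumption for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def sentence_level_score(summary_sents: list[str], source_sents: list[str]) -> float:
    summ_emb = model.encode(summary_sents, convert_to_tensor=True)
    src_emb = model.encode(source_sents, convert_to_tensor=True)
    sims = util.cos_sim(summ_emb, src_emb)  # shape: [n_summary, n_source]
    return sims.max(dim=1).values.mean().item()
```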
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study whether reference-free metrics have any deficiencies.
We employ GPT-4V as an evaluative tool to assess generated sentences, and the results reveal that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the set of references.
We conduct experiments to empirically demonstrate that diversifying the expression of the references can significantly enhance the correlation between automatic and human evaluation.
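The core idea (score the candidate against several diversified references and keep the best match) can be sketched with sentence-level BLEU; the paraphrased references below are hand-written placeholders rather than the LLM-generated ones used in the paper.
```python
# Sketch of multi-reference evaluation in the spirit of Div-Ref: score the
# candidate against every diversified reference and keep the best score.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

candidate = "returns the sum of two numbers".split()
references = [  # illustrative paraphrases of a single reference summary
    "adds two numbers and returns the result".split(),
    "computes the sum of the two inputs".split(),
    "returns the addition of two values".split(),
]

smooth = SmoothingFunction().method1
best = max(sentence_bleu([ref], candidate, smoothing_function=smooth) for ref in references)
print(f"best-of-references BLEU = {best:.3f}")
```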
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that the system-level correlations of our proposed metric with a model-based matching function outperform those of all competing metrics.
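A minimal sketch of the sentence-as-unit idea, with a simple token-overlap similarity standing in for the model-based soft-matching function:
```python
# Sketch of sentence-level soft matching: sentences are the matching units and a
# soft similarity replaces exact token matching. The Jaccard similarity below is
# a crude stand-in for a model-based matcher.

def soft_sim(s1: str, s2: str) -> float:
    """Jaccard overlap of lowercased token sets, in [0, 1]."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def sentence_level_f1(candidate_sents: list[str], reference_sents: list[str]) -> float:
    """Greedy best-match precision/recall over sentences, combined into F1."""
    precision = sum(max(soft_sim(c, r) for r in reference_sents)
                    for c in candidate_sents) / len(candidate_sents)
    recall = sum(max(soft_sim(r, c) for c in candidate_sents)
                 for r in reference_sents) / len(reference_sents)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```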
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- TRUE: Re-evaluating Factual Consistency Evaluation [29.888885917330327]
We introduce TRUE: a comprehensive study of factual consistency metrics on a standardized collection of existing texts from diverse tasks.
Our standardization enables an example-level meta-evaluation protocol that is more actionable and interpretable than previously reported correlations.
Across diverse state-of-the-art metrics and 11 datasets we find that large-scale NLI and question generation-and-answering-based approaches achieve strong and complementary results.
arXiv Detail & Related papers (2022-04-11T10:14:35Z)
- WIDAR -- Weighted Input Document Augmented ROUGE [26.123086537577155]
The proposed metric WIDAR is designed to adapt the evaluation score according to the quality of the reference summary.
The proposed metric correlates with human judgement scores better than ROUGE by 26%, 76%, 82%, and 15% in coherence, consistency, fluency, and relevance, respectively.
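One way to picture weighting by reference quality: estimate how well the reference reflects the input document, then blend reference-based and document-based ROUGE accordingly. The linear blend below is an assumption for illustration, not WIDAR's actual formula.
```python
# Sketch of a reference-quality-weighted ROUGE in the spirit of WIDAR.
# The linear blend is an illustrative assumption, not the metric's definition.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def weighted_rouge(document: str, reference: str, candidate: str) -> float:
    ref_quality = scorer.score(document, reference)["rouge1"].fmeasure   # reference vs. input
    cand_vs_ref = scorer.score(reference, candidate)["rouge1"].fmeasure  # candidate vs. reference
    cand_vs_doc = scorer.score(document, candidate)["rouge1"].fmeasure   # candidate vs. input
    return ref_quality * cand_vs_ref + (1.0 - ref_quality) * cand_vs_doc
```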
arXiv Detail & Related papers (2022-01-23T14:40:42Z)
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries [74.28810048824519]
We analyze the token alignments used by ROUGE and BERTScore to compare summaries.
We argue that their scores largely cannot be interpreted as measuring information overlap.
arXiv Detail & Related papers (2020-10-23T15:55:15Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
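A minimal sketch of the QA-based evaluation loop, assuming hand-written questions and gold answers in place of the question-generation step, and exact match in place of QAEval's answer verification:
```python
# Sketch of QA-based content evaluation: answer questions derived from the
# reference using the candidate summary as context, then check the answers.
# Hand-written questions stand in for a question-generation model, and exact
# match stands in for a learned answer-verification step.
from transformers import pipeline

qa = pipeline("question-answering")  # default extractive QA model

def qa_content_score(candidate_summary: str, questions_and_gold: list[tuple[str, str]]) -> float:
    hits = 0
    for question, gold in questions_and_gold:
        predicted = qa(question=question, context=candidate_summary)["answer"]
        hits += int(predicted.strip().lower() == gold.strip().lower())
    return hits / len(questions_and_gold)
```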
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
- SueNes: A Weakly Supervised Approach to Evaluating Single-Document Summarization via Negative Sampling [25.299937353444854]
We present a proof-of-concept study of a weakly supervised summary evaluation approach that does not require reference summaries.
Massive data in existing summarization datasets are transformed for training by pairing documents with corrupted reference summaries.
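A minimal sketch of the negative-sampling step: corrupt a reference summary (e.g., by dropping sentences or swapping in sentences from another document) so that a scorer can be trained to prefer the intact summary. The corruption probabilities are illustrative; the paper explores several mutation strategies.
```python
# Sketch of building a "corrupted" reference summary as a negative training
# example; probabilities and mutation choices are illustrative assumptions.
import random

def corrupt_summary(sentences: list[str], other_doc_sentences: list[str],
                    drop_prob: float = 0.3, swap_prob: float = 0.3, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    corrupted = []
    for sent in sentences:
        r = rng.random()
        if r < drop_prob:
            continue                                           # delete the sentence
        if r < drop_prob + swap_prob and other_doc_sentences:
            corrupted.append(rng.choice(other_doc_sentences))  # swap in unrelated text
        else:
            corrupted.append(sent)
    return corrupted or sentences[:1]  # never return an empty summary
```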
arXiv Detail & Related papers (2020-05-13T15:40:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.