Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?
- URL: http://arxiv.org/abs/2502.07611v1
- Date: Tue, 11 Feb 2025 15:02:19 GMT
- Title: Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?
- Authors: Antonio Vitale, Antonio Mastropaolo, Rocco Oliveto, Massimiliano Di Penta, Simone Scalabrino,
- Abstract summary: We explore the extent to which code-comment coherence, a specific quality attribute of code summaries, can be used to optimize code summarization datasets.
We examine multiple levels of training instances from two state-of-the-art datasets (TL-CodeSum and Funcom) and evaluate the resulting models on three manually curated test sets.
- Score: 11.865113785648932
- License:
- Abstract: Automated code summarization is a long-standing goal for code comprehension. This task automatically generates documentation using a given method. Deep Learning (DL)-based approaches have been proven beneficial for various software engineering (SE) tasks, including this one. Most state-of-the-art datasets for code summarization are automatically mined from GitHub and, thus, might contain erroneous or sub-optimal examples. Previous work showed that using a simple rule-based approach for removing noisy instances allows for a tangible reduction of the training set size while not reducing the effectiveness of the trained models. Motivated by this finding, we conjecture that it is possible to further reduce the dataset size by removing instances that contain different issues. In this paper, we explore the extent to which code-comment coherence, a specific quality attribute of code summaries, can be used to optimize code summarization datasets. Specifically, we hypothesize that removing incoherent code-comment pairs might positively impact the effectiveness of the models. To do this, we rely on SIDE, a recently introduced metric for code-summary coherence. We examine multiple selectivity levels of training instances from two state-of-the-art datasets (TL-CodeSum and Funcom) and evaluate the resulting models on three manually curated test sets. The results show that even halving the training set sizes does not significantly affect the model's ability to generate summaries. However, when comparing the most restrictive selection strategy with a simpler one that randomly selects the training instances, we observe that the resulting accuracy of the model also does not change. This result suggests that (i) current datasets contain many irrelevant examples, and (ii) different quality attributes should be explored for optimizing code summarization datasets.
Related papers
- Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment.
The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
arXiv Detail & Related papers (2024-05-03T13:42:49Z) - GistScore: Learning Better Representations for In-Context Example
Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts.
We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning.
We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z) - Rethinking Negative Pairs in Code Search [56.23857828689406]
We propose a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE.
We analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation.
arXiv Detail & Related papers (2023-10-12T06:32:42Z) - BaSAL: Size-Balanced Warm Start Active Learning for LiDAR Semantic
Segmentation [2.9290232815049926]
Existing active learning methods overlook the severe class imbalance inherent in LiDAR semantic segmentation datasets.
We propose BaSAL, a size-balanced warm start active learning model, based on the observation that each object class has a characteristic size.
Results show that we are able to improve the performance of the initial model by a large margin.
arXiv Detail & Related papers (2023-10-12T05:03:19Z) - Boosting Commit Classification with Contrastive Learning [0.8655526882770742]
Commit Classification (CC) is an important task in software maintenance.
We propose a contrastive learning-based commit classification framework.
Our framework can solve the CC problem simply but effectively in fewshot scenarios.
arXiv Detail & Related papers (2023-08-16T10:02:36Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstring for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z) - A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled.
We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples.
We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z) - Neural Code Summarization: How Far Are We? [30.324396716447602]
Deep learning techniques have been exploited to automatically generate summaries for given code snippets.
In this paper, we conduct a systematic and in-depth analysis of five state-of-the-art neural source code summarization models.
arXiv Detail & Related papers (2021-07-15T04:33:59Z) - How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets.
Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z) - The Devil is in Classification: A Simple Framework for Long-tail Object
Detection and Instance Segmentation [93.17367076148348]
We investigate performance drop of the state-of-the-art two-stage instance segmentation model Mask R-CNN on the recent long-tail LVIS dataset.
We unveil that a major cause is the inaccurate classification of object proposals.
We propose a simple calibration framework to more effectively alleviate classification head bias with a bi-level class balanced sampling approach.
arXiv Detail & Related papers (2020-07-23T12:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.