Related papers: Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?

URL: http://arxiv.org/abs/2502.07611v1
Date: Tue, 11 Feb 2025 15:02:19 GMT
Title: Optimizing Datasets for Code Summarization: Is Code-Comment Coherence Enough?
Authors: Antonio Vitale, Antonio Mastropaolo, Rocco Oliveto, Massimiliano Di Penta, Simone Scalabrino
Abstract summary: We explore the extent to which code-comment coherence, a specific quality attribute of code summaries, can be used to optimize code summarization datasets. We examine multiple selectivity levels of training instances from two state-of-the-art datasets (TL-CodeSum and Funcom) and evaluate the resulting models on three manually curated test sets.
Score: 11.865113785648932
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automated code summarization is a long-standing goal for code comprehension. This task automatically generates documentation using a given method. Deep Learning (DL)-based approaches have been proven beneficial for various software engineering (SE) tasks, including this one. Most state-of-the-art datasets for code summarization are automatically mined from GitHub and, thus, might contain erroneous or sub-optimal examples. Previous work showed that using a simple rule-based approach for removing noisy instances allows for a tangible reduction of the training set size while not reducing the effectiveness of the trained models. Motivated by this finding, we conjecture that it is possible to further reduce the dataset size by removing instances that contain different issues. In this paper, we explore the extent to which code-comment coherence, a specific quality attribute of code summaries, can be used to optimize code summarization datasets. Specifically, we hypothesize that removing incoherent code-comment pairs might positively impact the effectiveness of the models. To do this, we rely on SIDE, a recently introduced metric for code-summary coherence. We examine multiple selectivity levels of training instances from two state-of-the-art datasets (TL-CodeSum and Funcom) and evaluate the resulting models on three manually curated test sets. The results show that even halving the training set sizes does not significantly affect the model's ability to generate summaries. However, when comparing the most restrictive selection strategy with a simpler one that randomly selects the training instances, we observe that the resulting accuracy of the model also does not change. This result suggests that (i) current datasets contain many irrelevant examples, and (ii) different quality attributes should be explored for optimizing code summarization datasets.
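
The selection strategies compared in the abstract (keeping only pairs above a coherence threshold versus drawing a random subset of the same size) can be illustrated with a minimal sketch. This is not the authors' pipeline: the coherence scores are assumed to be precomputed (for example with SIDE), and the names `select_coherent`, `select_random`, the threshold value, and the toy data are all hypothetical.

```python
import random
from typing import List, Tuple

# (code, comment, coherence score). Scores are assumed to be precomputed,
# e.g. by a metric such as SIDE; the values below are made up for illustration.
Pair = Tuple[str, str, float]


def select_coherent(pairs: List[Pair], threshold: float) -> List[Pair]:
    """Keep only pairs whose coherence score is at least `threshold`."""
    return [p for p in pairs if p[2] >= threshold]


def select_random(pairs: List[Pair], k: int, seed: int = 0) -> List[Pair]:
    """Draw k pairs uniformly at random (the size-matched baseline)."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)


pairs: List[Pair] = [
    ("int add(int a, int b) { return a + b; }", "Adds two integers.", 0.9),
    ("void close() { stream.close(); }", "Closes the stream.", 0.8),
    ("int size() { return n; }", "TODO fix later", 0.1),
    ("void run() { start(); }", "See issue tracker.", 0.2),
]

# A stricter threshold corresponds to a more selective training set.
coherent = select_coherent(pairs, threshold=0.5)
baseline = select_random(pairs, k=len(coherent))
```

Training one model on `coherent` and another on `baseline`, then evaluating both on the same test set, mirrors the comparison the abstract describes at a single selectivity level.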

Related papers

Rethinking End-to-End 2D to 3D Scene Segmentation in Gaussian Splatting [86.15347226865826]
We design a new end-to-end object-aware lifting approach, named Unified-Lift. We augment each Gaussian point with an additional Gaussian-level feature learned using a contrastive loss to encode instance information. We conduct experiments on three benchmarks: LERF-Masked, Replica, and Messy Rooms.
arXiv Detail & Related papers (2025-03-18T08:42:23Z)
Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures [0.0]
This research introduces a novel ensemble learning approach for code similarity assessment. The key idea is that the strengths of a diverse set of similarity measures can complement each other and mitigate individual weaknesses.
arXiv Detail & Related papers (2024-05-03T13:42:49Z)
GistScore: Learning Better Representations for In-Context Example Selection with Gist Bottlenecks [3.9638110494107095]
In-context Learning (ICL) is the ability of Large Language Models (LLMs) to perform new tasks when conditioned on prompts. We propose Example Gisting, a novel approach for training example encoders through supervised fine-tuning. We show that our fine-tuned models get state-of-the-art ICL performance with over 20% absolute gain over off-the-shelf retrievers.
arXiv Detail & Related papers (2023-11-16T06:28:05Z)
Rethinking Negative Pairs in Code Search [56.23857828689406]
We propose a simple yet effective Soft-InfoNCE loss that inserts weight terms into InfoNCE. We analyze the effects of Soft-InfoNCE on controlling the distribution of learnt code representations and on deducing a more precise mutual information estimation.
arXiv Detail & Related papers (2023-10-12T06:32:42Z)
BaSAL: Size-Balanced Warm Start Active Learning for LiDAR Semantic Segmentation [2.9290232815049926]
Existing active learning methods overlook the severe class imbalance inherent in LiDAR semantic segmentation datasets. We propose BaSAL, a size-balanced warm start active learning model, based on the observation that each object class has a characteristic size. Results show that we are able to improve the performance of the initial model by a large margin.
arXiv Detail & Related papers (2023-10-12T05:03:19Z)
Boosting Commit Classification with Contrastive Learning [0.8655526882770742]
Commit Classification (CC) is an important task in software maintenance. We propose a contrastive learning-based commit classification framework. Our framework can solve the CC problem simply but effectively in few-shot scenarios.
arXiv Detail & Related papers (2023-08-16T10:02:36Z)
Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching. Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning [53.52699766206808]
We propose Retrieval for In-Context Learning (RetICL), a learnable method for modeling and optimally selecting examples sequentially for in-context learning. We evaluate RetICL on math word problem solving and scientific question answering tasks and show that it consistently outperforms or matches heuristic and learnable baselines.
arXiv Detail & Related papers (2023-05-23T20:15:56Z)
CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code. We conduct a human study to identify the criteria for high-quality explanatory docstring for code. We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
A Lagrangian Duality Approach to Active Learning [119.36233726867992]
We consider the batch active learning problem, where only a subset of the training data is labeled. We formulate the learning problem using constrained optimization, where each constraint bounds the performance of the model on labeled samples. We show, via numerical experiments, that our proposed approach performs similarly to or better than state-of-the-art active learning methods.
arXiv Detail & Related papers (2022-02-08T19:18:49Z)
Neural Code Summarization: How Far Are We? [30.324396716447602]
Deep learning techniques have been exploited to automatically generate summaries for given code snippets. In this paper, we conduct a systematic and in-depth analysis of five state-of-the-art neural source code summarization models.
arXiv Detail & Related papers (2021-07-15T04:33:59Z)
How to distribute data across tasks for meta-learning? [59.608652082495624]
We show that the optimal number of data points per task depends on the budget, but it converges to a unique constant value for large budgets. Our results suggest a simple and efficient procedure for data collection.
arXiv Detail & Related papers (2021-03-15T15:38:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
