Learning code summarization from a small and local dataset
- URL: http://arxiv.org/abs/2206.00804v1
- Date: Thu, 2 Jun 2022 00:16:03 GMT
- Title: Learning code summarization from a small and local dataset
- Authors: Toufique Ahmed and Premkumar Devanbu
- Abstract summary: Training on project-specific data, and testing on the same project, is a promising idea.
We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient, and a maximalist hybrid approach that fine-tunes on many projects before training on the target project.
We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many
software engineering tasks. These models are pre-trained (using
self-supervision) with billions of code tokens, and then fine-tuned with
hundreds of thousands of labeled examples, typically drawn from many projects.
However, software phenomena can be very project-specific. Vocabulary and other
phenomena vary substantially with each project. Thus, training on
project-specific data, and testing on the same project, is a promising idea.
This hypothesis has to be evaluated carefully, e.g., in a time-series setting,
to prevent training-test leakage. We compare several models and training
approaches, including same-project training, cross-project training, training a
model especially designed to be sample efficient (and thus prima facie
well-suited for learning in a limited-sample same-project setting) and a
maximalist hybrid approach, fine-tuning first on many projects in many
languages and then training on the same project. We find that the maximalist
hybrid setting provides consistent, substantial gains over the
state-of-the-art, on many different projects in both Java and Python.
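To make the evaluation protocol concrete, below is a minimal sketch, assuming a time-ordered same-project split and a two-stage fine-tuning schedule; it is not the authors' released code, and the names (`Example`, `time_ordered_split`, `fine_tune`, `maximalist_hybrid`) are hypothetical. The split keeps every test example chronologically after every training example, which is the leakage guard the abstract refers to, and the hybrid function mirrors "fine-tune on many projects first, then on the target project."

```python
# Hypothetical sketch of the time-series split and the two-stage ("maximalist
# hybrid") fine-tuning schedule described in the abstract; not the paper's code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    commit_time: float   # Unix timestamp of the commit that introduced the pair
    code: str            # function body to summarize
    summary: str         # reference summary / docstring


def time_ordered_split(examples: List[Example],
                       train_fraction: float = 0.8) -> Tuple[List[Example], List[Example]]:
    """Chronological split of one project's data so that every test example
    postdates every training example (no training-test leakage)."""
    ordered = sorted(examples, key=lambda e: e.commit_time)
    cut = int(len(ordered) * train_fraction)
    train, test = ordered[:cut], ordered[cut:]
    if train and test:
        assert train[-1].commit_time <= test[0].commit_time
    return train, test


def fine_tune(model, data: List[Example]):
    """Placeholder for an ordinary seq2seq fine-tuning pass (e.g., CodeT5 with
    cross-entropy on the summary tokens); omitted to keep the sketch short."""
    return model


def maximalist_hybrid(model, cross_project_data: List[Example],
                      same_project_train: List[Example]):
    """Two-stage schedule: broad multi-project, multi-language fine-tuning first,
    then a short pass on the target project's chronologically earlier slice."""
    model = fine_tune(model, cross_project_data)   # stage 1: many projects, many languages
    model = fine_tune(model, same_project_train)   # stage 2: the target project only
    return model
```

Under this protocol the same-project test slice always comes from the project's latest commits, so any gain from the second stage cannot be explained by the model having seen future code.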
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Pre-Trained Vision-Language Models as Partial Annotators [40.89255396643592]
Pre-trained vision-language models learn from massive data to model unified representations of images and natural languages.
In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for pre-trained model application and experiment on image classification tasks.
arXiv Detail & Related papers (2024-05-23T17:17:27Z)
- DevEval: Evaluating Code Generation in Practical Software Projects [52.16841274646796]
We propose a new benchmark named DevEval, aligned with developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
arXiv Detail & Related papers (2024-01-12T06:51:30Z)
- Few-shot learning for sentence pair classification and its applications in software engineering [0.36832029288386137]
This work investigates the performance of alternative few-shot learning approaches with BERT-based models.
Vanilla fine-tuning, PET, and SetFit are compared across numerous BERT-based checkpoints over an array of training set sizes.
Our results establish PET as a strong few-shot learning approach, and our analysis shows that with just a few hundred labeled examples it can achieve performance near that of fine-tuning on full-sized data sets.
arXiv Detail & Related papers (2023-06-13T18:23:52Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning. (A minimal sketch of this freezing scheme appears after this list.)
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Few-shot training LLMs for project-specific code-summarization [0.0]
We investigate the use of few-shot training with the very large GPT (Generative Pre-trained Transformer) Codex model.
We find evidence suggesting that one can significantly surpass state-of-the-art models for code-summarization.
arXiv Detail & Related papers (2022-07-09T09:57:11Z)
- Assessing Project-Level Fine-Tuning of ML4SE Models [3.699097874146491]
We investigate how the model's quality can be improved if we target a specific project.
We evaluate three models of different complexity and compare their quality in three settings.
Per-project fine-tuning can greatly improve the models' quality as they capture the project's domain and naming conventions.
arXiv Detail & Related papers (2022-06-07T14:19:33Z)
- Multi-Task Self-Training for Learning General Representations [97.01728635294879]
Multi-task self-training (MuST) harnesses the knowledge in independent specialized teacher models to train a single general student model.
MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets.
arXiv Detail & Related papers (2021-08-25T17:20:50Z)
- How to Design Sample and Computationally Efficient VQA Models [53.65668097847456]
We find that representing the text as probabilistic programs and images as object-level scene graphs best satisfy these desiderata.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
arXiv Detail & Related papers (2021-03-22T01:48:16Z)
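As a companion to the eP-ALM entry above, here is a hedged PyTorch sketch of the general freezing scheme that summary describes; it is not the authors' implementation, and the module name (`PerceptualPrefix`), the dimensions, and the stand-in "frozen LM" are placeholders. The pretrained model is frozen, perceptual features are mapped into the embedding space by a single trainable linear layer, and one trainable soft token is prepended.

```python
# Hypothetical sketch of a "freeze everything, train one projection and one
# soft token" scheme, as described in the eP-ALM summary; not the authors' code.
import torch
import torch.nn as nn


class PerceptualPrefix(nn.Module):
    def __init__(self, frozen_lm: nn.Module, vision_dim: int, embed_dim: int):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():        # freeze the pretrained model
            p.requires_grad = False
        self.proj = nn.Linear(vision_dim, embed_dim)                   # only trainable layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # one trainable token

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, vision_dim); text_embeds: (batch, seq, embed_dim)
        visual = self.proj(vision_feats).unsqueeze(1)                  # (batch, 1, embed_dim)
        token = self.soft_token.expand(text_embeds.size(0), -1, -1)    # (batch, 1, embed_dim)
        inputs = torch.cat([token, visual, text_embeds], dim=1)        # prepend token + percept
        return self.lm(inputs)


# Toy usage with a stand-in "frozen LM" (a single linear layer) just to show shapes.
if __name__ == "__main__":
    lm = nn.Linear(768, 768)
    model = PerceptualPrefix(lm, vision_dim=512, embed_dim=768)
    out = model(torch.randn(2, 512), torch.randn(2, 16, 768))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(out.shape, trainable)  # only the projection and the soft token are trainable
```

Counting trainable parameters confirms that only the projection and the soft token are updated, which is the sense in which essentially all of the pretrained model stays frozen.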