Learning code summarization from a small and local dataset
- URL: http://arxiv.org/abs/2206.00804v1
- Date: Thu, 2 Jun 2022 00:16:03 GMT
- Title: Learning code summarization from a small and local dataset
- Authors: Toufique Ahmed and Premkumar Devanbu
- Abstract summary: Training on project-specific data, and testing on the same project, is a promising idea.
We compare several models and training approaches, including same-project training, cross-project training, training a model especially designed to be sample efficient, and a maximalist hybrid approach that fine-tunes on many projects before training on the target project.
We find that the maximalist hybrid setting provides consistent, substantial gains over the state-of-the-art.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Foundation models (e.g., CodeBERT, GraphCodeBERT, CodeT5) work well for many
software engineering tasks. These models are pre-trained (using
self-supervision) with billions of code tokens, and then fine-tuned with
hundreds of thousands of labeled examples, typically drawn from many projects.
However, software phenomena can be very project-specific. Vocabulary and other
phenomena vary substantially with each project. Thus, training on
project-specific data, and testing on the same project, is a promising idea.
This hypothesis has to be evaluated carefully, e.g., in a time-series setting,
to prevent training-test leakage. We compare several models and training
approaches, including same-project training, cross-project training, training a
model especially designed to be sample efficient (and thus prima facie
well-suited for learning in a limited-sample same-project setting) and a
maximalist hybrid approach, fine-tuning first on many projects in many
languages and then training on the same project. We find that the maximalist
hybrid setting provides consistent, substantial gains over the
state-of-the-art, on many different projects in both Java and Python.
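To make the evaluation protocol concrete, below is a minimal sketch, assuming a time-ordered same-project split and a two-stage fine-tuning schedule; it is not the authors' released code, and the names (`Example`, `time_ordered_split`, `fine_tune`, `maximalist_hybrid`) are hypothetical. The split keeps every test example chronologically after every training example, which is the leakage guard the abstract refers to, and the hybrid function mirrors "fine-tune on many projects first, then on the target project."

```python
# Hypothetical sketch of the time-series split and the two-stage ("maximalist
# hybrid") fine-tuning schedule described in the abstract; not the paper's code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    commit_time: float   # Unix timestamp of the commit that introduced the pair
    code: str            # function body to summarize
    summary: str         # reference summary / docstring


def time_ordered_split(examples: List[Example],
                       train_fraction: float = 0.8) -> Tuple[List[Example], List[Example]]:
    """Chronological split of one project's data so that every test example
    postdates every training example (no training-test leakage)."""
    ordered = sorted(examples, key=lambda e: e.commit_time)
    cut = int(len(ordered) * train_fraction)
    train, test = ordered[:cut], ordered[cut:]
    if train and test:
        assert train[-1].commit_time <= test[0].commit_time
    return train, test


def fine_tune(model, data: List[Example]):
    """Placeholder for an ordinary seq2seq fine-tuning pass (e.g., CodeT5 with
    cross-entropy on the summary tokens); omitted to keep the sketch short."""
    return model


def maximalist_hybrid(model, cross_project_data: List[Example],
                      same_project_train: List[Example]):
    """Two-stage schedule: broad multi-project, multi-language fine-tuning first,
    then a short pass on the target project's chronologically earlier slice."""
    model = fine_tune(model, cross_project_data)   # stage 1: many projects, many languages
    model = fine_tune(model, same_project_train)   # stage 2: the target project only
    return model
```

Under this protocol the same-project test slice always comes from the project's latest commits, so any gain from the second stage cannot be explained by the model having seen future code.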
Related papers
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding [62.68430939686566]
We present ParallelSpec, an alternative to auto-regressive drafting strategies in state-of-the-art speculative decoding approaches.
In contrast to auto-regressive drafting in the speculative stage, we train a parallel drafter to serve as an efficient speculative model.
arXiv Detail & Related papers (2024-10-08T01:05:08Z)
- Pre-Trained Vision-Language Models as Partial Annotators [40.89255396643592]
Pre-trained vision-language models learn from massive data to model unified representations of images and natural languages.
In this paper, we investigate a novel "pre-trained annotating - weakly-supervised learning" paradigm for pre-trained model application and experiment on image classification tasks.
arXiv Detail & Related papers (2024-05-23T17:17:27Z)
- DevEval: Evaluating Code Generation in Practical Software Projects [52.16841274646796]
We propose a new benchmark named DevEval, aligned with developers' experiences in practical projects.
DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects.
We assess five popular LLMs on DevEval and reveal their actual abilities in code generation.
arXiv Detail & Related papers (2024-01-12T06:51:30Z)
- Few-shot learning for sentence pair classification and its applications in software engineering [0.36832029288386137]
This work investigates the performance of alternative few-shot learning approaches with BERT-based models.
Vanilla fine-tuning, PET, and SetFit are compared across numerous BERT-based checkpoints over an array of training set sizes.
Our results establish PET as a strong few-shot learning approach, and our analysis shows that with just a few hundred labeled examples it can achieve performance near that of fine-tuning on full-sized data sets.
arXiv Detail & Related papers (2023-06-13T18:23:52Z)
- TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning. (A minimal sketch of this freezing scheme appears after this list.)
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Few-shot training LLMs for project-specific code-summarization [0.0]
We investigate the use of few-shot training with the very large GPT (Generative Pre-trained Transformer) Codex model.
We find evidence suggesting that one can significantly surpass state-of-the-art models for code-summarization.
arXiv Detail & Related papers (2022-07-09T09:57:11Z)
- Assessing Project-Level Fine-Tuning of ML4SE Models [3.699097874146491]
We investigate how the model's quality can be improved if we target a specific project.
We evaluate three models of different complexity and compare their quality in three settings.
Per-project fine-tuning can greatly improve the models' quality as they capture the project's domain and naming conventions.
arXiv Detail & Related papers (2022-06-07T14:19:33Z)
- Multi-Task Self-Training for Learning General Representations [97.01728635294879]
Multi-task self-training (MuST) harnesses the knowledge in independent specialized teacher models to train a single general student model.
MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets.
arXiv Detail & Related papers (2021-08-25T17:20:50Z)
- How to Design Sample and Computationally Efficient VQA Models [53.65668097847456]
We find that representing the text as probabilistic programs and images as object-level scene graphs best satisfy these desiderata.
We extend existing models to leverage these soft programs and scene graphs to train on question answer pairs in an end-to-end manner.
arXiv Detail & Related papers (2021-03-22T01:48:16Z)
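As a companion to the eP-ALM entry above, here is a hedged PyTorch sketch of the general freezing scheme that summary describes; it is not the authors' implementation, and the module name (`PerceptualPrefix`), the dimensions, and the stand-in "frozen LM" are placeholders. The pretrained model is frozen, perceptual features are mapped into the embedding space by a single trainable linear layer, and one trainable soft token is prepended.

```python
# Hypothetical sketch of a "freeze everything, train one projection and one
# soft token" scheme, as described in the eP-ALM summary; not the authors' code.
import torch
import torch.nn as nn


class PerceptualPrefix(nn.Module):
    def __init__(self, frozen_lm: nn.Module, vision_dim: int, embed_dim: int):
        super().__init__()
        self.lm = frozen_lm
        for p in self.lm.parameters():        # freeze the pretrained model
            p.requires_grad = False
        self.proj = nn.Linear(vision_dim, embed_dim)                   # only trainable layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, embed_dim))   # one trainable token

    def forward(self, vision_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, vision_dim); text_embeds: (batch, seq, embed_dim)
        visual = self.proj(vision_feats).unsqueeze(1)                  # (batch, 1, embed_dim)
        token = self.soft_token.expand(text_embeds.size(0), -1, -1)    # (batch, 1, embed_dim)
        inputs = torch.cat([token, visual, text_embeds], dim=1)        # prepend token + percept
        return self.lm(inputs)


# Toy usage with a stand-in "frozen LM" (a single linear layer) just to show shapes.
if __name__ == "__main__":
    lm = nn.Linear(768, 768)
    model = PerceptualPrefix(lm, vision_dim=512, embed_dim=768)
    out = model(torch.randn(2, 512), torch.randn(2, 16, 768))
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(out.shape, trainable)  # only the projection and the soft token are trainable
```

Counting trainable parameters confirms that only the projection and the soft token are updated, which is the sense in which essentially all of the pretrained model stays frozen.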