What do pre-trained code models know about code?
- URL: http://arxiv.org/abs/2108.11308v1
- Date: Wed, 25 Aug 2021 16:20:17 GMT
- Title: What do pre-trained code models know about code?
- Authors: Anjan Karmakar, Romain Robbes
- Abstract summary: We use diagnostic tasks called probes to investigate pre-trained code models.
BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow) are investigated.
- Score: 9.60966128833701
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained models of code built on the transformer architecture have performed well on software engineering (SE) tasks such as predictive code generation and code summarization, among others. However, whether the vector representations from these pre-trained models comprehensively encode characteristics of source code well enough to be applicable to a broad spectrum of downstream tasks remains an open question.
One way to investigate this is with diagnostic tasks called probes. In this paper, we construct four probing tasks (probing for surface-level, syntactic, structural, and semantic information) for pre-trained code models. We show how probes can be used to identify whether models are deficient in (understanding) certain code properties, characterize different model layers, and gain insight into model sample-efficiency.
We probe four models that vary in their expected knowledge of code properties: BERT (pre-trained on English), CodeBERT and CodeBERTa (pre-trained on source code and natural language documentation), and GraphCodeBERT (pre-trained on source code with dataflow). While GraphCodeBERT performs more consistently overall, we find that BERT performs surprisingly well on some code tasks, which calls for further investigation.
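To make the probing setup concrete, here is a minimal sketch of a linear probe trained on frozen CodeBERT representations. The surface-level task (binned snippet length), the toy data, and the mean-pooling choice are illustrative assumptions, not the paper's exact protocol.
```python
# Minimal probing sketch: a linear classifier on frozen embeddings.
# The surface-level task (binned code length) and the toy data are illustrative;
# the paper's actual probing tasks and datasets may differ.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

snippets = ["def add(a, b): return a + b",
            "for i in range(10):\n    print(i * i)",
            "x = [n for n in data if n > 0]"]
labels = [0, 1, 0]  # hypothetical surface-level labels (e.g., binned token length)

def embed(code: str, layer: int = -1) -> torch.Tensor:
    """Mean-pool the hidden states of one (frozen) transformer layer."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[layer]
    return hidden.mean(dim=1).squeeze(0)

X = torch.stack([embed(s) for s in snippets]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)  # the probe itself stays linear
print("probe accuracy on the toy set:", probe.score(X, labels))
```
Running the same probe against different indices of `hidden_states` is one way to characterize what individual layers encode, in the spirit of the layer-wise analysis mentioned in the abstract.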
Related papers
- Does Your Neural Code Completion Model Use My Code? A Membership Inference Approach [66.51005288743153]
We investigate the legal and ethical issues of current neural code completion models.
We tailor a membership inference approach (termed CodeMI) that was originally crafted for classification tasks.
We evaluate the effectiveness of this adapted approach across a diverse array of neural code completion models.
arXiv Detail & Related papers (2024-04-22T15:54:53Z)
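CodeMI's exact construction is not detailed above; as a generic illustration of membership inference, the sketch below thresholds a language model's per-sample loss, on the intuition that training members tend to receive lower loss. The stand-in model (gpt2), the threshold, and the snippet are assumptions, not the paper's method.
```python
# Illustrative membership-inference baseline: low loss on a sample is weak
# evidence that the sample was seen during training. A generic sketch only,
# not the CodeMI approach from the paper.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # stand-in completion model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sample_loss(code: str) -> float:
    """Average next-token loss of the target model on one snippet."""
    ids = tokenizer(code, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return out.loss.item()

def likely_member(code: str, threshold: float = 2.5) -> bool:
    """Hypothetical decision rule: flag low-loss samples as probable members."""
    return sample_loss(code) < threshold

print(likely_member("def factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)"))
```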
- INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers [7.255653248042546]
We use a framework to define 15 probing tasks that exercise surface, syntactic, structural and semantic characteristics of source code.
We probe 8 pre-trained source code models, as well as a natural language model (BERT) as our baseline.
We find that models that incorporate some structural information (such as GraphCodeBERT) have a better representation of source code characteristics.
arXiv Detail & Related papers (2023-12-08T15:21:54Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
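As a rough illustration of mutation-based augmentation, the sketch below applies one behavior-preserving mutation (renaming a local variable) to produce an additional Python sample; the mutation set, and whether CodeExecutor uses this particular transform, are assumptions.
```python
# Toy mutation-based augmentation: rename a local variable to create a new,
# behavior-preserving Python sample (the execution trace is unchanged).
# The real pipeline likely uses a much richer set of mutations.
import ast

class RenameVariables(ast.NodeTransformer):
    """Rewrite every Name node according to a fixed mapping."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        if node.id in self.mapping:
            node.id = self.mapping[node.id]
        return node

source = "def area(w, h):\n    result = w * h\n    return result\n"
tree = ast.parse(source)
mutant = RenameVariables({"result": "acc"}).visit(tree)
ast.fix_missing_locations(mutant)
print(ast.unparse(mutant))  # augmented sample with the same execution behavior
```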
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand by adding contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
arXiv Detail & Related papers (2023-04-11T13:34:13Z)
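A minimal sketch of fine-tuning with frozen lower layers, in the spirit of the layer-freezing idea above; the cut-off layer and the base model are illustrative assumptions rather than Telly's actual recipe.
```python
# Sketch of fine-tuning with the lower encoder layers frozen. The cut-off
# layer and the base model are illustrative choices, not Telly's exact setup.
from transformers import AutoModel

model = AutoModel.from_pretrained("microsoft/codebert-base")
FREEZE_BELOW = 8  # hypothetical: keep layers 0-7 fixed, tune the rest

for param in model.embeddings.parameters():
    param.requires_grad = False
for idx, layer in enumerate(model.encoder.layer):
    if idx < FREEZE_BELOW:
        for param in layer.parameters():
            param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable}/{total}")
```
Only the parameters that still require gradients are then handed to the optimizer, which is where the computational savings during fine-tuning come from.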
- CodeExp: Explanatory Code Document Generation [94.43677536210465]
Existing code-to-text generation models produce only high-level summaries of code.
We conduct a human study to identify the criteria for high-quality explanatory docstrings for code.
We present a multi-stage fine-tuning strategy and baseline models for the task.
arXiv Detail & Related papers (2022-11-25T18:05:44Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
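A minimal sketch of the in-batch contrastive objective commonly used for code search (an InfoNCE-style loss over paired query and code embeddings); the temperature, the embedding size, and the random placeholders standing in for encoder outputs are assumptions, and the paper's soft data augmentation is not reproduced here.
```python
# Minimal in-batch contrastive loss for code search: matched (query, code)
# pairs are pulled together, everything else in the batch acts as a negative.
# Encoders and temperature are placeholders, not the paper's setup.
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, code_emb: torch.Tensor, temperature: float = 0.07):
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(code_emb, dim=-1)
    logits = q @ c.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(q.size(0))           # diagonal entries are the true pairs
    return F.cross_entropy(logits, targets)

# Toy usage with random "embeddings" standing in for encoder outputs.
queries = torch.randn(4, 768)
codes = torch.randn(4, 768)
print(info_nce(queries, codes).item())
```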
- Probing Pretrained Models of Source Code [14.904366372190943]
General pretrained models have been shown to outperform task-specific models in many applications.
We show that pretrained models of code indeed contain information about code syntactic structure and correctness, the notions of identifiers, data flow, and natural language naming.
arXiv Detail & Related papers (2022-02-16T10:26:14Z)
- InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees [17.461451218469062]
This paper proposes InferCode, which overcomes this limitation by adapting a self-supervised learning mechanism to build a source code model.
InferCode treats subtrees in ASTs as the labels for training code representations, without any human labeling effort or the overhead of expensive graph construction.
Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, the pre-trained InferCode model achieves higher performance.
arXiv Detail & Related papers (2020-12-13T10:33:41Z)
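To illustrate the subtree-prediction signal described above, the sketch below enumerates AST subtrees of a Python snippet as pseudo-labels; relying on Python's ast module and on dumped node strings as subtree identities are simplifications, not InferCode's actual subtree vocabulary construction.
```python
# Sketch of subtree extraction as self-supervised labels: every AST subtree
# (identified here by its dumped shape) becomes a prediction target, so no
# human labeling is required. InferCode's real subtree vocabulary differs.
import ast
from collections import Counter

def subtree_labels(source: str) -> Counter:
    tree = ast.parse(source)
    labels = Counter()
    for node in ast.walk(tree):
        # A compact string form of the subtree rooted at this node.
        labels[ast.dump(node, annotate_fields=False)] += 1
    return labels

code = "def square(x):\n    return x * x\n"
for subtree, count in subtree_labels(code).most_common(3):
    print(count, subtree[:60])
```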
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
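A toy approximation of the "where-the-value-comes-from" relation for a straight-line Python snippet; GraphCodeBERT extracts its data flow with a proper multi-language parser, so treat this only as an illustration of the kind of edge being encoded.
```python
# Rough illustration of "where-the-value-comes-from" edges: for each variable
# use, link it back to the most recent assignment of that name in source order.
# GraphCodeBERT derives such edges with a real parser; this is a toy version.
import ast

def dataflow_edges(source: str):
    tree = ast.parse(source)
    # Visit Name nodes in source order so "last assignment" is well defined.
    names = sorted(
        (n for n in ast.walk(tree) if isinstance(n, ast.Name)),
        key=lambda n: (n.lineno, n.col_offset),
    )
    last_def, edges = {}, []
    for node in names:
        if isinstance(node.ctx, ast.Load) and node.id in last_def:
            edges.append((node.id, last_def[node.id], node.lineno))
        if isinstance(node.ctx, ast.Store):
            last_def[node.id] = node.lineno
    return edges

snippet = "a = 1\nb = a + 2\nc = a + b\n"
for name, src_line, use_line in dataflow_edges(snippet):
    print(f"{name}: value defined on line {src_line} flows to line {use_line}")
```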
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.