Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models
- URL: http://arxiv.org/abs/2401.10716v1
- Date: Fri, 19 Jan 2024 14:27:44 GMT
- Title: Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models
- Authors: Mayank Agarwal, Yikang Shen, Bailin Wang, Yoon Kim, Jie Chen
- Abstract summary: We explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures.
Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on CSTs without changing the model architecture yields improvements over the baseline approach across various code tasks.
- Score: 45.588949280419584
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current language models tailored for code tasks often adopt the
pre-training-then-fine-tuning paradigm from natural language processing,
modeling source code as plain text. This approach, however, overlooks the
unambiguous structures inherent in programming languages. In this work, we
explore data-efficient adaptation of pre-trained code models by further
pre-training and fine-tuning them with program structures. Specifically, we
represent programs as parse trees -- also known as concrete syntax trees (CSTs)
-- and adapt pre-trained models on serialized CSTs. Although the models that we
adapt have been pre-trained only on the surface form of programs, we find that
a small amount of continual pre-training and fine-tuning on CSTs without
changing the model architecture yields improvements over the baseline approach
across various code tasks. The improvements are found to be particularly
significant when there are limited training examples, demonstrating the
effectiveness of integrating program structures with plain-text representation
even when working with backbone models that have not been pre-trained with
structures.
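To make the "serialized CSTs" idea concrete, the sketch below linearizes a parse tree into a bracketed token sequence that a standard sequence model could consume. It is a minimal illustration only: it uses Python's built-in ast module (an abstract rather than concrete syntax tree) and an ad-hoc bracketing scheme, so the serialize helper and its output format are assumptions for illustration, not the paper's exact serialization.

```python
# Minimal sketch: linearize a parse tree into a bracketed token sequence.
# Assumption: Python's built-in ast module (an *abstract* syntax tree) and an
# ad-hoc bracketing scheme stand in for the paper's concrete syntax trees
# (e.g. from an off-the-shelf parser) and its own serialization format.
import ast


def serialize(node: ast.AST) -> str:
    """Recursively emit '(NodeType surface_token children...)' for a tree."""
    label = type(node).__name__
    # Attach a surface token (identifier or function name) where one exists.
    token = getattr(node, "id", None) or getattr(node, "name", None)
    parts = [label] + ([str(token)] if token else [])
    parts += [serialize(child) for child in ast.iter_child_nodes(node)]
    return "(" + " ".join(parts) + ")"


source = "def add(a, b):\n    return a + b\n"
print(serialize(ast.parse(source)))
# Prints a bracketed linearization such as:
# (Module (FunctionDef add (arguments (arg) (arg)) (Return (BinOp ...))))
```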
Related papers
- Text-to-Code Generation with Modality-relative Pre-training [6.546893206010636]
We investigate how sequence tokens can be adapted depending on which modality they belong to.
We focus on text-to-code generation and observe consistent improvements across two backbone models and two test sets.
arXiv Detail & Related papers (2024-02-08T16:17:24Z)
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
- Better Language Models of Code through Self-Improvement [18.75015225501755]
We propose a simple data augmentation framework for pre-trained language models for code (PLMCs).
Our framework utilizes knowledge gained during the pre-training and fine-tuning stages to generate pseudo data, which is then used as training data for the next step.
The results show that our framework significantly improves PLMCs' performance in code-related sequence generation tasks.
arXiv Detail & Related papers (2023-04-02T10:59:19Z)
- Autoregressive Structured Prediction with Language Models [73.11519625765301]
We describe an approach to model structures as sequences of actions in an autoregressive manner with PLMs.
Our approach achieves a new state of the art on all of the structured prediction tasks we consider.
arXiv Detail & Related papers (2022-10-26T13:27:26Z)
- Benchmarking Language Models for Code Syntax Understanding [79.11525961219591]
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding.
In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs.
Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
arXiv Detail & Related papers (2022-10-26T04:47:18Z)
- DeepStruct: Pretraining of Language Models for Structure Prediction [64.84144849119554]
We pretrain language models on a collection of task-agnostic corpora to generate structures from text.
This structure pretraining enables zero-shot transfer of the knowledge the models acquire about structure tasks.
We show that a 10B parameter language model transfers non-trivially to most tasks and obtains state-of-the-art performance on 21 of 28 datasets.
arXiv Detail & Related papers (2022-05-21T00:58:22Z)
- Synchromesh: Reliable code generation from pre-trained language models [38.15391794443022]
We propose Synchromesh: a framework for substantially improving the reliability of pre-trained models for code generation.
First, it retrieves few-shot examples from a training bank using Target Similarity Tuning (TST), a novel method for semantic example selection.
Then, Synchromesh feeds the examples to a pre-trained language model and samples programs using Constrained Semantic Decoding (CSD), a general framework for constraining the output to a set of valid programs in the target language; a simplified sketch of this output-constraining idea appears after this list.
arXiv Detail & Related papers (2022-01-26T22:57:44Z)
- Pre-Training a Language Model Without Human Language [74.11825654535895]
We study how the intrinsic nature of pre-training data contributes to the fine-tuned downstream performance.
We find that models pre-trained on unstructured data beat those trained directly from scratch on downstream tasks.
Surprisingly, we find that pre-training on certain non-human language data yields GLUE performance close to that of models pre-trained on a non-English natural language.
arXiv Detail & Related papers (2020-12-22T13:38:06Z)
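The Synchromesh entry above hinges on constraining generation to valid programs. The sketch below is a deliberately simplified, hypothetical stand-in for that idea: instead of the paper's Constrained Semantic Decoding, which prunes invalid continuations during decoding, it merely rejection-filters complete samples that fail a syntax check, using Python as the stand-in target language and sample_candidates as a hypothetical LM sampling callback.

```python
# Simplified sketch of "constrain the output to valid programs": reject whole
# samples that fail a syntax check. Synchromesh's Constrained Semantic Decoding
# instead prunes invalid continuations token by token during decoding;
# `sample_candidates` is a hypothetical stand-in for an LM sampling call, and
# Python serves as the stand-in target language.
import ast
from typing import Callable, List, Optional


def first_valid_program(
    sample_candidates: Callable[[str, int], List[str]],
    prompt: str,
    num_samples: int = 16,
) -> Optional[str]:
    """Return the first sampled completion that parses as valid Python."""
    for candidate in sample_candidates(prompt, num_samples):
        try:
            ast.parse(candidate)  # syntactic validity check only
            return candidate
        except SyntaxError:
            continue
    return None


# Toy usage: the first candidate is syntactically broken, the second is kept.
def toy_sampler(prompt: str, n: int) -> List[str]:
    return ["def f(x): return x +", "def f(x):\n    return x + 1\n"][:n]


print(first_valid_program(toy_sampler, "Write a function that adds 1 to x."))
```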
This list is automatically generated from the titles and abstracts of the papers on this site.