Benchmarking Language Models for Code Syntax Understanding
- URL: http://arxiv.org/abs/2210.14473v1
- Date: Wed, 26 Oct 2022 04:47:18 GMT
- Title: Benchmarking Language Models for Code Syntax Understanding
- Authors: Da Shen, Xinyun Chen, Chenguang Wang, Koushik Sen, Dawn Song
- Abstract summary: Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding.
In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs.
Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
- Score: 79.11525961219591
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding, yet they represent the input as a token sequence without explicitly modeling its structure. Prior work has shown that pre-trained language models can capture the syntactic rules of natural languages without finetuning on syntax understanding tasks. However, how well pre-trained models understand code structure remains largely unexplored. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models on identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pretrained on code still lack an understanding of code syntax: they fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages and suggest the importance of modeling code syntactic structures.
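To make the setup concrete, below is a minimal sketch of the two ingredients the abstract describes: extracting parent-child relations from a program's abstract syntax tree (roughly the kind of annotation CodeSyntax provides) and a positional-offset baseline that ignores syntax entirely. The relation format and the `offset_baseline` helper are illustrative assumptions, not the dataset's actual schema.

```python
import ast

source = "x = foo(1) + y"
tree = ast.parse(source)

# Enumerate AST edges as (parent_type, field, child_type) triples; CodeSyntax
# annotates token pairs with relations of roughly this kind (format assumed).
relations = []
for node in ast.walk(tree):
    for field, value in ast.iter_fields(node):
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, ast.AST):
                relations.append((type(node).__name__, field, type(child).__name__))

print(relations)
# [('Module', 'body', 'Assign'), ('Assign', 'targets', 'Name'), ...]

# A syntax-free baseline of the kind the abstract mentions: always predict the
# token a fixed offset away from the head token (hypothetical helper).
def offset_baseline(tokens, head_index, k=2):
    j = head_index + k
    return tokens[j] if j < len(tokens) else None

print(offset_baseline("x = foo ( 1 ) + y".split(), 0))  # 'foo'
```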
Related papers
- Structured Code Representations Enable Data-Efficient Adaptation of Code Language Models [45.588949280419584]
We explore data-efficient adaptation of pre-trained code models by further pre-training and fine-tuning them with program structures.
Although the models that we adapt have been pre-trained only on the surface form of programs, we find that a small amount of continual pre-training and fine-tuning on concrete syntax trees (CSTs), without changing the model architecture, yields improvements over the baseline approach across various code tasks.
arXiv Detail & Related papers (2024-01-19T14:27:44Z)
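As a rough illustration of the continual pre-training idea in the entry above, the sketch below linearizes a syntax tree into a bracketed token stream that a standard sequence model could consume unchanged. It uses Python's stdlib `ast` (an abstract syntax tree) rather than the concrete syntax trees the paper adapts on, and the serialization format is an assumption.

```python
import ast

def linearize(node):
    """Depth-first serialization: (NodeType child1 child2 ...)."""
    children = [linearize(c) for c in ast.iter_child_nodes(node)]
    if children:
        return f"({type(node).__name__} {' '.join(children)})"
    return f"({type(node).__name__})"

# The linearized string can be fed to a sequence model as-is.
print(linearize(ast.parse("x = 1 + 2")))
# (Module (Assign (Name (Store)) (BinOp (Constant) (Add) (Constant))))
```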
- Wave to Syntax: Probing spoken language models for syntax [16.643072915927313]
We focus on the encoding of syntax in several self-supervised and visually grounded models of spoken language.
We show that syntax is captured most prominently in the middle layers of the networks, and more explicitly within models with more parameters.
arXiv Detail & Related papers (2023-05-30T11:43:18Z)
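A standard way to localize syntax by layer, as in the probing study above, is to fit a simple classifier on frozen hidden states from each layer and compare held-out accuracy. The sketch below assumes the representations have already been extracted; the shapes and the probing target are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(hidden_states, labels):
    """hidden_states: one (n_examples, dim) array per layer; labels: (n_examples,)."""
    scores = []
    for layer_repr in hidden_states:
        # Train a linear probe on frozen features, score on a held-out split.
        X_tr, X_te, y_tr, y_te = train_test_split(layer_repr, labels, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return scores  # a mid-layer peak would mirror the finding above
```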
- Physics of Language Models: Part 1, Learning Hierarchical Language Structures [51.68385617116854]
Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge.
We introduce a family of synthetic context-free grammars (CFGs) that produce hierarchical rules and can generate lengthy sentences.
We demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it.
arXiv Detail & Related papers (2023-05-23T04:28:16Z)
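For intuition, here is a minimal sampler for a toy context-free grammar of the kind such synthetic setups rely on; the grammar itself is invented for illustration.

```python
import random

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["det", "noun"], ["det", "adj", "noun"]],
    "VP": [["verb", "NP"], ["verb"]],
}
TERMINALS = {
    "det": ["the", "a"], "adj": ["red", "small"],
    "noun": ["cat", "dog"], "verb": ["sees", "chases"],
}

def sample(symbol="S"):
    """Expand a nonterminal with a random rule until only terminals remain."""
    if symbol in TERMINALS:
        return [random.choice(TERMINALS[symbol])]
    return [tok for sym in random.choice(GRAMMAR[symbol]) for tok in sample(sym)]

print(" ".join(sample()))  # e.g. "the small dog chases a cat"
```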
- On Robustness of Prompt-based Semantic Parsing with Large Pre-trained Language Model: An Empirical Study on Codex [48.588772371355816]
This paper presents the first empirical study on the adversarial robustness of a large prompt-based language model of code, Codex.
Our results demonstrate that the state-of-the-art (SOTA) code-language models are vulnerable to carefully crafted adversarial examples.
arXiv Detail & Related papers (2023-01-30T13:21:00Z)
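One common family of semantics-preserving perturbations in such robustness studies is identifier renaming; the sketch below illustrates the general idea only and is not assumed to match the paper's exact attacks.

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables per a mapping; program behavior is unchanged."""
    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

code = "total = price * count"
tree = RenameIdentifiers({"price": "a1", "count": "a2"}).visit(ast.parse(code))
print(ast.unparse(tree))  # total = a1 * a2 -- same semantics, new surface form
```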
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
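The constrained-output idea above can be sketched as masking invalid continuations at each decoding step; `is_valid_prefix` below is a stand-in assumption for BenchCLAMP's grammar-based machinery.

```python
def constrained_greedy_step(logits, vocab, prefix, is_valid_prefix):
    """Pick the highest-scoring token whose continuation stays valid."""
    valid = {tok: score for tok, score in zip(vocab, logits)
             if is_valid_prefix(prefix + [tok])}
    return max(valid, key=valid.get) if valid else None

# Toy usage with a hypothetical validity rule: only digit tokens are allowed.
is_valid = lambda seq: all(tok.isdigit() for tok in seq)
print(constrained_greedy_step([0.9, 0.5], ["x", "7"], ["4"], is_valid))  # '7'
```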
- Few-shot Subgoal Planning with Language Models [58.11102061150875]
We show that language priors encoded in pre-trained language models allow us to infer fine-grained subgoal sequences.
In contrast to recent methods which make strong assumptions about subgoal supervision, our experiments show that language models can infer detailed subgoal sequences without any fine-tuning.
arXiv Detail & Related papers (2022-05-28T01:03:30Z)
- Probing Linguistic Information For Logical Inference In Pre-trained Language Models [2.4366811507669124]
We propose a methodology for probing linguistic information for logical inference in pre-trained language model representations.
We find that pre-trained language models encode several types of linguistic information for inference, though some types are only weakly encoded.
We demonstrate language models' potential as semantic and background knowledge bases for supporting symbolic inference methods.
arXiv Detail & Related papers (2021-12-03T07:19:42Z)
- Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantic parsers.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z)
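The controlled-sublanguage idea above can be sketched as a small set of canonical patterns, each deterministically mapped to a meaning representation, with the language model responsible only for paraphrasing into those patterns. The pattern and target form below are invented for illustration.

```python
import re

# One canonical-English pattern and its deterministic mapping to a logical form.
RULES = [
    (re.compile(r"list flights from (\w+) to (\w+)"),
     lambda m: f"flights(src={m.group(1)!r}, dst={m.group(2)!r})"),
]

def canonical_to_lf(utterance):
    """Map a canonical utterance to its meaning representation, if any."""
    for pattern, build in RULES:
        m = pattern.fullmatch(utterance)
        if m:
            return build(m)
    return None

print(canonical_to_lf("list flights from BOS to SFO"))
# flights(src='BOS', dst='SFO')
```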
- Language-Agnostic Representation Learning of Source Code from Structure and Context [43.99281651828355]
We propose a new model that jointly learns on the Context and Structure of source code.
We show that jointly training on non-parallel data from multiple programming languages improves results on all individual languages.
arXiv Detail & Related papers (2021-03-21T06:46:06Z)
- Learning Music Helps You Read: Using Transfer to Study Linguistic Structure in Language Models [27.91397366776451]
Training LSTMs on latent structure (MIDI music or Java code) improves test performance on natural language.
Experiments on transfer between natural languages controlling for vocabulary overlap show that zero-shot performance on a test language is highly correlated with typological similarity to the training language.
arXiv Detail & Related papers (2020-04-30T06:24:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.