CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- URL: http://arxiv.org/abs/2002.08155v4
- Date: Fri, 18 Sep 2020 15:38:12 GMT
- Title: CodeBERT: A Pre-Trained Model for Programming and Natural Languages
- Authors: Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming
Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, Ming Zhou
- Abstract summary: CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
- Score: 117.34242908773061
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present CodeBERT, a bimodal pre-trained model for programming language
(PL) and natural language (NL). CodeBERT learns general-purpose
representations that support downstream NL-PL applications such as natural
language code search, code documentation generation, etc. We develop CodeBERT
with Transformer-based neural architecture, and train it with a hybrid
objective function that incorporates the pre-training task of replaced token
detection, which is to detect plausible alternatives sampled from generators.
This enables us to utilize both bimodal data of NL-PL pairs and unimodal data,
where the former provides input tokens for model training while the latter
helps to learn better generators. We evaluate CodeBERT on two NL-PL
applications by fine-tuning model parameters. Results show that CodeBERT
achieves state-of-the-art performance on both natural language code search and
code documentation generation tasks. Furthermore, to investigate what type of
knowledge is learned in CodeBERT, we construct a dataset for NL-PL probing, and
evaluate in a zero-shot setting where parameters of pre-trained models are
fixed. Results show that CodeBERT performs better than previous pre-trained
models on NL-PL probing.
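For readers who want to exercise the released encoder, the following is a minimal sketch, not the authors' reference implementation: it assumes the publicly available microsoft/codebert-base checkpoint on Hugging Face together with the transformers and torch packages, and scores a natural language query against a code snippet by cosine similarity of mean-pooled hidden states, loosely mirroring the zero-shot setting described above.

```python
# Minimal sketch (not the paper's reference code): zero-shot NL-PL scoring
# with the released CodeBERT encoder. Assumes the Hugging Face
# "microsoft/codebert-base" checkpoint plus the transformers and torch packages.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

query = "return the maximum value in a list"
snippet = "def find_max(xs):\n    return max(xs)"

score = torch.cosine_similarity(embed(query), embed(snippet), dim=0)
print(f"NL-PL similarity: {score.item():.4f}")
```

In the paper, code search is handled by fine-tuning model parameters rather than by this zero-shot pooling, so the mean-pooling and cosine scoring here are illustrative choices only.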
Related papers
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, we examine how a mixed distribution of programming and natural languages and multi-epoch training affect model performance.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
- Summarize and Generate to Back-translate: Unsupervised Translation of Programming Languages [86.08359401867577]
Back-translation is widely known for its effectiveness in neural machine translation when little to no parallel data is available.
We propose performing back-translation via code summarization and generation.
We show that our proposed approach performs competitively with state-of-the-art methods.
arXiv Detail & Related papers (2022-05-23T08:20:41Z)
- An Exploratory Study on Code Attention in BERT [8.488193857572211]
We investigate the attention behavior of pre-trained language models (PLMs) on code and compare it with natural language.
We show that BERT pays more attention to syntactic entities, specifically identifiers and separators, in contrast to the most-attended tokens in NLP.
These findings suggest the research community could benefit from code-specific representations instead of reusing the common embeddings used in NLP.
arXiv Detail & Related papers (2022-04-05T21:23:10Z)
- AstBERT: Enabling Language Model for Code Understanding with Abstract Syntax Tree [3.1087379479634927]
We propose AstBERT, a pre-trained language model that aims to better understand programming languages (PL) using the abstract syntax tree (AST).
Specifically, we collect a large corpus of source code (both Java and Python) from GitHub, from which information about the source code can be interpreted and integrated.
Experiment results show that our AstBERT model achieves state-of-the-art performance on both downstream tasks.
arXiv Detail & Related papers (2022-01-20T03:27:26Z)
- Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods for BERT, a pre-trained Transformer-based language model.
Our experimental results show that maximizing the approximate knowledge gain of the model improves performance.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters (a minimal layer-freezing sketch follows the related-papers list below).
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
- Incorporating External Knowledge through Pre-training for Natural Language to Code Generation [97.97049697457425]
Open-domain code generation aims to generate code in a general-purpose programming language from natural language (NL) intents.
We explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation.
Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa.
arXiv Detail & Related papers (2020-04-20T01:45:27Z)
- Exploring Neural Models for Parsing Natural Language into First-Order Logic [10.62143644603835]
We study the capability of neural models in parsing English sentences into First-Order Logic (FOL).
We model FOL parsing as a sequence-to-sequence mapping task: given a natural language sentence, an LSTM encodes it into an intermediate representation, and a decoder then sequentially generates the predicates of the corresponding FOL formula.
arXiv Detail & Related papers (2020-02-16T09:22:32Z)
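As referenced in the fine-tuning entry above, freezing lower layers is a simple way to cut the number of trainable parameters. The following is an illustrative sketch only, not that paper's exact recipe: it assumes a Hugging Face bert-base-uncased sequence-classification setup, and the choice of freezing the embeddings plus the eight lowest encoder layers is an assumption made for the example.

```python
# Illustrative sketch of freezing lower BERT layers before fine-tuning.
# The cutoff (8 frozen layers), model, and task are assumptions, not the
# referenced paper's exact setup.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

FROZEN_LAYERS = 8  # keep the top encoder layers and the classifier trainable

# Freeze the embedding layer.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False

# Freeze the lowest encoder layers.
for layer in model.bert.encoder.layer[:FROZEN_LAYERS]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} / {total:,}")

# Only the unfrozen parameters are handed to the optimizer.
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-5
)
```

Only parameters left with requires_grad set to True are passed to the optimizer, so the frozen layers receive no updates during fine-tuning.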