Code Representation Learning At Scale
- URL: http://arxiv.org/abs/2402.01935v1
- Date: Fri, 2 Feb 2024 22:19:15 GMT
- Title: Code Representation Learning At Scale
- Authors: Dejiao Zhang, Wasi Ahmad, Ming Tan, Hantian Ding, Ramesh Nallapati,
Dan Roth, Xiaofei Ma, Bing Xiang
- Abstract summary: We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language.
We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner.
- Score: 75.04686476303436
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent studies have shown that code language models at scale demonstrate
significant performance gains on downstream tasks, i.e., code generation.
However, most of the existing works on code representation learning train
models at a hundred million parameter scale using very limited pretraining
corpora. In this work, we fuel code representation learning with a vast amount
of code data via a two-stage pretraining scheme. We first train the encoders
via a mix that leverages both randomness in masking language modeling and the
structure aspect of programming language. We then enhance the representations
via contrastive learning with hard negative and hard positive constructed in an
unsupervised manner. We establish an off-the-shelf encoder model that
persistently outperforms the existing models on a wide variety of downstream
tasks by large margins. To comprehend the factors contributing to successful
code representation learning, we conduct detailed ablations and share our
findings on (i) a customized and effective token-level denoising scheme for
source code; (ii) the importance of hard negatives and hard positives; (iii)
how the proposed bimodal contrastive learning boost the cross-lingual semantic
search performance; and (iv) how the pretraining schemes decide the downstream
task performance scales with the model size.
Related papers
- DemoCraft: Using In-Context Learning to Improve Code Generation in Large Language Models [0.0]
We propose DemoCraft, which enhances code generation by leveraging in-context learning and demonstration selection.
Latent concept learning introduces additional concept tokens, which are trainable embeddings that capture task-specific knowledge.
Our experimental results demonstrate that the proposed system achieves an approximate 2x increase in the pass@k metric.
Our empirical studies indicate that our system attains nearly a 3x improvement in these metrics as well.
arXiv Detail & Related papers (2024-10-30T19:45:50Z) - EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983]
We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
Empirical results show that EmbedLLM outperforms prior methods in model routing both in accuracy and latency.
arXiv Detail & Related papers (2024-10-03T05:43:24Z) - Collaborative decoding of critical tokens for boosting factuality of
large language models [57.504894664689]
Finetuned and aligned models show improved abilities of instruction following and safe generation.
The common practice of using sampling during generation also increases chances of hallucination.
We introduce a collaborative decoding framework to harness the high factuality within pretrained models through the concept of critical tokens.
arXiv Detail & Related papers (2024-02-28T01:53:37Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - Enriching Source Code with Contextual Data for Code Completion Models:
An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call Pathways Language Model PaLM.
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.