CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
- URL: http://arxiv.org/abs/2305.02309v2
- Date: Tue, 11 Jul 2023 21:11:23 GMT
- Title: CodeGen2: Lessons for Training LLMs on Programming and Natural Languages
- Authors: Erik Nijkamp, Hiroaki Hayashi, Caiming Xiong, Silvio Savarese, Yingbo Zhou
- Abstract summary: We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, causal language modeling, span corruption, and infilling are unified into a simple learning algorithm; for infill sampling, we explore the claim of a "free lunch" hypothesis.
For data distributions, we study how a mixture of programming and natural languages and multi-epoch training affect model performance.
- Score: 116.74407069443895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable abilities in
representation learning for program synthesis and understanding tasks. The
quality of the learned representations appears to be dictated by neural scaling
laws as a function of the number of model parameters and observations, while the
amount of available data and compute, both of which are costly, imposes an upper
bound on model performance.
In this study, we attempt to render the training of LLMs for program
synthesis more efficient by unifying four key components: (1) model
architectures, (2) learning methods, (3) infill sampling, and (4) data
distributions. Specifically, for the model architecture, we attempt to unify
encoder and decoder-based models into a single prefix-LM. For learning methods,
(i) causal language modeling, (ii) span corruption, and (iii) infilling are unified
into a simple learning algorithm. For infill sampling, we explore the claim of
a "free lunch" hypothesis. For data distributions, the effect of a mixture
distribution and multi-epoch training of programming and natural languages on
model performance is explored.
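A minimal sketch of how such objectives can be folded into one data pipeline is shown below; it assumes illustrative sentinel tokens ("<mask_1>", "<eod>") and a uniform span-sampling rule, so the exact sentinels, span distribution, and prefix-LM masking used by CodeGen2 may differ.
```python
# Hedged sketch (not the paper's exact recipe): each training example is either a
# plain causal-LM sequence or a span-corruption/infill sequence, so a single
# prefix-LM can be trained on both objectives. Sentinel names are assumptions.
import random

SPAN_SENTINEL = "<mask_1>"  # assumed sentinel marking the removed span
EOD = "<eod>"               # assumed end-of-document marker

def make_example(tokens, infill_prob=0.5, rng=random):
    """Return (source, target) token lists for one training sequence.

    With probability infill_prob, a contiguous span is cut out and appended after
    a sentinel (fill-in-the-middle / span corruption); otherwise the sequence
    stays a plain causal-LM example. In a prefix-LM, the source part may attend
    bidirectionally while the target part is predicted left to right.
    """
    if len(tokens) < 4 or rng.random() > infill_prob:
        return [], list(tokens) + [EOD]            # plain causal language modeling
    start = rng.randint(1, len(tokens) - 2)        # keep a non-empty prefix
    end = rng.randint(start + 1, len(tokens) - 1)  # keep a non-empty suffix
    prefix, middle, suffix = tokens[:start], tokens[start:end], tokens[end:]
    source = prefix + [SPAN_SENTINEL] + suffix     # prefix <mask_1> suffix
    target = [SPAN_SENTINEL] + middle + [EOD]      # <mask_1> middle <eod>
    return source, target

# Example: force an infill sample so the middle of the snippet moves to the end.
src, tgt = make_example(list("def add(a, b): return a + b"), infill_prob=1.0)
print("".join(src))
print("".join(tgt))
```
Mixing the two example types in a single training stream is one way to read the "free lunch" question, i.e. whether infill capability can be gained at little or no cost to ordinary left-to-right generation.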
We conduct a comprehensive series of empirical experiments on 1B LLMs, for
which failures and successes of this exploration are distilled into five
lessons. We will provide a final recipe for training and release CodeGen2
models in sizes of 1B, 3.7B, 7B, and 16B parameters, along with the training
framework as open-source: https://github.com/salesforce/CodeGen.
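For readers who want to try the released checkpoints, a hedged usage sketch with Hugging Face transformers follows; the model ID "Salesforce/codegen2-1B" and the use of trust_remote_code are assumptions based on common Hub conventions, so consult the linked repository for the authoritative loading instructions.
```python
# Hedged usage sketch: sampling a completion from a released CodeGen2 checkpoint
# via Hugging Face transformers. The model ID below is an assumption; see
# https://github.com/salesforce/CodeGen for the published checkpoint names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Salesforce/codegen2-1B"  # assumed Hub name for the 1B model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)  # greedy completion
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```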
Related papers
- EmbedLLM: Learning Compact Representations of Large Language Models [28.49433308281983] (2024-10-03)
  We propose EmbedLLM, a framework designed to learn compact vector representations of Large Language Models.
  We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness.
  Empirical results show that EmbedLLM outperforms prior methods in model routing in both accuracy and latency.
- Code Representation Learning At Scale [75.04686476303436] (2024-02-02)
  We fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme.
  We first train the encoders via a mix that leverages both randomness in masked language modeling and the structural aspect of programming languages.
  We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner.
- In-Context Language Learning: Architectures and Algorithms [73.93205821154605] (2024-01-23)
  We study in-context learning (ICL) through the lens of a new family of model problems we term in-context language learning (ICLL).
  We evaluate a diverse set of neural sequence models on regular ICLL tasks.
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624] (2023-09-20)
  We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
  We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
  This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
- Language models are weak learners [71.33837923104808] (2023-06-25)
  We show that prompt-based large language models can operate effectively as weak learners.
  We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting.
  Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742] (2023-05-19)
  Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
  We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
  Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
This list is automatically generated from the titles and abstracts of the papers on this site.