SantaCoder: don't reach for the stars!
- URL: http://arxiv.org/abs/2301.03988v1
- Date: Mon, 9 Jan 2023 10:52:35 GMT
- Title: SantaCoder: don't reach for the stars!
- Authors: Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou,
Christopher Akiki, Carlos Munoz Ferrandis, Niklas Muennighoff, Mayank Mishra,
Alex Gu, Manan Dey, Logesh Kumar Umapathi, Carolyn Jane Anderson, Yangtian
Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin, Dmitry Abulkhanov,
Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo García del
Río, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu,
Paulo Villegas, Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen,
Danish Contractor, Luis Villa, Jia Li, Dzmitry Bahdanau, Yacine Jernite, Sean
Hughes, Daniel Fried, Arjun Guha, Harm de Vries, Leandro von Werra
- Abstract summary: The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.
We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark.
Our best model outperforms previous open-source multilingual code generation models in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E.
- Score: 27.050410834027705
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The BigCode project is an open-scientific collaboration working on the
responsible development of large language models for code. This tech report
describes the progress of the collaboration until December 2022, outlining the
current state of the Personally Identifiable Information (PII) redaction
pipeline, the experiments conducted to de-risk the model architecture, and the
experiments investigating better preprocessing methods for the training data.
We train 1.1B parameter models on the Java, JavaScript, and Python subsets of
The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find
that more aggressive filtering of near-duplicates can further boost performance
and, surprisingly, that selecting files from repositories with 5+ GitHub stars
deteriorates performance significantly. Our best model outperforms previous
open-source multilingual code generation models (InCoder-6.7B and
CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java,
JavaScript, and Python portions of MultiPL-E, despite being a substantially
smaller model. All models are released under an OpenRAIL license at
https://hf.co/bigcode.
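Since the abstract highlights both left-to-right generation and infilling, a minimal usage sketch may help illustrate the two modes. This is a sketch only: the checkpoint name bigcode/santacoder and the fill-in-the-middle (FIM) sentinel tokens are assumptions about the released artifacts, not details stated in this abstract.
```python
# Minimal sketch (assumed checkpoint name and FIM tokens, not the paper's official recipe)
# of running a SantaCoder-style model in both modes with Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/santacoder"  # assumed Hugging Face model id under https://hf.co/bigcode
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

# Left-to-right generation: continue a prompt.
prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))

# Infilling: prefix and suffix are given, the model generates the missing middle.
# The sentinel tokens below are assumptions about the released tokenizer.
prefix = "def print_hello():\n    "
suffix = "\n    return None\n"
fim_prompt = f"<fim-prefix>{prefix}<fim-suffix>{suffix}<fim-middle>"
inputs = tokenizer(fim_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```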
Related papers
- StarCoder 2 and The Stack v2: The Next Generation [105.93298676368798]
We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens.
We thoroughly evaluate them on a comprehensive set of Code LLM benchmarks.
Our large model, StarCoder2-15B, significantly outperforms other models of comparable size.
arXiv Detail & Related papers (2024-02-29T13:53:35Z)
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct [67.24653703564492]
We introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning.
Our model surpasses all other open-source Code LLMs by a substantial margin.
arXiv Detail & Related papers (2023-06-14T15:18:48Z)
- StarCoder: may the source be with you! [79.93915935620798]
The BigCode community introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length.
StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories.
arXiv Detail & Related papers (2023-05-09T08:16:42Z)
- Enriching Source Code with Contextual Data for Code Completion Models: An Empirical Study [4.438873396405334]
We aim to answer whether making code easier to understand through using contextual data improves the performance of pre-trained code language models for the task of code completion.
For comments, we find that the models perform better in the presence of multi-line comments.
arXiv Detail & Related papers (2023-04-24T17:09:14Z)
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X [50.008474888951525]
We introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation.
CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages.
arXiv Detail & Related papers (2023-03-30T17:34:01Z)
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
arXiv Detail & Related papers (2022-04-12T16:25:26Z)
- A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)