StarCoder 2 and The Stack v2: The Next Generation
- URL: http://arxiv.org/abs/2402.19173v1
- Date: Thu, 29 Feb 2024 13:53:35 GMT
- Title: StarCoder 2 and The Stack v2: The Next Generation
- Authors: Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel
Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang
Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada,
Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding
Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii,
Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli
He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang,
Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank
Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas
Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet,
Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary,
Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang,
Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries
- Abstract summary: We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens.
We thoroughly evaluate them on a comprehensive set of Code LLM benchmarks.
Our large model, StarCoder2-15B, significantly outperforms other models of comparable size.
- Score: 105.93298676368798
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The BigCode project, an open-scientific collaboration focused on the
responsible development of Large Language Models for Code (Code LLMs),
introduces StarCoder2. In partnership with Software Heritage (SWH), we build
The Stack v2 on top of the digital commons of their source code archive.
Alongside the SWH repositories spanning 619 programming languages, we carefully
select other high-quality data sources, such as GitHub pull requests, Kaggle
notebooks, and code documentation. This results in a training set that is 4x
larger than the first StarCoder dataset. We train StarCoder2 models with 3B,
7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate
them on a comprehensive set of Code LLM benchmarks. We find that our small
model, StarCoder2-3B, outperforms other Code LLMs of similar size on most
benchmarks, and also outperforms StarCoderBase-15B. Our large model,
StarCoder2-15B, significantly outperforms other models of comparable size. In
addition, it matches or outperforms CodeLlama-34B, a model more than twice its
size. Although DeepSeekCoder-33B is the best-performing model at code
completion for high-resource languages, we find that StarCoder2-15B outperforms
it on math and code reasoning benchmarks, as well as several low-resource
languages. We make the model weights available under an OpenRAIL license and
ensure full transparency regarding the training data by releasing the SoftWare
Heritage persistent IDentifiers (SWHIDs) of the source code data.
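Because the weights are released openly, a completion can be sampled with the standard Hugging Face transformers API. The following is a minimal sketch, not taken from the paper; the checkpoint ID "bigcode/starcoder2-3b" and the generation settings are assumptions and should be adjusted to the checkpoint actually in use.

```python
# Minimal usage sketch (assumptions noted above): load a StarCoder2 checkpoint
# from the Hugging Face Hub and sample a short left-to-right code completion.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"  # assumed Hub ID; 7B and 15B variants follow the same naming pattern
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)

# Prompt the model with the start of a function and let it complete the body.
prompt = "def fibonacci(n: int) -> int:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Greedy decoding is used here only to keep the sketch deterministic; sampling parameters would normally be tuned per task.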
Related papers
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data [64.69872638349922]
We present AlchemistCoder, a series of Code LLMs with enhanced code generation and generalization capabilities fine-tuned on multi-source data.
We propose incorporating the data construction process into the fine-tuning data as code comprehension tasks, including instruction evolution, data filtering, and code review.
arXiv Detail & Related papers (2024-05-29T16:57:33Z)
- DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence [42.517055368627226]
We introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens.
Our evaluations demonstrate that DeepSeek-Coder achieves state-of-the-art performance among open-source code models across multiple benchmarks.
DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.
arXiv Detail & Related papers (2024-01-25T14:17:53Z)
- BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models [27.772192759716116]
We present BioCoder, a benchmark developed to evaluate large language models (LLMs) in generating bioinformatics-specific code.
BioCoder spans much of the field, covering cross-file dependencies, class declarations, and global variables.
We show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations.
arXiv Detail & Related papers (2023-08-31T04:52:58Z)
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct [67.24653703564492]
We introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning.
Our model surpasses all other open-source Code LLMs by a substantial margin.
arXiv Detail & Related papers (2023-06-14T15:18:48Z)
- StarCoder: may the source be with you! [79.93915935620798]
The BigCode community introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length.
StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories.
arXiv Detail & Related papers (2023-05-09T08:16:42Z)
- SantaCoder: don't reach for the stars! [27.050410834027705]
The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.
We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark.
Our best model outperforms previous open-source multilingual code generation models in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E (a fill-in-the-middle prompt sketch follows this list).
arXiv Detail & Related papers (2023-01-09T10:52:35Z)
- A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)
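The SantaCoder and StarCoder entries above mention infilling alongside left-to-right generation. Below is a minimal sketch of fill-in-the-middle (FIM) prompting, assuming StarCoder-style special tokens (<fim_prefix>, <fim_suffix>, <fim_middle>); the exact token names vary between checkpoints and should be verified against the tokenizer in use.

```python
# Sketch of a fill-in-the-middle (FIM) prompt.
# Assumption: StarCoder-style special tokens; check the tokenizer of the
# specific checkpoint before relying on these names.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    # The model is asked to generate the code that belongs between `prefix` and `suffix`.
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prefix = "def add(a, b):\n    "
suffix = "\n    return result\n"
print(build_fim_prompt(prefix, suffix))
```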