SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code
Translation
- URL: http://arxiv.org/abs/2310.15539v2
- Date: Fri, 15 Dec 2023 06:40:41 GMT
- Title: SteloCoder: a Decoder-Only LLM for Multi-Language to Python Code
Translation
- Authors: Jialing Pan, Adrien Sadé, Jin Kim, Eric Soriano, Guillem Sole,
Sylvain Flamant
- Abstract summary: We introduce SteloCoder, a decoder-only StarCoder-based system for language-to-Python code translation.
SteloCoder achieves C++, C#, JavaScript, Java, or PHP-to-Python code translation without specifying the input programming language.
With experiments on XLCoST, SteloCoder achieves an average of 73.76 CodeBLEU score in multi-programming language-to-Python translation.
- Score: 1.7183449183902841
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the recent focus on Large Language Models (LLMs), both StarCoder (Li et
al., 2023) and Code Llama (Rozière et al., 2023) have demonstrated remarkable
performance in code generation. However, there is still a need for improvement
in code translation functionality with efficient training techniques. In
response to this, we introduce SteloCoder, a decoder-only StarCoder-based LLM
designed specifically for multi-programming language-to-Python code
translation. In particular, SteloCoder achieves C++, C#, JavaScript, Java, or
PHP-to-Python code translation without specifying the input programming
language. We modified the StarCoder model architecture by incorporating a
Mixture-of-Experts (MoE) technique featuring five experts and a gating network
for multi-task handling. Experts are obtained by StarCoder fine-tuning.
Specifically, we use the Low-Rank Adaptation (LoRA) technique, limiting each
expert's size to only 0.06% of StarCoder's parameter count. At the same
time, to reduce training time, we adopt a curriculum learning strategy and
use self-instruct data for efficient fine-tuning. As a
result, each expert takes only 6 hours to train on a single 80GB A100 HBM.
With experiments on XLCoST datasets, SteloCoder achieves an average of 73.76
CodeBLEU score in multi-programming language-to-Python translation, surpassing
the top performance from the leaderboard by at least 3.5. This accomplishment
is attributed to only 45M extra parameters with StarCoder as the backbone and
32 hours of valid training on one 80GB A100 HBM. The source code is released
here: https://github.com/sade-adrien/SteloCoder.
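The abstract describes a gating network that routes among five LoRA experts attached to a frozen StarCoder backbone. A minimal NumPy sketch of that idea follows; it is an illustration under stated assumptions, not the authors' implementation. The toy sizes, the rank, the mean-pooled linear gate, the top-1 routing, and every name used (`moe_lora_forward`, `gate`, `lora_param_count`) are hypothetical. The final lines check the abstract's figures: 0.06% of 15.5B parameters (StarCoder's size, per the StarCoder entry under Related papers) is about 9.3M per expert, so five experts come to about 46.5M, in the same range as the reported 45M extra parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64     # toy hidden size (assumption; the real backbone is far larger)
RANK = 4         # toy LoRA rank (assumption; not stated in the abstract)
N_EXPERTS = 5    # five experts, as stated in the abstract

# Frozen weight standing in for one projection matrix of the backbone.
W_frozen = rng.normal(size=(D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)

# One low-rank pair (A, B) per expert; only these would be trained.
experts = [
    (rng.normal(size=(RANK, D_MODEL)) * 0.01,   # A: projects down to RANK
     rng.normal(size=(D_MODEL, RANK)) * 0.01)   # B: projects back up
    for _ in range(N_EXPERTS)
]

# Hypothetical gate: a linear classifier over the mean-pooled hidden states.
W_gate = rng.normal(size=(N_EXPERTS, D_MODEL)) * 0.1


def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def gate(hidden):
    """Map hidden states (seq_len, D_MODEL) to expert probabilities."""
    pooled = hidden.mean(axis=0)
    return softmax(W_gate @ pooled)


def moe_lora_forward(hidden):
    """Apply the frozen layer plus the LoRA update of the selected expert."""
    probs = gate(hidden)
    k = int(np.argmax(probs))  # top-1 routing (assumption)
    A, B = experts[k]
    # Standard LoRA form: x W^T + (x A^T) B^T, with W kept frozen.
    out = hidden @ W_frozen.T + (hidden @ A.T) @ B.T
    return out, k, probs


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Arithmetic check of the abstract's figures, in exact integer math.
backbone_params = 15_500_000_000                   # StarCoder: 15.5B
per_expert = backbone_params * 6 // 10_000         # 0.06% of the backbone
all_experts = N_EXPERTS * per_expert

hidden = rng.normal(size=(10, D_MODEL))            # ten toy token states
out, chosen, probs = moe_lora_forward(hidden)
print(out.shape, chosen, per_expert, all_experts)
```

Because the input language is not given to the model, the gate has to infer it from the input itself; here that is approximated by classifying the pooled hidden states, which is only one of several plausible designs.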
Related papers
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEval-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- Code Llama: Open Foundation Models for Code [93.30115424203868]
We release Code Llama, a family of large language models for code based on Llama 2.
Code Llama reaches state-of-the-art performance among open models on several code benchmarks.
We release Code Llama under a permissive license that allows for both research and commercial use.
arXiv Detail & Related papers (2023-08-24T17:39:13Z)
- Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs [2.9242435458494445]
This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data.
We apply this approach to generate tens of thousands of validated training items for Julia, Lua, OCaml, R, and Racket.
arXiv Detail & Related papers (2023-08-19T03:19:01Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- StarCoder: may the source be with you! [79.93915935620798]
The BigCode community introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length.
StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories.
arXiv Detail & Related papers (2023-05-09T08:16:42Z)
- CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X [50.008474888951525]
We introduce CodeGeeX, a multilingual model with 13 billion parameters for code generation.
CodeGeeX is pre-trained on 850 billion tokens of 23 programming languages.
arXiv Detail & Related papers (2023-03-30T17:34:01Z)
- SantaCoder: don't reach for the stars! [27.050410834027705]
The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code.
We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark.
Our best model outperforms previous open-source multilingual code generation models in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E.
arXiv Detail & Related papers (2023-01-09T10:52:35Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)