CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
- URL: http://arxiv.org/abs/2310.06266v2
- Date: Wed, 10 Jan 2024 19:59:03 GMT
- Title: CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model
- Authors: Peng Di, Jianguo Li, Hang Yu, Wei Jiang, Wenting Cai, Yang Cao, Chaoyu
Chen, Dajun Chen, Hongwei Chen, Liang Chen, Gang Fan, Jie Gong, Zi Gong, Wen
Hu, Tingting Guo, Zhichao Lei, Ting Li, Zheng Li, Ming Liang, Cong Liao,
Bingchang Liu, Jiachen Liu, Zhiwei Liu, Shaojun Lu, Min Shen, Guangpei Wang,
Huan Wang, Zhi Wang, Zhaogui Xu, Jiawei Yang, Qing Ye, Gehao Zhang, Yu Zhang,
Zelin Zhao, Xunjin Zheng, Hailian Zhou, Lifu Zhu, Xianying Zhu
- Abstract summary: This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset.
- Score: 58.127534002232096
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code Large Language Models (Code LLMs) have gained significant attention in
the industry due to their wide applications in the full lifecycle of software
engineering. However, the effectiveness of existing models in understanding
non-English inputs for multi-lingual code-related tasks is still far from well
studied. This paper introduces CodeFuse-13B, an open-sourced pre-trained code
LLM. It is specifically designed for code-related tasks with both English and
Chinese prompts and supports over 40 programming languages. CodeFuse achieves
its effectiveness by utilizing a high quality pre-training dataset that is
carefully filtered by program analyzers and optimized during the training
process. Extensive experiments are conducted using real-world usage scenarios,
the industry-standard benchmark HumanEval-x, and the specially designed
CodeFuseEval for Chinese prompts. To assess the effectiveness of CodeFuse, we
actively collected valuable human feedback from the AntGroup's software
development process where CodeFuse has been successfully deployed. The results
demonstrate that CodeFuse-13B achieves a HumanEval pass@1 score of 37.10%,
positioning it as one of the top multi-lingual code LLMs with similar parameter
sizes. In practical scenarios, such as code generation, code translation, code
comments, and testcase generation, CodeFuse performs better than other models
when confronted with Chinese prompts.
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - Unraveling the Potential of Large Language Models in Code Translation: How Far Are We? [4.616570111453259]
Large language models (LLMs) exhibit state-of-the-art performance in various tasks, but struggle for code translation.
We conduct a large-scale empirical study to exploit the capabilities and incapabilities of LLMs in code translation tasks.
We propose two methods: (1) intermediary translation which selects an intermediary language between the source and target ones; and (2) self-training which fine-tunes LLMs on self-generated parallel data.
arXiv Detail & Related papers (2024-10-13T12:20:12Z) - CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z) - Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z) - CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z) - Exploring Large Language Models for Code Explanation [3.2570216147409514]
Large Language Models (LLMs) have made remarkable strides in Natural Language Processing.
This study specifically delves into the task of generating natural-language summaries for code snippets, using various LLMs.
arXiv Detail & Related papers (2023-10-25T14:38:40Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z) - ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for
Programming Languages [37.60016772021422]
Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa.
Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric.
We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs.
arXiv Detail & Related papers (2022-12-13T17:21:44Z) - ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.