MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code Completion
- URL: http://arxiv.org/abs/2212.09666v1
- Date: Mon, 19 Dec 2022 17:50:05 GMT
- Title: MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code Completion
- Authors: Zi Gong, Yinpeng Guo, Pingyi Zhou, Cuiyun Gao, Yasheng Wang, Zenglin Xu
- Abstract summary: We propose MultiCoder to enhance low-resource code completion via MultiPL pre-training and MultiPL Mixture-of-Experts (MoE) layers.
We also propose a novel PL-level MoE routing strategy (PL-MoE) to improve code completion on all PLs.
- Score: 21.100570496144694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code completion is a valuable topic in both academia and industry. Recently,
large-scale mono-programming-lingual (MonoPL) pre-training models have been
proposed to boost the performance of code completion. However, code completion
on low-resource programming languages (PLs) is difficult for the data-driven
paradigm, even though plenty of developers use low-resource PLs. On the other
hand, few studies have explored the effects of multi-programming-lingual
(MultiPL) pre-training on code completion, especially its impact on low-resource
programming languages. To this end, we propose MultiCoder to enhance
low-resource code completion via MultiPL pre-training and MultiPL
Mixture-of-Experts (MoE) layers. We further propose a novel PL-level MoE routing
strategy (PL-MoE) for improving code completion on all PLs. Experimental results
on CodeXGLUE and MultiCC demonstrate that 1) the proposed MultiCoder
significantly outperforms the MonoPL baselines on low-resource programming
languages, and 2) the PL-MoE module further boosts the performance on six
programming languages. In addition, we analyze the effects of the proposed
method in detail and explore its effectiveness in a variety of scenarios.
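The abstract above gives no implementation details, so the following is a minimal, hypothetical sketch of what a PL-level MoE feed-forward layer with a shared expert could look like. The class name PLMoEFeedForward, the per-sample pl_ids routing signal, the shared-expert design, and all dimensions are illustrative assumptions, not the authors' actual architecture.

```python
# Hypothetical sketch of PL-level MoE routing (not the MultiCoder code):
# every sample is routed to the expert of its programming language, and a
# shared expert contributes PL-agnostic features for all samples.
import torch
import torch.nn as nn


class PLMoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_pls: int):
        super().__init__()
        # One feed-forward expert per programming language (assumed design).
        self.pl_experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_pls)
        ])
        # A shared expert used by every PL.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, hidden: torch.Tensor, pl_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, d_model); pl_ids: (batch,) PL index per sample.
        expert_out = torch.zeros_like(hidden)
        for pl in pl_ids.unique().tolist():
            sample_mask = pl_ids == pl  # route whole samples by their PL, not per token
            expert_out[sample_mask] = self.pl_experts[pl](hidden[sample_mask])
        return self.shared_expert(hidden) + expert_out


# Usage: a mixed batch where two samples come from PL 0 and two from PL 3.
layer = PLMoEFeedForward(d_model=256, d_ff=1024, num_pls=6)
tokens = torch.randn(4, 128, 256)
pl_ids = torch.tensor([0, 0, 3, 3])
print(layer(tokens, pl_ids).shape)  # torch.Size([4, 128, 256])
```

Routing each sample by its language ID rather than through a learned token-level gate is only one plausible reading of "PL-level" routing; the paper's actual gating and expert-sharing scheme may differ.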
Related papers
- Code-mixed LLM: Improve Large Language Models' Capability to Handle Code-Mixing through Reinforcement Learning from AI Feedback [11.223762031003671]
Code-mixing introduces unique challenges in daily life, such as syntactic mismatches and semantic blending.
Large language models (LLMs) have revolutionized the field of natural language processing (NLP) by offering unprecedented capabilities in understanding human languages.
We propose to improve the multilingual LLMs' ability to understand code-mixing through reinforcement learning from human feedback (RLHF) and code-mixed machine translation tasks.
arXiv Detail & Related papers (2024-11-13T22:56:00Z)
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Bridge-Coder: Unlocking LLMs' Potential to Overcome Language Gaps in Low-Resource Code [31.48411893252137]
Large Language Models (LLMs) demonstrate strong proficiency in generating code for high-resource programming languages (HRPLs) like Python but struggle significantly with low-resource programming languages (LRPLs) such as Racket or D.
This performance gap deepens the digital divide, preventing developers using LRPLs from benefiting equally from LLM advancements and reinforcing disparities in innovation within underrepresented programming communities.
We introduce a novel approach called Bridge-Coder, which leverages LLMs' intrinsic capabilities to enhance the performance on LRPLs.
arXiv Detail & Related papers (2024-10-24T17:55:03Z)
- A Pair Programming Framework for Code Generation via Multi-Plan Exploration and Feedback-Driven Refinement [24.25119206488625]
PairCoder is a novel framework for large language models (LLMs) to generate code.
It incorporates two collaborative agents, namely a Navigator agent for high-level planning and a Driver agent for specific implementation.
The Driver follows the guidance of the Navigator to undertake initial code generation, code testing, and refinement.
arXiv Detail & Related papers (2024-09-08T07:22:19Z)
- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation [70.22782550540714]
We introduce a Quantization-aware Scale LeArning method based on multimodal Warmup, termed QSLAW.
arXiv Detail & Related papers (2024-08-07T12:42:09Z)
- DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning [36.78560777629329]
We introduce a diverse instruction model (DolphCoder) with self-evaluation for code generation.
It learns diverse instruction targets and combines a code evaluation objective to enhance its code generation ability.
Our model achieves superior performance on the HumanEval and MBPP benchmarks.
arXiv Detail & Related papers (2024-02-14T12:34:58Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only by masking the unexecuted code segments to provide Fine-Grained Optimization (a minimal sketch of this masking idea appears after this list).
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches on the corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high-quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z)
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- Execution-based Code Generation using Deep Reinforcement Learning [8.085533911328577]
PPOCoder is a new framework for code generation that combines pre-trained PL models with Proximal Policy Optimization.
PPOCoder seamlessly integrates external code-specific knowledge into the model optimization process.
Notably, PPOCoder is a task-agnostic and model-agnostic framework that can be used across different code generation tasks and PLs.
arXiv Detail & Related papers (2023-01-31T18:02:26Z)
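For the StepCoder entry above, the fine-grained optimization idea of masking unexecuted code segments out of the loss can be illustrated with a short sketch. Everything below, including the function name masked_code_loss and the executed_mask input, is an assumption for illustration rather than the StepCoder authors' code.

```python
# Hypothetical sketch: only tokens belonging to code segments that were
# actually executed by the tests contribute to the training loss.
import torch
import torch.nn.functional as F


def masked_code_loss(logits: torch.Tensor,
                     target_ids: torch.Tensor,
                     executed_mask: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); target_ids, executed_mask: (batch, seq).
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    mask = executed_mask.float()
    # Average the loss over executed tokens only; unexecuted segments are ignored.
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


# Usage with random stand-ins: the tail of the first sample was never executed.
logits = torch.randn(2, 10, 100, requires_grad=True)
targets = torch.randint(0, 100, (2, 10))
executed = torch.tensor([[1] * 6 + [0] * 4, [1] * 10])
loss = masked_code_loss(logits, targets, executed)
loss.backward()
print(float(loss))
```

In this reading, tokens whose segments never ran under the tests contribute zero loss, so the gradient only reflects code that was actually executed; the paper's exact masking granularity may differ.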