IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators
- URL: http://arxiv.org/abs/2403.03894v3
- Date: Mon, 15 Apr 2024 16:29:41 GMT
- Title: IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators
- Authors: Indraneil Paul, Goran Glavaš, Iryna Gurevych
- Abstract summary: This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer between different programming languages, language-specific data augmentation, and post-hoc LM adaptation, alongside exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer. To this end, we first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files coupled with respective intermediate representations. Next, starting from various base Code-LMs (ranging in size from 1.1B to 7.3B parameters), we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to (1) learn the IR language and (2) align the IR constructs with respective constructs of various programming languages. Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following.
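To make the training recipe above concrete, the sketch below shows how a single parallel (source, IR) example might be produced and serialized for continued causal language modelling. It is a minimal illustration only: obtaining LLVM IR via clang with -S -emit-llvm, the optimization level, and the <source>/<ir> separator tags are assumptions made for this sketch, not the exact SLTrans pipeline or IRCoder's training format.

```python
# Minimal sketch (assumed pipeline, not the paper's exact one): pair a self-contained
# source file with its compiler IR and serialize the pair for causal LM training.
import subprocess
import tempfile
from pathlib import Path


def compile_to_llvm_ir(source_path: str) -> str:
    """Compile a self-contained C/C++ file to textual LLVM IR using clang."""
    out = Path(tempfile.mkdtemp()) / "module.ll"
    subprocess.run(
        ["clang", "-S", "-emit-llvm", "-O1", source_path, "-o", str(out)],
        check=True,
    )
    return out.read_text()


def make_training_example(source_path: str) -> str:
    """Concatenate source and IR into one training sequence.

    The <source>/<ir> tags are illustrative placeholders, not IRCoder's format.
    """
    source = Path(source_path).read_text()
    ir = compile_to_llvm_ir(source_path)
    return f"<source>\n{source}\n<ir>\n{ir}"


if __name__ == "__main__":
    # Example usage on a hypothetical self-contained file.
    print(make_training_example("example.c")[:500])
```

Presenting a file together with its own IR in a single sequence is what gives the causal LM the opportunity both to learn the IR language and to align IR constructs with the source-level constructs they are lowered from.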
Related papers
- Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z)
- Bridging the Language Gap: Enhancing Multilingual Prompt-Based Code Generation in LLMs via Zero-Shot Cross-Lingual Transfer [5.355430735475281]
This paper investigates the complexities of multilingual prompt-based code generation.
Our evaluations reveal significant disparities in code quality for non-English prompts.
We propose a zero-shot cross-lingual approach using a neural projection technique.
arXiv Detail & Related papers (2024-08-19T05:11:46Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Exploring Multi-Lingual Bias of Large Code Models in Code Generation [55.336629780101475]
Code generation aims to synthesize code and fulfill functional requirements based on natural language (NL) specifications.
Despite their effectiveness, we observe a noticeable multilingual bias in the generation performance of large code models (LCMs).
LCMs demonstrate proficiency in generating solutions when provided with instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese.
arXiv Detail & Related papers (2024-04-30T08:51:49Z)
- UltraLink: An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset [69.33424532827608]
Open-source large language models (LLMs) have gained significant strength across diverse fields.
In this work, we construct an open-source multilingual supervised fine-tuning dataset.
The resulting UltraLink dataset comprises approximately 1 million samples across five languages.
arXiv Detail & Related papers (2024-02-07T05:05:53Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages [37.60016772021422]
Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa.
Recent studies have demonstrated the effectiveness of generative pre-training on computer programs, yet such models remain largely English-centric.
We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs.
arXiv Detail & Related papers (2022-12-13T17:21:44Z)