MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks
- URL: http://arxiv.org/abs/2312.13322v2
- Date: Tue, 17 Sep 2024 16:29:03 GMT
- Title: MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks
- Authors: Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren,
- Abstract summary: A growing trend in AI for software development is to develop large language models (LLMs) to address a variety of programming tasks.
Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training.
This is partly because LLMs for HPC tasks are obtained by finetuning existing LLMs that support several natural and/or programming languages.
We build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes.
- Score: 5.125171374181664
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size and demand expensive compute resources for training. This is partly because LLMs for HPC tasks are obtained by finetuning existing LLMs that support several natural and/or programming languages. We found this design choice confusing - why do we need LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question choices made by existing LLMs by developing smaller language models (LMs) for specific domains - we call them domain-specific LMs. Specifically, we start with HPC as a domain and build an HPC-specific LM, named MonoCoder, which is orders of magnitude smaller than existing LMs but delivers better performance on non-HPC and HPC codes. Specifically, we pre-trained MonoCoder on an HPC-specific dataset (named HPCorpus) of C and C++ programs mined from GitHub. We evaluated the performance of MonoCoder against state-of-the-art multi-lingual LLMs. Results demonstrate that MonoCoder, although much smaller than existing LMs, outperforms other LLMs on normalized-perplexity tests (in relation to model size) while also delivering competing CodeBLEU scores for high-performance and parallel code generations. In other words, results suggest that MonoCoder understands HPC code better than state-of-the-art LLMs.
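The abstract's headline comparison is a normalized-perplexity test, i.e. perplexity judged in relation to model size. The exact normalization used in the paper is not given in this summary, so the following is a minimal sketch of one plausible way to measure it with Hugging Face transformers; the model name is a placeholder (MonoCoder's checkpoint identifier is not listed here), and dividing log-perplexity by the log of the parameter count is an assumed normalization, not necessarily the paper's formula.

```python
# Minimal sketch: perplexity of a code LM on a held-out C/C++ snippet,
# reported alongside parameter count so models of different sizes can be
# compared. The size normalization below is an assumption for illustration,
# not necessarily the paper's exact metric.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; MonoCoder's checkpoint name is not given in this summary

def perplexity(model, tokenizer, code: str) -> float:
    enc = tokenizer(code, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())  # exp of mean token cross-entropy

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    snippet = "for (int i = 0; i < n; ++i) { a[i] = b[i] + c[i]; }\n"
    ppl = perplexity(model, tokenizer, snippet)
    n_params = sum(p.numel() for p in model.parameters())
    # Assumed normalization: log-perplexity scaled by log(parameter count).
    norm_ppl = math.log(ppl) / math.log(n_params)
    print(f"ppl={ppl:.2f}  params={n_params}  size-normalized={norm_ppl:.4f}")
```

Under a size-aware score of this kind, a smaller model that matches a larger one's raw perplexity comes out ahead, which is the spirit of the comparison described in the abstract.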
Related papers
- Multi-Programming Language Sandbox for LLMs [78.99934332554963]
An out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compilers and analysis tools to Large Language Models (LLMs).
It automatically identifies the programming language of the code and compiles and executes it within an isolated sub-sandbox to ensure safety and stability.
arXiv Detail & Related papers (2024-10-30T14:46:43Z)
- Rome was Not Built in a Single Step: Hierarchical Prompting for LLM-based Chip Design [22.70660876673987]
Large Language Models (LLMs) are effective in computer hardware synthesis via hardware description language (HDL) generation.
However, LLM-assisted approaches for HDL generation struggle when handling complex tasks.
We introduce a suite of hierarchical prompting techniques which facilitate efficient stepwise design methods.
arXiv Detail & Related papers (2024-07-23T21:18:31Z)
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking long-sequence code generation into a Curriculum of Code Completion Subtasks.
FGO provides Fine-Grained Optimization by masking unexecuted code segments so that only executed code contributes to the model update.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- OMPGPT: A Generative Pre-trained Transformer Model for OpenMP [6.917568654215119]
OMPGPT is a novel domain-specific model meticulously designed to harness the inherent strengths of language models for OpenMP pragma generation.
We leverage prompt engineering techniques from the NLP domain to create Chain-of-OMP, an innovative strategy designed to enhance OMPGPT's effectiveness.
arXiv Detail & Related papers (2024-01-28T06:06:59Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- Scope is all you need: Transforming LLMs for HPC Code [5.0227775038998415]
We propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks.
Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure.
Results demonstrate that Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers (a hedged sketch of the underlying idea appears after this list).
arXiv Detail & Related papers (2023-08-18T10:12:03Z)
- Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback [61.83548032416181]
We present Okapi, the first system with instruction-tuned LLMs based on RLHF for multiple languages.
Okapi introduces instruction and response-ranked data in 26 diverse languages to facilitate the experiments and development of future multilingual LLM research.
arXiv Detail & Related papers (2023-07-29T18:01:46Z)
- HPC-Coder: Modeling Parallel Programs using Large Language Models [2.3101915391170573]
We show how large language models can be applied to tasks specific to high-performance and scientific codes.
We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models.
In our experiments, we show that this model can auto-complete HPC functions where generic models cannot.
arXiv Detail & Related papers (2023-06-29T19:44:55Z)
- LLM-Pruner: On the Structural Pruning of Large Language Models [65.02607075556742]
Large language models (LLMs) have shown remarkable capabilities in language understanding and generation.
We tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset.
Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures.
arXiv Detail & Related papers (2023-05-19T12:10:53Z)
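Relating to the Tokompiler entry above ("Scope is all you need"): the summary says the tokenizer leverages language primitives to produce language-oriented tokens, but does not spell out the algorithm. Below is a minimal, hedged sketch of the general idea of a primitive-aware preprocessing pass for C code, in which user identifiers and numeric literals are anonymized into generic placeholder tokens while keywords and operators are kept; the regexes, placeholder names, and keyword list are illustrative assumptions, not Tokompiler's actual implementation.

```python
# Minimal sketch of a primitive-aware preprocessing pass for C code:
# keep language keywords and operators, replace user identifiers and numeric
# literals with generic placeholder tokens (VAR_i, NUM_i). This illustrates
# the general idea behind language-oriented tokenization only.
import re

C_KEYWORDS = {
    "int", "float", "double", "char", "void", "for", "while", "if", "else",
    "return", "struct", "const", "unsigned", "long", "short", "sizeof",
}

def anonymize(code: str) -> str:
    var_ids, num_ids = {}, {}

    def sub_ident(m: re.Match) -> str:
        tok = m.group(0)
        if tok in C_KEYWORDS:
            return tok  # keep language primitives untouched
        return var_ids.setdefault(tok, f"VAR_{len(var_ids) + 1}")

    def sub_num(m: re.Match) -> str:
        tok = m.group(0)
        return num_ids.setdefault(tok, f"NUM_{len(num_ids) + 1}")

    # Identifiers first, then numeric literals, so placeholders are not re-rewritten.
    code = re.sub(r"\b[A-Za-z_]\w*\b", sub_ident, code)
    code = re.sub(r"\b\d+(\.\d+)?\b", sub_num, code)
    return code

if __name__ == "__main__":
    src = "for (int i = 0; i < 128; ++i) { sum += data[i] * 2; }"
    print(anonymize(src))
    # -> for (int VAR_1 = NUM_1; VAR_1 < NUM_2; ++VAR_1) { VAR_2 += VAR_3[VAR_1] * NUM_3; }
```

Anonymizing code this way preserves the structural signal (control flow, types, operators) while stripping project-specific vocabulary, which is one way a tokenizer can offer a more context-aware view of code structure.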