Can Large Language Models Write Parallel Code?
- URL: http://arxiv.org/abs/2401.12554v3
- Date: Tue, 14 May 2024 15:07:58 GMT
- Title: Can Large Language Models Write Parallel Code?
- Authors: Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele
- Abstract summary: Large language models are increasingly becoming a popular tool for software development.
In this paper, we study the capabilities of state-of-the-art language models to generate parallel code.
- Score: 0.5317767988097261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. In order to evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.
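The novel metrics mentioned in the abstract combine functional correctness with the runtime behavior of generated parallel code. As a rough illustration only (not the exact metric definitions from the paper), the sketch below computes the standard pass@k correctness estimator and a simple mean-speedup figure over a serial baseline; the sample numbers are hypothetical.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the chance that at least one of k samples
    drawn from n generations, c of which are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_speedup(serial_time: float, parallel_times: list) -> float:
    """Average speedup of functionally correct parallel generations over a
    serial baseline (illustrative only, not ParEval's exact definition)."""
    if not parallel_times:
        return 0.0
    return sum(serial_time / t for t in parallel_times) / len(parallel_times)

# Hypothetical numbers: 20 samples per prompt, 8 pass the tests, evaluate pass@5
print(round(pass_at_k(n=20, c=8, k=5), 2))           # 0.95
print(round(mean_speedup(2.0, [0.6, 0.5, 0.8]), 1))  # 3.3
```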
Related papers
- CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation [60.799992690487336]
We propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks.
CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gains for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions by leveraging Abstract Syntax Trees; a simplified sketch of this idea follows this entry.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
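As a simplified illustration of AST-based static checking (not the actual framework proposed in the paper), the snippet below parses a model completion with Python's built-in ast module and reports syntax errors plus names that are used but never bound; the sample completion is hypothetical.

```python
import ast
import builtins

def static_errors(code: str) -> list:
    """Report syntax errors and undefined names in a Python snippet."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"SyntaxError: {e.msg} (line {e.lineno})"]

    defined = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)
    return [f"undefined name: {name}" for name in sorted(used - defined)]

# Hypothetical model completion containing a misspelled builtin
print(static_errors("def mean(xs):\n    return sum(xs) / lenght"))
```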
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution; a toy illustration of program mutation follows this entry.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
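Mutation-based augmentation can be pictured as small, systematic edits to a program's AST. The toy sketch below (illustrative only, not the authors' pipeline) flips + and - operators in a Python function to produce a behaviorally different variant whose execution could then be traced.

```python
import ast

class FlipAddSub(ast.NodeTransformer):
    """Swap + and - in binary operations to produce a mutated program variant."""
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        elif isinstance(node.op, ast.Sub):
            node.op = ast.Add()
        return node

source = "def scale_shift(x, a, b):\n    return a * x + b"
mutant_tree = FlipAddSub().visit(ast.parse(source))
# The '+' becomes '-', yielding a new program to execute and trace.
print(ast.unparse(ast.fix_missing_locations(mutant_tree)))
```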
- xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval [32.60391966381949]
We introduce xCodeEval, the largest executable multilingual multitask benchmark to date.
It features a total of 7 tasks involving code understanding, generation, translation, and retrieval.
xCodeEval adopts an execution-based evaluation and offers a multilingual code execution engine, ExecEval; a minimal sketch of execution-based scoring follows this entry.
arXiv Detail & Related papers (2023-03-06T10:08:51Z)
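Execution-based evaluation of the kind ExecEval provides can be pictured as running each candidate program against input/output test cases in a subprocess. The harness below is a minimal, illustrative sketch with hypothetical test data, not the xCodeEval engine itself.

```python
import subprocess
import sys

def passes_tests(code: str, tests, timeout: float = 5.0) -> bool:
    """Run a candidate Python program on each (stdin, expected stdout) pair."""
    for stdin_data, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin_data, capture_output=True,
                text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True

# Hypothetical candidate program and test cases
candidate = "print(int(input()) * 2)"
print(passes_tests(candidate, [("3", "6"), ("10", "20")]))  # True
```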
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- A Scalable and Extensible Approach to Benchmarking NL2Code for 18 Programming Languages [1.6312827172331896]
We propose MultiPL-E, the first multi-language parallel benchmark for natural-language-to-code generation.
We evaluate two state-of-the-art code generation models on MultiPL-E: Codex and InCoder.
The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance.
arXiv Detail & Related papers (2022-08-17T11:16:52Z)
- AVATAR: A Parallel Corpus for Java-Python Program Translation [77.86173793901139]
Program translation refers to migrating source code from one language to another.
We present AVATAR, a collection of 9,515 programming problems and their solutions written in two popular languages, Java and Python.
arXiv Detail & Related papers (2021-08-26T05:44:20Z)
- Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches [0.0]
This paper compares different deep learning architectures to create and use language models based on programming code.
We discuss the strengths and weaknesses of each approach and the gaps we find when evaluating the language models or applying them in a real programming context.
arXiv Detail & Related papers (2020-09-16T15:17:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.