Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study
- URL: http://arxiv.org/abs/2403.15185v1
- Date: Fri, 22 Mar 2024 13:13:13 GMT
- Title: Investigating the Performance of Language Models for Completing Code in Functional Programming Languages: a Haskell Case Study
- Authors: Tim van Dam, Frank van der Heijden, Philippe de Bekker, Berend Nieuwschepen, Marc Otten, Maliheh Izadi
- Abstract summary: We evaluate two language models for code, CodeGPT and UniXcoder, on the functional programming language Haskell.
We fine-tune and evaluate the models on Haskell functions sourced from a publicly accessible Haskell dataset on HuggingFace.
Our automatic evaluation shows that knowledge of imperative programming languages in the pre-training of LLMs may not transfer well to functional languages.
- Score: 2.792812922172466
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Language model-based code completion models have quickly grown in use, helping thousands of developers write code in many different programming languages. However, research on code completion models typically focuses on imperative languages such as Python and JavaScript, which results in a lack of representation for functional programming languages. Consequently, these models often perform poorly on functional languages such as Haskell. To investigate whether this can be alleviated, we evaluate the performance of two language models for code, CodeGPT and UniXcoder, on the functional programming language Haskell. We fine-tune and evaluate the models on Haskell functions sourced from a publicly accessible Haskell dataset on HuggingFace. Additionally, we manually evaluate the models using our novel translated HumanEval dataset. Our automatic evaluation shows that knowledge of imperative programming languages in the pre-training of LLMs may not transfer well to functional languages, but that code completion on functional languages is feasible. Consequently, this shows the need for more high-quality Haskell datasets. A manual evaluation on HumanEval-Haskell indicates CodeGPT frequently generates empty predictions and extra comments, while UniXcoder more often produces incomplete or incorrect predictions. Finally, we release HumanEval-Haskell, along with the fine-tuned models and all code required to reproduce our experiments on GitHub (https://github.com/AISE-TUDelft/HaskellCCEval).
Related papers
- Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming [3.2230833657560503]
We introduce FPEval, a new benchmark of 721 programming tasks across three difficulty levels on three mainstream programming languages: Haskell, OCaml, and Scala. Using this framework, we evaluate state-of-the-art Large Language Models (LLMs) for code generation in functional programming languages and Java.
arXiv Detail & Related papers (2026-01-05T12:33:37Z)
- Functional Python Programming in Introductory Computer Science Courses [1.8139737455709233]
We present a "best practice" idea in introductory programming classes that forces students to learn and complete programming assignments in a purely functional subset of Python. By doing so, the student can learn functional ideas such as immutability, pure functions with no side effects, and stateless programming.
arXiv Detail & Related papers (2025-12-03T06:39:08Z)
- Type-Constrained Code Generation with Language Models [51.03439021895432]
Large language models (LLMs) frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code.
We introduce a type-constrained decoding approach that leverages type systems to guide code generation.
Our approach reduces compilation errors by more than half and increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z)
- CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution [50.7413285637879]
The CRUXEVAL-X code reasoning benchmark contains 19 programming languages.
It comprises at least 600 subjects for each language, along with 19K content-consistent tests in total.
Even a model trained solely on Python can achieve at most 34.4% Pass@1 in other languages.
arXiv Detail & Related papers (2024-08-23T11:43:00Z)
- Can Large Language Models Write Parallel Code? [0.5317767988097261]
Large language models are increasingly becoming a popular tool for software development.
In this paper, we study the capabilities of state-of-the-art language models to generate parallel code.
arXiv Detail & Related papers (2024-01-23T08:25:12Z)
- GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization in Programming Language Understanding [5.9535699822923]
We propose a new benchmark dataset called GenCodeSearchNet (GeCS) to evaluate the programming language understanding capabilities of language models.
As part of the full dataset, we introduce a new, manually curated subset StatCodeSearch that focuses on R, a popular but so far underrepresented programming language.
For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models.
arXiv Detail & Related papers (2023-11-16T09:35:00Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
- A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
However, static analysis tools such as linters, which can detect errors without running the program, have not been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z)
- CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code [75.08995072899594]
We propose CodeBERTScore: an evaluation metric for code generation.
CodeBERTScore encodes the natural language input preceding the generated code.
We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics.
arXiv Detail & Related papers (2023-02-10T22:12:05Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z)