Fugu-MT Paper Translation (Abstract): Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming

Paper overview: Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming

arxiv url: http://arxiv.org/abs/2601.02060v1
Date: Mon, 05 Jan 2026 12:33:37 GMT
Status: Translation complete
System update date: 2026-01-06 16:25:23.061673
Title: Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming
Title (reference translation): Perish or Flourish? A Holistic Evaluation of Large Language Models for Code Generation in Functional Programming
Authors: Nguyet-Anh H. Lang, Eric Lang, Thanh Le-Cong, Bach Le, Quyet-Thang Huynh
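Abstract summary: FPEval is an evaluation framework built on FPBench, a new benchmark of 721 programming tasks across three difficulty levels in three mainstream FP languages: Haskell, OCaml, and Scala. Using this framework, the authors evaluate state-of-the-art large language models (LLMs) for code generation in functional programming languages and in Java.
Reference score (independently computed attention score): 3.2230833657560503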
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Functional programming provides strong foundations for developing reliable and secure software systems, yet its adoption remains limited because of its steep learning curve. Recent advances in Large Language Models (LLMs) for code generation present new opportunities to lower these barriers. However, extensive evaluations of LLMs largely focus on imperative programming languages, and their capabilities in functional programming languages (FP) remain underexplored. To address this gap, we introduce FPEval, a holistic evaluation framework built on FPBench, a new benchmark of 721 programming tasks across three difficulty levels in three mainstream FP languages: Haskell, OCaml, and Scala. FPEval provides comprehensive evaluation infrastructure, combining test validation against comprehensive test suites with static analysis tools, to assess both functional correctness and code style and maintainability. Using this framework, we evaluate state-of-the-art LLMs, including GPT-3.5, GPT-4o, and GPT-5, for code generation in functional programming languages and Java as an imperative baseline. Our results demonstrate that LLM performance in functional programming improves substantially with model advancement; however, error rates remain significantly higher in purely functional languages (Haskell and OCaml) than in hybrid (Scala) or imperative (Java) languages. Moreover, LLMs frequently generate non-idiomatic functional code that follows imperative patterns, raising concerns about code style and long-term maintainability. Finally, we show that LLMs can partially self-repair both correctness and quality issues when provided with static analysis feedback and hand-crafted instructions for common types of issues.
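Abstract (reference translation): Functional programming offers a strong foundation for building reliable and secure software systems, but it has not been widely adopted because of its steep learning curve. Recent progress in large language models (LLMs) for code generation offers a new opportunity to lower these barriers. However, extensive evaluations of LLMs concentrate on imperative programming languages, and their capabilities in functional programming languages (FP) are still underexplored. To address this gap, the paper introduces FPEval, a holistic evaluation framework built on FPBench, a new benchmark of 721 programming tasks across three difficulty levels in three mainstream FP languages: Haskell, OCaml, and Scala. FPEval provides an integrated evaluation infrastructure with test validation backed by comprehensive test suites and with static analysis tools that assess functional correctness as well as code style and maintainability. The paper evaluates state-of-the-art LLMs such as GPT-3.5, GPT-4o, and GPT-5 on code generation for functional programming languages, with Java as an imperative baseline. The results show that LLM performance in functional programming improves substantially as models advance, but error rates remain significantly higher in purely functional languages (Haskell and OCaml) than in hybrid (Scala) or imperative (Java) languages. In addition, LLMs often generate non-idiomatic functional code that follows imperative patterns, which raises concerns about code style and long-term maintainability. Finally, the paper shows that LLMs can partially self-repair both correctness and quality issues when given static analysis feedback and hand-crafted instructions for common issue types.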
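
Illustration (not from the paper): the abstract's point about non-idiomatic functional code that follows imperative patterns can be sketched in Scala, one of the benchmark languages. The task, the object name StyleContrast, and both function names below are hypothetical and chosen only for this example; they are not taken from FPBench or FPEval. Both functions return the same result, so a test suite alone would not tell them apart, which is presumably why a style-oriented check is evaluated alongside correctness.

```scala
object StyleContrast {
  // Imperative pattern: a mutable accumulator and an explicit loop.
  // Functionally correct, but not idiomatic Scala.
  def sumOfEvenSquaresImperative(xs: List[Int]): Int = {
    var total = 0
    var rest = xs
    while (rest.nonEmpty) {
      val x = rest.head
      if (x % 2 == 0) total += x * x
      rest = rest.tail
    }
    total
  }

  // Idiomatic functional style: no mutation, only collection combinators.
  def sumOfEvenSquaresFunctional(xs: List[Int]): Int =
    xs.filter(_ % 2 == 0).map(x => x * x).sum
}
```

Calling either function on List(1, 2, 3, 4) keeps the even elements 2 and 4, squares them, and adds them up.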

Related papers