Fugu-MT 論文翻訳(概要): VERINA: Benchmarking Verifiable Code Generation

論文の概要: VERINA: Benchmarking Verifiable Code Generation

arxiv url: http://arxiv.org/abs/2505.23135v1
Date: Thu, 29 May 2025 06:12:52 GMT
ステータス: 翻訳完了
システム内更新日: 2025-05-30 18:14:07.709875
Title: VERINA: Benchmarking Verifiable Code Generation
Title（参考訳）: VERINA: 検証可能なコード生成のベンチマーク
Authors: Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song,
Abstract要約: 大規模言語モデル(LLM)は、ソフトウェア開発にますます統合されている。検証可能なコード生成は、この制限に対処するための有望なパスを提供する。現在のベンチマークでは、エンドツーエンドの検証可能なコード生成がサポートされていないことが多い。
参考スコア（独自算出の注目度）: 47.9771074559674
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.
Abstract（参考訳）: 大規模言語モデル(LLM)はソフトウェア開発にますます統合されているが、LLM生成コードの正確性を保証することは依然として困難であり、しばしば手作業によるレビューを必要とする。検証可能なコード生成 -- コード、仕様、コード固有化アライメントの証明を共同で生成する -- は、この制限に対処し、コーディングにおけるLLMのメリットをさらに解き放つ、有望な道を提供する。現在のベンチマークでは、エンドツーエンドの検証可能なコード生成がサポートされていないことが多い。本稿では,コード,仕様,証明生成を包括的かつモジュール的に評価できる高品質なベンチマークであるVerina(Verifiable Code Generation Arena)を紹介する。 Verinaは189の手作業によるリーンのコーディングタスクで構成されており、詳細な問題記述、リファレンス実装、正式な仕様、広範なテストスイートを備えている。現状のLLMの広範な評価は、検証可能なコード生成、特に証明生成において、検証領域におけるLLMベースの定理証明の改善の必要性を浮き彫りにしている。最高のモデルであるOpenAI o4-miniは61.4%の正しいコードと51.0%の音と完全な仕様、そして3.6%の成功例を生成できる。 Verinaが厳格で包括的なベンチマークを提供することで、検証可能なコード生成の進歩を触媒することを期待しています。データセットはhttps://huggingface.co/datasets/sunblaze-ucb/verinaで、評価コードはhttps://github.com/sunblaze-ucb/verinaで公開しています。

論文の概要: VERINA: Benchmarking Verifiable Code Generation

関連論文リスト