Fugu-MT Paper Translation (Abstract): RAL-Bench: Benchmarking for Application-Level Functional Correctness and Non-Functional Quality Attributes

Paper abstract: RAL-Bench: Benchmarking for Application-Level Functional Correctness and Non-Functional Quality Attributes

arxiv url: http://arxiv.org/abs/2602.03462v1
Date: Tue, 03 Feb 2026 12:35:09 GMT
Status: Translation complete
Last updated in system: 2026-02-04 18:37:15.444576
Title: RAL-Bench: Benchmarking for Application-Level Functional Correctness and Non-Functional Quality Attributes
Authors: Ruwei Pan, Yakun Zhang, Qingyuan Liang, Yueheng Zhu, Chao Liu, Lu Zhang, Hongyu Zhang
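Abstract summary: RAL-Bench is a benchmark and evaluation framework for application-level code generation. For each task, it distills a concise natural-language requirement from a high-quality reference project. Black-box system tests cover functional and non-functional attributes, and only tests that pass on the reference repository are kept.
Reference score (attention score computed independently by this site): 12.202503919149118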
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code generation has advanced rapidly with code-focused large language models (LLMs), especially on snippet-level tasks. However, application-level generation requires producing a runnable multi-file repository with correct structure, dependencies, and end-to-end executability, and real-world software must satisfy both functional correctness and non-functional quality (e.g., maintainability, security). Existing benchmarks provide a limited execution-based assessment of these requirements at the application level. We ask: Can current LLMs generate application-level repositories that meet both functional and non-functional criteria? We propose RAL-Bench, a benchmark and evaluation framework for application-level code generation. For each task, we distill a concise natural-language requirement from a high-quality reference project, build black-box system tests covering functional and non-functional attributes, and keep only tests that pass on the reference repository to ensure a sound oracle and an end-to-end executable suite. Functional correctness is measured by system-test pass rate. Non-functional quality is measured along five ISO/IEC 25010-inspired dimensions and aggregated with an Analytic Hierarchy Process (AHP)-derived weight vector, with per-dimension diagnostics and baseline-normalized scoring using reference measurements. Across 16 LLMs evaluated zero-shot with greedy decoding, functional correctness is the dominant bottleneck: no model exceeds a 45% functional pass rate under our requirement-driven, reference-validated tests. We release RAL-Bench at https://github.com/Wwstarry/RAL-Bench.
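The scoring scheme described in the abstract can be illustrated with a minimal sketch. It is not the RAL-Bench implementation: the abstract gives neither the pairwise comparison matrix, nor all five dimension names, nor the exact normalization formula. The sketch assumes AHP weights are approximated with normalized row geometric means (a common approximation to the principal eigenvector), that each dimension score is the ratio of a generated repository's measurement to the reference measurement clipped to [0, 1], and that the aggregate is a weighted sum. All function names, the placeholder dimension labels, and the numbers are hypothetical.

```python
import math

# Only "maintainability" and "security" are named in the abstract;
# the other three labels are placeholders (assumption).
DIMENSIONS = ["maintainability", "security", "dim_3", "dim_4", "dim_5"]


def pass_rate(results):
    """Functional correctness: fraction of system tests that passed."""
    return sum(results) / len(results) if results else 0.0


def ahp_weights(pairwise):
    """Approximate AHP priority weights from a pairwise comparison matrix.

    Uses normalized row geometric means, which equals the principal
    eigenvector when the matrix is perfectly consistent.
    """
    geo_means = [math.prod(row) ** (1.0 / len(row)) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def normalized_score(measured, baseline, higher_is_better=True):
    """Score a measurement relative to the reference, clipped to [0, 1].

    The ratio-and-clip form is an assumption, not the paper's formula.
    """
    if measured <= 0 or baseline <= 0:
        return 0.0
    ratio = measured / baseline if higher_is_better else baseline / measured
    return max(0.0, min(1.0, ratio))


def aggregate(scores, weights):
    """Weighted sum of per-dimension scores."""
    return sum(w * s for w, s in zip(weights, scores))


if __name__ == "__main__":
    # Hypothetical, perfectly consistent pairwise matrix built from
    # illustrative importance values (not taken from the paper).
    importance = [4.0, 2.0, 2.0, 1.0, 1.0]
    pairwise = [[a / b for b in importance] for a in importance]
    weights = ahp_weights(pairwise)

    # Hypothetical per-dimension scores, already normalized to [0, 1].
    scores = [0.8, 0.6, 0.9, 0.5, 1.0]
    for name, w, s in zip(DIMENSIONS, weights, scores):
        print(f"{name}: weight={w:.2f} score={s:.2f}")
    print("non-functional score:", round(aggregate(scores, weights), 3))
    print("functional pass rate:", pass_rate([True, False, True, True]))
```

Keeping the per-dimension scores alongside the aggregate mirrors the per-dimension diagnostics mentioned in the abstract: a single weighted number can hide a weak dimension.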


Related papers