Fugu-MT 論文翻訳(概要): LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

論文の概要: LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

arxiv url: http://arxiv.org/abs/2605.01394v1
Date: Sat, 02 May 2026 11:31:33 GMT
ステータス: 翻訳完了
システム内更新日: 2026-05-05 20:33:49.748201
Title: LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation
Title（参考訳）: LiveFMBench: 仕様生成におけるエージェントワークフローのパワーと限界を明らかにする
Authors: Dong Xu, Jialun Cao, Guozhao Mo, Junjie Hu, Cheng Wen, Hongyu Lin, Xianpei Han, Shengchao Qin, Cong Tian, Shing-Chi Cheung, Le Sun, Yaojie Lu,
Abstract要約: 大規模言語モデル(LLM)とエージェントは有望な進歩を示しているが、その真の能力と失敗モードは未だ不明である。 CプログラムのためのLCMおよびエージェントベースの形式仕様生成に関する、最初の体系的および汚染に配慮した研究を提案する。
参考スコア（独自算出の注目度）: 75.05397479715576
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Formal specification is essential for rigorous program verification, yet writing correct specifications remains costly and difficult to automate. Although large language models (LLMs) and agents have shown promising progress, their true capabilities and failure modes remain unclear. We present the first systematic and contamination-aware study of LLM- and agent-based formal specification generation for C programs. We introduce LiveFMBench, a continuously evolving benchmark of 630 ACSL (ANSI/ISO C Specification Language)-annotated C programs, including 360 newly collected cases designed to mitigate data leakage. Using this benchmark, we evaluate direct prompting with different sampling sizes, reasoning-enabled (thinking mode) inference, the agentic pipeline, and perform a fine-grained failure analysis. Experimental results reveal that naive evaluation substantially overestimates performance because models under direct prompting may exhibit unfaithful behaviors, such as deceiving automated provers or ignoring code-context constraints; after excluding such cases, the true specification generation accuracy drops by approximately 20\%. We further find that both increased sampling and thinking mode significantly improve success rates, with smaller models benefiting more from thinking mode. Agentic pipelines are particularly effective under low sampling budgets and on harder datasets. Failure analysis further shows that incorrect loop invariants are the dominant error type, while agentic pipelines notably reduce assertion errors. These results expose fundamental limitations in current LLM-based approaches and suggest they remain far from replacing human-authored formal specifications. We release LiveFMBench at https://huggingface.co/datasets/fm-universe/Live-FM-Bench and all evaluation artifacts to support future research.
Abstract（参考訳）: 形式仕様は厳密なプログラム検証には不可欠だが、正確な仕様を書くことはコストがかかり、自動化が難しい。大きな言語モデル(LLM)とエージェントは将来的な進歩を示しているが、その真の能力と失敗モードは未だ不明である。 CプログラムのためのLCMおよびエージェントベースの形式仕様生成に関する、最初の体系的および汚染に配慮した研究を提案する。 630 ACSL (ANSI/ISO C Specification Language) アノテーション付き C プログラムの継続的なベンチマークである LiveFMBench を紹介する。このベンチマークを用いて、異なるサンプリングサイズ、推論可能な推論可能な(思考モード)推論、エージェントパイプラインを用いて直接プロンプトを評価し、きめ細かい故障解析を行う。実験結果から,直接的プロンプト下でのモデルでは,自動プロバーの誤認やコードコンテキスト制約の無視といった不誠実な動作が生じる可能性があるため,本手法を除外すると,真の仕様生成精度が約20倍低下する可能性が示唆された。さらに、サンプリングモードと思考モードの増加が成功率を著しく向上させ、より小さなモデルの方が思考モードの恩恵を受けることが判明した。エージェントパイプラインは、低サンプリング予算とより厳しいデータセットで特に有効である。フェール解析により、誤ループ不変が主流のエラータイプであり、エージェントパイプラインは特にアサーションエラーを減少させる。これらの結果は、現在のLLMベースのアプローチにおける基本的な制限を明らかにしており、人間による正式な仕様の置き換えには程遠いことを示唆している。私たちはLiveFMBenchをhttps://huggingface.co/datasets/fm-universe/Live-FM-Benchでリリースします。

論文の概要: LiveFMBench: Unveiling the Power and Limits of Agentic Workflows in Specification Generation

関連論文リスト