Fugu-MT 論文翻訳(概要): Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

論文の概要: Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

arxiv url: http://arxiv.org/abs/2603.17193v1
Date: Tue, 17 Mar 2026 22:46:42 GMT
ステータス: 翻訳完了
システム内更新日: 2026-03-19 18:32:57.430105
Title: Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization
Title（参考訳）: 講演はチープで論理は難しい - コンディション後の形式化に関するLLMのベンチマーク
Authors: I. S. W. B. Prasetya, Fitsum Kifetew, Davide Prandi,
Abstract要約: 本稿では,プログラムの FULL プレコンディションとポストコンディションの生成における LLM の有効性について述べる。新たな40タスクのデータセットを用いて24種類の最先端LCMを評価した。
参考スコア（独自算出の注目度）: 0.32665457005470505
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Formal specifications, such as pre- and post-conditions provide a solid basis for performing thorough program verification. However, developers rarely provide such formal specifications, hence if AI could help in constructing them, it would make formal verification possible or at least make automated testing much more effective. This paper presents a study on the ability of LLMs in generating formal FULL pre- and post-conditions of a program, given its natural language description. 24 state-of-the-art LLMs were evaluated on a freshly prepared dataset of 40 tasks. The paper investigates specifications of varying difficulty and discusses a set of more refined performance metrics in addition the general accept@1 performance. It also investigates the impact of using automatically generated tests for validation of the solutions proposed by LLMs. The results of the experiment reveal that, in general LLMs can produce valid pre- and post-conditions based on natural language descriptions of programs. Incorrect solutions from proprietary models are also often near correct. A closer inspection shows that open-source models tend to result in a higher proportion of erroneous results while proprietary models tend to have slightly higher false negative rates. Interestingly, the results also show that augmenting the manually prepared dataset with automatically generated tests leads to the exposure of wrong solutions, which would have otherwise been accepted as correct. In general, LLMs perform better in formalizing pre-conditions than on post-conditions and proprietary models perform better than open ones. However, none of the LLMs were able to correctly formalize all the tasks in our benchmark. Overall, the effectiveness and reliability of LLMs in formalizing pre- and post-conditions could be greatly improved by using a good test suite that checks the correctness of the LLM generated formalizations.
Abstract（参考訳）: 事前条件や後条件のような形式的な仕様は、徹底的なプログラム検証を行うための確かな基盤を提供する。しかし、開発者はそのような形式的な仕様をほとんど提供しないため、もしAIがそれらを構築するのに役立ったら、形式的な検証を可能にするか、少なくとも自動テストをより効果的にするでしょう。本稿では,LLMがプログラムの形式的事前条件と後条件を生成する能力について,その自然言語的記述から検討する。新たな40タスクのデータセットを用いて24種類の最先端LCMを評価した。本稿では、様々な難しさの仕様を調査し、一般的なaccept@1パフォーマンスに加えて、より洗練されたパフォーマンス指標のセットについて論じる。また、LLMが提案するソリューションの検証に自動生成されたテストを使用することの影響についても検討した。実験の結果、LLMはプログラムの自然言語記述に基づいて、有効な事前条件と後条件を生成できることが判明した。プロプライエタリなモデルからの間違った解も、しばしばほぼ正しい。より綿密な検査では、オープンソースモデルは誤った結果の比率が高い傾向を示し、プロプライエタリモデルはわずかに偽陰性率が高い傾向を示す。興味深いことに、結果は、自動生成されたテストで手動で準備されたデータセットを増大させることで、間違ったソリューションが露出されることも示している。一般に、LLMは後条件よりも条件の形式化に優れ、プロプライエタリモデルはオープン条件よりも優れている。しかしながら、LLMはいずれも、ベンチマーク中のすべてのタスクを正しく形式化することができませんでした。全体として,LLMの生成した形式化の正しさをチェックする優れたテストスイートを使用することで,プレコンディションおよびポストコンディションの形式化におけるLCMの有効性と信頼性を大幅に向上させることができる。

論文の概要: Talk is Cheap, Logic is Hard: Benchmarking LLMs on Post-Condition Formalization

関連論文リスト