Fugu-MT Paper Translation (Abstract): Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Paper abstract: Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

arxiv url: http://arxiv.org/abs/2605.26457v1
Date: Tue, 26 May 2026 02:12:48 GMT
Status: Translation complete
System update date: 2026-05-27 17:51:41.575684
Title: Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization
Title (reference translation): Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization
Authors: Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck
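Abstract summary: We study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, and reject valid ones.
Reference score (independently computed attention metric): 26.123396123145415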
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, and the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, and LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, and (b) testing them against official Codeforces tests and adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1-57.8%, and OSS models reach only 21.5-25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, and reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, and logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym
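Abstract (reference translation): AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal specification itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, and the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, and LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, and (b) testing them against official Codeforces tests and adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1-57.8%, and OSS models reach only 21.5-25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, and reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, and logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym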
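The evaluation idea in the abstract, running a specification as ordinary code and checking it against labeled test cases, can be illustrated with a minimal sketch. This is plain Rust rather than Verus, and it does not use exec_spec; the toy problem ("return the maximum of a non-empty list") and every name below (`Case`, `Spec`, `Verdict`, `evaluate`, `post_upper_bound_only`, and so on) are hypothetical and are not taken from the paper or its repository. The sketch covers only output-side failures and a precondition that wrongly rejects a valid input; detecting an omitted input assumption would additionally require invalid inputs, which are left out here.

```rust
/// A labeled test case: an input, a candidate output, and whether that
/// output is actually correct for the input. (Hypothetical type.)
struct Case {
    input: Vec<i64>,
    output: i64,
    is_correct: bool,
}

/// An executable specification: a precondition over inputs and a
/// postcondition relating an input to an output. (Hypothetical type.)
struct Spec {
    pre: fn(&[i64]) -> bool,
    post: fn(&[i64], i64) -> bool,
}

/// Outcome of running a spec against the labeled cases.
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    RejectsValidInput,
    AcceptsIncorrectOutput,
    RejectsCorrectOutput,
}

/// Runs the spec on every case and reports the first disagreement with
/// the labels. All inputs in `cases` are assumed to be valid inputs.
fn evaluate(spec: &Spec, cases: &[Case]) -> Verdict {
    for case in cases {
        if !(spec.pre)(case.input.as_slice()) {
            return Verdict::RejectsValidInput;
        }
        let accepted = (spec.post)(case.input.as_slice(), case.output);
        if accepted && !case.is_correct {
            return Verdict::AcceptsIncorrectOutput;
        }
        if !accepted && case.is_correct {
            return Verdict::RejectsCorrectOutput;
        }
    }
    Verdict::Pass
}

/// Precondition for the toy problem: the list must be non-empty.
fn pre_nonempty(xs: &[i64]) -> bool {
    !xs.is_empty()
}

/// Faithful postcondition: the output is an element and bounds all elements.
fn post_max(xs: &[i64], out: i64) -> bool {
    xs.contains(&out) && xs.iter().all(|&x| x <= out)
}

/// Flawed postcondition: only checks the upper bound, so it also accepts
/// values that are larger than every element.
fn post_upper_bound_only(xs: &[i64], out: i64) -> bool {
    xs.iter().all(|&x| x <= out)
}

/// Two ordinary cases plus one adversarial, "hack"-style case.
fn sample_cases() -> Vec<Case> {
    vec![
        Case { input: vec![3, 1, 2], output: 3, is_correct: true },
        Case { input: vec![3, 1, 2], output: 2, is_correct: false },
        // Adversarial: an upper bound that is not an element of the input.
        Case { input: vec![3, 1, 2], output: 10, is_correct: false },
    ]
}
```

Calling `evaluate` with `Spec { pre: pre_nonempty, post: post_max }` on `sample_cases()` yields `Verdict::Pass`, while swapping in `post_upper_bound_only` yields `Verdict::AcceptsIncorrectOutput` on the adversarial case. The first two cases alone would not separate the two specs, which is the role the abstract assigns to cases derived from "hacks".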
