Fugu-MT Paper Translation (Abstract): Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

Paper summary: Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

arxiv url: http://arxiv.org/abs/2606.05792v1
Date: Thu, 04 Jun 2026 07:22:01 GMT
Status: Translation complete
System update date: 2026-06-05 22:39:44.617683
Title: Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Authors: Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad
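Abstract summary: TLA+ has supported industrial verification at companies such as Amazon and Microsoft, but writing correct TLA+ specifications from natural language requires time and expertise. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language.
Reference score (independently computed attention score): 1.550528651800741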
License: http://creativecommons.org/licenses/by/4.0/
Abstract: TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.
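The two-stage validation mentioned in the abstract (SANY for parsing, TLC for model checking) could be sketched roughly as follows. This is an illustrative sketch, not the authors' released framework: the jar path, the function names, the outcome labels, and the rule that a zero exit code means "pass" are all assumptions. Some tool versions may report errors only in their output, in which case the output would need to be inspected instead.

```python
import subprocess
from pathlib import Path
from typing import Callable, List, Sequence, Tuple

# Hypothetical location of the TLA+ tools jar; adjust to the local installation.
TLA2TOOLS_JAR = "tla2tools.jar"

# A runner takes a command line and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    return subprocess.run(cmd, capture_output=True, text=True).returncode


def sany_command(spec: Path) -> List[str]:
    """Command line that parses a specification with the SANY parser."""
    return ["java", "-cp", TLA2TOOLS_JAR, "tla2sany.SANY", str(spec)]


def tlc_command(spec: Path, config: Path) -> List[str]:
    """Command line that model-checks a specification with TLC."""
    return ["java", "-cp", TLA2TOOLS_JAR, "tlc2.TLC",
            "-config", str(config), str(spec)]


def classify(spec: Path, config: Path, run: Runner = run_command) -> str:
    """Label one generated spec (assumption: exit code 0 means the stage passed)."""
    if run(sany_command(spec)) != 0:
        return "syntax_error"          # failed to parse
    if run(tlc_command(spec, config)) != 0:
        return "semantic_error"        # parsed, but model checking failed
    return "semantically_correct"      # parsed and passed model checking


def correctness_rates(outcomes: Sequence[str]) -> Tuple[float, float]:
    """Return (syntactic rate, semantic rate) over a list of outcome labels."""
    total = len(outcomes)
    if total == 0:
        return (0.0, 0.0)
    semantic = sum(o == "semantically_correct" for o in outcomes)
    # Anything that got past SANY counts as syntactically correct.
    syntactic = semantic + sum(o == "semantic_error" for o in outcomes)
    return (syntactic / total, semantic / total)
```

Passing the runner as a parameter keeps the classification logic testable without a Java installation. Under this labeling the semantic rate can never exceed the syntactic rate, which matches the ordering of the 26.6% and 8.6% figures reported in the abstract.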

