Fugu-MT paper translation (abstract): s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

Paper abstract: s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs

arxiv url: http://arxiv.org/abs/2603.14628v1
Date: Sun, 15 Mar 2026 21:55:59 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.924411
Title: s2n-bignum-bench: A practical benchmark for evaluating low-level code reasoning of LLMs
Title (reference translation): s2n-bignum-bench: A practical benchmark for evaluating the low-level code reasoning of LLMs
Authors: Balaji Rao, John Harrison, Soonho Kong, Juneyoung Lee, Carlo Lipizzi
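Abstract summary: s2n-bignum is a library used at AWS to provide fast assembly routines for cryptography. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics.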
Reference score (independently computed attention score): 0.45671221781968324
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Neurosymbolic approaches leveraging Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS for providing fast assembly routines for cryptography, and its correctness is established by formal verification. The task of formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at https://github.com/kings-crown/s2n-bignum-bench.
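Abstract (reference translation): Neurosymbolic approaches that combine Large Language Models (LLMs) with formal methods have recently achieved strong results on mathematics-oriented theorem-proving benchmarks. However, success on competition-style mathematics does not by itself demonstrate the ability to construct proofs about real-world implementations. We address this gap with a benchmark derived from an industrial cryptographic library whose assembly routines are already verified in HOL Light. s2n-bignum is a library used at AWS to provide fast assembly routines for cryptography, and its correctness is established by formal verification. Formally verifying this library has been a significant achievement for the Automated Reasoning Group. It involved two tasks: (1) precisely specifying the correct behavior of a program as a mathematical proposition, and (2) proving that the proposition is correct. In the case of s2n-bignum, both tasks were carried out by human experts. In s2n-bignum-bench, we provide the formal specification and ask the LLM to generate a proof script that is accepted by HOL Light within a fixed proof-check timeout. To our knowledge, s2n-bignum-bench is the first public benchmark focused on machine-checkable proof synthesis for industrial low-level cryptographic assembly routines in HOL Light. This benchmark provides a challenging and practically relevant testbed for evaluating LLM-based theorem proving beyond competition mathematics. The code to set up and use the benchmark is available at https://github.com/kings-crown/s2n-bignum-bench.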
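
The acceptance criterion described in the abstract (a generated proof script counts only if the checker accepts it within a fixed time limit) can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the function name `check_proof`, the `CheckResult` type, and the idea of passing the script on standard input to a `checker_cmd` command line are all assumptions made for illustration, and the command that would wrap HOL Light is left to the caller.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    accepted: bool   # checker exited with status 0 inside the time limit
    timed_out: bool  # checker was stopped at the time limit
    output: str      # combined stdout and stderr, useful as error feedback


def check_proof(checker_cmd, proof_script, timeout_s):
    """Send proof_script to checker_cmd on stdin and apply a fixed timeout.

    checker_cmd is a hypothetical command line (list of strings) wrapping
    the real proof checker; it is not the benchmark's actual interface.
    """
    try:
        done = subprocess.run(
            checker_cmd,
            input=proof_script,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # A proof that does not finish checking in time is a failure.
        return CheckResult(accepted=False, timed_out=True, output="")
    return CheckResult(
        accepted=(done.returncode == 0),
        timed_out=False,
        output=done.stdout + done.stderr,
    )
```

Treating a timeout as a plain failure mirrors the "fixed proof-check timeout" wording above; how the real benchmark reports timeouts, and what time limit it uses, is not stated in this abstract.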

List of related papers