Fugu-MT Paper Translation (Abstract): Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

Paper overview: Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness

arxiv url: http://arxiv.org/abs/2605.10379v1
Date: Mon, 11 May 2026 11:23:36 GMT
Status: Translation complete
System update date: 2026-05-12 23:28:50.762018
Title: Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
Title (reference translation): Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
Authors: Ivo Petrov, Jasper Dekoninck, Dimitar I. Dimitrov, Martin Vechev
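Abstract summary: Large language models (LLMs) are effective at mathematical problem solving. ProofRank is a benchmark derived from challenging mathematical competitions. Proof quality differs substantially across models in ways that correctness-only benchmarks do not capture.
Reference score (independently computed attention score): 7.694715050727414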
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have become capable mathematical problem-solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. While this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify such components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the used proof techniques are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that are not captured by correctness-only benchmarks. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.
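Abstract (reference translation): Large language models (LLMs) have become capable mathematical problem solvers, often producing correct proofs for challenging problems. However, correctness alone is not sufficient: mathematical proofs should also be clear, concise, insightful, and transferable to other problems. Although this proof quality is subjective and depends on the reader and context, many of its components are concrete and broadly valued. In this work, we identify these components and introduce ProofRank, a benchmark curated from challenging mathematical competitions. ProofRank evaluates several scalable proxies of proof quality: (i) conciseness, measuring whether proofs avoid unnecessary steps; (ii) computational ease, measuring the extent to which a proof relies on tedious calculations; (iii) cognitive simplicity, measuring how accessible the proof techniques used are; (iv) diversity, measuring how varied a model's proofs for a single problem are; and (v) adaptivity, measuring whether a model can follow a specified proof technique. Across models, we find substantial differences in proof quality that correctness-only benchmarks do not capture. We also observe significant trade-offs between proof-quality metrics and correctness, suggesting that future evaluations of mathematical reasoning should measure how useful LLM-generated proofs are.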

Related papers