Fugu-MT Paper Translation (Abstract): ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

Paper overview: ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization

arxiv url: http://arxiv.org/abs/2602.22465v1
Date: Wed, 25 Feb 2026 22:54:26 GMT
Status: Translation complete
Last updated in system: 2026-02-27 18:41:22.440452
Title: ConstraintBench: Benchmarking LLM Constraint Reasoning on Direct Optimization
Title (reference translation): ConstraintBench: An LLM Constraint Reasoning Benchmark Based on Direct Optimization
Authors: Joseph Tso, Preston Schmittou, Quan Huynh, Jibran Hutchins
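Abstract summary: ConstraintBench is a benchmark for evaluating large language models on direct constrained optimization. The authors evaluate six frontier models on 200 tasks and find that feasibility, not optimality, is the primary bottleneck. No model exceeds 30.5% on joint feasibility and optimality within 0.1% of the solver reference.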
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly applied to operational decision-making where the underlying structure is constrained optimization. Existing benchmarks evaluate whether LLMs can formulate optimization problems as solver code, but leave open a complementary question. Can LLMs directly produce correct solutions to fully specified constrained optimization problems without access to a solver? We introduce ConstraintBench, a benchmark for evaluating LLMs on direct constrained optimization across 10 operations research domains, with all ground-truth solutions verified by the Gurobi solver. Each task presents a natural-language scenario with entities, constraints, and an optimization objective; the model must return a structured solution that a deterministic verifier checks against every constraint and the solver-proven optimum. We evaluate six frontier models on 200 tasks and find that feasibility, not optimality, is the primary bottleneck. The best model achieves only 65.0% constraint satisfaction, yet feasible solutions average 89 to 96% of the Gurobi-optimal objective. No model exceeds 30.5% on joint feasibility and optimality within 0.1% of the solver reference. Per-domain analysis shows large variation in difficulty, with average feasibility spanning from 83.3% in the production mix domain to 0.8% in the crew assignment domain. Further, systematic failure modes include duration constraint misunderstanding, entity hallucination, and a feasibility-optimality decoupling in facility location and vehicle routing where models achieve high feasibility but 0% optimality. ConstraintBench and all evaluation infrastructure will be publicly released.
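The verification step described in the abstract (check a returned solution against every constraint, then compare its objective with the solver-proven optimum) can be pictured with a minimal sketch. The abstract does not give the verifier's implementation, so everything here is an assumption for illustration: the toy production-mix task, the names `Constraint`, `verify`, `PROFITS`, `CONSTRAINTS`, and the hard-coded `REFERENCE_OPTIMUM` standing in for a Gurobi-verified value. Only the 0.1% tolerance comes from the source.

```python
from dataclasses import dataclass


@dataclass
class Constraint:
    """Hypothetical linear constraint: sum(coeffs[v] * x[v]) <= upper."""
    name: str
    coeffs: dict
    upper: float


# Toy production-mix task (illustrative only, not taken from the benchmark):
# maximize 3x + 5y subject to x <= 4, 2y <= 12, 3x + 2y <= 18.
PROFITS = {"x": 3.0, "y": 5.0}
CONSTRAINTS = [
    Constraint("line_1_capacity", {"x": 1.0}, 4.0),
    Constraint("line_2_capacity", {"y": 2.0}, 12.0),
    Constraint("shared_capacity", {"x": 3.0, "y": 2.0}, 18.0),
]
REFERENCE_OPTIMUM = 36.0  # assumed solver-proven optimum, attained at x=2, y=6


def verify(solution, profits, constraints, reference_optimum,
           tol=1e-3, eps=1e-9):
    """Check feasibility first, then the relative gap to the reference optimum."""
    violated = []

    # Variables the task never defined (a stand-in for entity hallucination).
    unknown = sorted(set(solution) - set(profits))
    if unknown:
        violated.append("unknown entities: " + ", ".join(unknown))

    # Quantities must be non-negative in this toy task.
    if any(value < -eps for value in solution.values()):
        violated.append("negative quantity")

    # Every constraint is checked; one violation makes the solution infeasible.
    for c in constraints:
        lhs = sum(coef * solution.get(var, 0.0) for var, coef in c.coeffs.items())
        if lhs > c.upper + eps:
            violated.append(c.name)

    feasible = not violated
    objective = sum(p * solution.get(var, 0.0) for var, p in profits.items())
    gap = abs(reference_optimum - objective) / abs(reference_optimum)

    return {
        "feasible": feasible,
        "violated": violated,
        "objective": objective,
        "gap": gap,
        # Joint criterion: feasible AND within tol (0.1%) of the reference.
        "joint": feasible and gap <= tol,
    }


if __name__ == "__main__":
    for candidate in ({"x": 2, "y": 6}, {"x": 4, "y": 3}, {"x": 4, "y": 6}):
        print(candidate, verify(candidate, PROFITS, CONSTRAINTS, REFERENCE_OPTIMUM))
```

In this sketch a solution can be feasible yet fail the joint criterion (for example `{"x": 4, "y": 3}`), and a solution with a higher objective than the reference is rejected because it violates a constraint, which is why feasibility and optimality are reported separately.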
