Fugu-MT paper translation (abstract): Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation

Paper overview: Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation

arxiv url: http://arxiv.org/abs/2510.16943v1
Date: Sun, 19 Oct 2025 17:47:59 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.217521
Title: Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modelling through Component-Level Evaluation
Title (reference translation): Peering Inside the Black Box: Uncovering LLM Errors in Optimization Modeling through Component-Level Evaluation
Authors: Dania Refai, Moataz Ahmed
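Abstract summary: The paper proposes a component-level evaluation framework for formulations generated by large language models (LLMs). GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math are evaluated on optimization problems of varying complexity. The results show that GPT-5 consistently outperforms the other models, with chain-of-thought, self-consistency, and modular prompting proving most effective.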
Reference score (independently computed attention score): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly used to convert natural language descriptions into mathematical optimization formulations. Current evaluations often treat formulations as a whole, relying on coarse metrics like solution accuracy or runtime, which obscure structural or numerical errors. In this study, we present a comprehensive, component-level evaluation framework for LLM-generated formulations. Beyond the conventional optimality gap, our framework introduces metrics such as precision and recall for decision variables and constraints, constraint and objective root mean squared error (RMSE), and efficiency indicators based on token usage and latency. We evaluate GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math across optimization problems of varying complexity under six prompting strategies. Results show that GPT-5 consistently outperforms other models, with chain-of-thought, self-consistency, and modular prompting proving most effective. Analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability. Constraint precision and decision variable metrics play secondary roles, while concise outputs enhance computational efficiency. These findings highlight three principles for NLP-to-optimization modeling: (i) Complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.
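Abstract (reference translation): Large language models (LLMs) are increasingly used to convert natural-language descriptions into mathematical optimization formulations. Current evaluations often treat a formulation as a whole and rely on coarse metrics such as solution accuracy or runtime, which obscure structural and numerical errors. This study proposes a comprehensive, component-level evaluation framework for LLM-generated formulations. In addition to the conventional optimality gap, the framework introduces metrics such as precision and recall for decision variables and constraints, root mean squared error (RMSE) for constraints and objectives, and efficiency indicators based on token usage and latency. GPT-5, LLaMA 3.1 Instruct, and DeepSeek Math are evaluated on optimization problems of varying complexity under six prompting strategies. The results show that GPT-5 consistently outperforms the other models and that chain-of-thought, self-consistency, and modular prompting are the most effective strategies. The analysis indicates that solver performance depends primarily on high constraint recall and low constraint RMSE, which together ensure structural correctness and solution reliability. Constraint precision and decision-variable metrics play secondary roles, while concise outputs improve computational efficiency. These results highlight three principles for NLP-to-optimization modeling: (i) complete constraint coverage prevents violations, (ii) minimizing constraint RMSE ensures solver-level accuracy, and (iii) concise outputs improve computational efficiency. The proposed framework establishes a foundation for fine-grained, diagnostic evaluation of LLMs in optimization modeling.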
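The abstract names its component-level metrics but does not define them. The sketch below shows one plausible reading in Python. It is not the authors' implementation. It assumes that components (constraints or decision variables) have already been normalized to comparable identifiers, that RMSE is taken over paired numeric values such as coefficients or right-hand sides, and that the optimality gap is the relative difference from a known best objective. The function names `precision_recall`, `rmse`, and `optimality_gap` are hypothetical.

```python
import math


def precision_recall(predicted, reference):
    """Set-based precision and recall over component identifiers.

    Assumption: each constraint or decision variable has been reduced to a
    hashable identifier, so exact set membership stands in for "matches".
    """
    predicted, reference = set(predicted), set(reference)
    matched = predicted & reference
    precision = len(matched) / len(predicted) if predicted else 0.0
    recall = len(matched) / len(reference) if reference else 0.0
    return precision, recall


def rmse(predicted_values, reference_values):
    """Root mean squared error over paired numeric values."""
    if len(predicted_values) != len(reference_values):
        raise ValueError("value lists must have the same length")
    if not reference_values:
        return 0.0
    squared = [(p - r) ** 2 for p, r in zip(predicted_values, reference_values)]
    return math.sqrt(sum(squared) / len(squared))


def optimality_gap(model_objective, best_objective):
    """Relative gap |z - z*| / |z*|; assumes the best objective is nonzero."""
    return abs(model_objective - best_objective) / abs(best_objective)


# Hypothetical example: the reference model has four constraints; the
# generated model recovers two of them and adds one spurious constraint.
reference_constraints = {"c1", "c2", "c3", "c4"}
generated_constraints = {"c1", "c2", "c5"}
p, r = precision_recall(generated_constraints, reference_constraints)
print(f"constraint precision={p:.3f}, recall={r:.3f}")
print(f"constraint RMSE={rmse([10, 20], [10, 24]):.3f}")
print(f"optimality gap={optimality_gap(95.0, 100.0):.3f}")
```

In this toy case recall (2 of 4) is lower than precision (2 of 3), the kind of missing-constraint pattern that the abstract links to solver-level failures.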


Related papers