Fugu-MT Paper Translation (Abstract): ConvexBench: Can LLMs Recognize Convex Functions?

Paper summary: ConvexBench: Can LLMs Recognize Convex Functions?

arxiv url: http://arxiv.org/abs/2602.01075v2
Date: Wed, 04 Feb 2026 08:09:18 GMT
Status: Translation complete
System update date: 2026-02-05 15:07:33.699518
Title: ConvexBench: Can LLMs Recognize Convex Functions?
Title (reference translation): ConvexBench: Can LLMs Recognize Convex Functions?
Authors: Yepeng Liu, Yu Huang, Yu-Xiang Wang, Yingbin Liang, Yuheng Bu
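Abstract summary: Convex analysis is a modern branch of mathematics with many applications. As large language models (LLMs) begin to automate research-level mathematics and science, it is important for LLMs to demonstrate the ability to understand and reason about convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark that tests whether LLMs can identify the convexity of a symbolic objective under deep functional composition.
Reference score (independently computed attention score): 70.53167848190624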
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level math and sciences, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark for testing whether LLMs can identify the convexity of a symbolic objective under deep functional composition. Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly with increasing depth, dropping from an F1-score of $1.0$ at depth $2$ to approximately $0.2$ at depth $100$. Inspection of models' reasoning traces indicates two failure modes: parsing failure and lazy reasoning. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures, achieving substantial performance improvement at large depths (e.g., F1-score $= 1.0$ at depth $100$).
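Abstract (reference translation): Convex analysis is a modern branch of mathematics with many applications. As large language models (LLMs) begin to automate research-level mathematics and science, it is important for LLMs to demonstrate the ability to understand and reason about convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark for testing whether LLMs can identify the convexity of a symbolic objective under deep functional composition. Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly as depth increases, with the F1-score dropping from $1.0$ at depth $2$ to approximately $0.2$ at depth $100$. Inspection of the models' reasoning traces indicates two failure modes: parsing failure and lazy reasoning. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures and achieves substantial performance improvement at large depths (e.g., F1-score $= 1.0$ at depth $100$).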
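
Illustrative sketch (not from the paper): the two steps named in the abstract, external parsing into an AST followed by recursion over each sub-expression, can be mimicked in a few lines of Python. This is only a toy under stated assumptions. In place of the per-node model reasoning described in the abstract, it applies a small fixed table of standard composition rules from disciplined convex programming, which are sufficient but not necessary conditions. The names ATOMS, curvature, and classify are hypothetical, the atom table is deliberately tiny, and domain restrictions (for example log requiring a positive argument) are ignored.

```python
import ast

CONVEX, CONCAVE, AFFINE, UNKNOWN = "convex", "concave", "affine", "unknown"

# Hypothetical atom table: name -> (curvature, is_nondecreasing).
# Assumption: univariate atoms only; domains are not checked.
ATOMS = {
    "exp": (CONVEX, True),
    "log": (CONCAVE, True),
    "sqrt": (CONCAVE, True),
    "abs": (CONVEX, False),
}

def const_value(node):
    """Return the value of a numeric literal (possibly negated), else None."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        inner = const_value(node.operand)
        return None if inner is None else -inner
    return None

def negate(c):
    """Negation swaps convex and concave; affine and unknown are unchanged."""
    return {CONVEX: CONCAVE, CONCAVE: CONVEX}.get(c, c)

def add(a, b):
    """Sum rule: affine is neutral, equal curvatures are preserved."""
    if a == AFFINE:
        return b
    if b == AFFINE:
        return a
    return a if a == b else UNKNOWN

def curvature(node):
    """Recurse over the AST, labeling each sub-expression from its children."""
    if isinstance(node, ast.Expression):
        return curvature(node.body)
    if const_value(node) is not None or isinstance(node, ast.Name):
        return AFFINE  # constants and bare variables
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return negate(curvature(node.operand))
    if isinstance(node, ast.BinOp):
        if isinstance(node.op, ast.Add):
            return add(curvature(node.left), curvature(node.right))
        if isinstance(node.op, ast.Sub):
            return add(curvature(node.left), negate(curvature(node.right)))
        if isinstance(node.op, ast.Mult):
            # Only scaling by a numeric constant is handled.
            for k, other in ((node.left, node.right), (node.right, node.left)):
                c = const_value(k)
                if c is not None:
                    if c == 0:
                        return AFFINE
                    inner = curvature(other)
                    return inner if c > 0 else negate(inner)
            return UNKNOWN
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in ATOMS and len(node.args) == 1):
        outer, nondecreasing = ATOMS[node.func.id]
        inner = curvature(node.args[0])
        if inner == AFFINE:
            return outer  # f(affine) keeps the curvature of f
        if nondecreasing and inner == outer:
            return outer  # e.g. convex nondecreasing of convex is convex
        return UNKNOWN
    return UNKNOWN

def classify(expr):
    """Parse with Python's ast module (the 'external tool' here) and classify."""
    return curvature(ast.parse(expr, mode="eval"))
```

For instance, classify("exp(2*x + 1) + abs(x)") yields "convex", while classify("log(exp(x))") yields "unknown" even though the function equals x: a result of "unknown" means only that these rules are inconclusive, not that the expression is non-convex.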

Related papers