Related papers: ConvexBench: Can LLMs Recognize Convex Functions?

URL: http://arxiv.org/abs/2602.01075v2
Date: Wed, 04 Feb 2026 08:09:18 GMT
Title: ConvexBench: Can LLMs Recognize Convex Functions?
Authors: Yepeng Liu, Yu Huang, Yu-Xiang Wang, Yingbin Liang, Yuheng Bu
Abstract summary: Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level math and sciences, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark for testing whether LLMs can identify the convexity of a symbolic objective under deep functional composition.
Score: 70.53167848190624
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Convex analysis is a modern branch of mathematics with many applications. As Large Language Models (LLMs) start to automate research-level math and sciences, it is important for LLMs to demonstrate the ability to understand and reason with convexity. We introduce ConvexBench, a scalable and mechanically verifiable benchmark for testing whether LLMs can identify the convexity of a symbolic objective under deep functional composition. Experiments on frontier LLMs reveal a sharp compositional reasoning gap: performance degrades rapidly with increasing depth, dropping from an F1-score of $1.0$ at depth $2$ to approximately $0.2$ at depth $100$. Inspection of models' reasoning traces indicates two failure modes: parsing failure and lazy reasoning. To address these limitations, we propose an agentic divide-and-conquer framework that (i) offloads parsing to an external tool to construct an abstract syntax tree (AST) and (ii) enforces recursive reasoning over each intermediate sub-expression with focused context. This framework reliably mitigates deep-composition failures, achieving substantial performance improvement at large depths (e.g., F1-score $= 1.0$ at depth $100$).
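
The divide-and-conquer idea in the abstract (parse once into an AST, then decide convexity bottom-up, one sub-expression at a time) can be illustrated with a small rule-based sketch. This is not the paper's framework, which uses an LLM for the per-node reasoning; here the per-node step is replaced by standard composition rules from disciplined convex programming. The atom table, the function names (`classify`, `curvature`, `compose`), and the supported syntax (`+`, `-`, unary minus, and four single-argument atoms) are all assumptions made for illustration, and domain restrictions such as `log` requiring a positive argument are ignored.

```python
import ast

# Hypothetical atom table (an assumption of this sketch):
# name -> (curvature of the atom, monotonicity of the atom).
ATOMS = {
    "exp": ("convex", "increasing"),
    "log": ("concave", "increasing"),
    "sqrt": ("concave", "increasing"),
    "abs": ("convex", "none"),
}
FLIP = {"convex": "concave", "concave": "convex"}


def negate(curv):
    """Negation swaps convex and concave; other labels are unchanged."""
    return FLIP.get(curv, curv)


def add(a, b):
    """Curvature of a sum, given the curvature of each term."""
    if "unknown" in (a, b):
        return "unknown"
    kinds = {a, b} - {"constant", "affine"}
    if not kinds:
        return "constant" if a == b == "constant" else "affine"
    # Convex + concave cannot be certified by these rules.
    return kinds.pop() if len(kinds) == 1 else "unknown"


def compose(atom, inner):
    """Curvature of atom(g), given the curvature of the inner expression g."""
    curv, mono = ATOMS[atom]
    if inner in ("unknown", "constant"):
        return inner
    if inner == "affine":
        return curv
    # A nonlinear inner expression needs matching monotonicity.
    if mono == "increasing" and inner == curv:
        return curv
    if mono == "decreasing" and inner == FLIP[curv]:
        return curv
    return "unknown"


def curvature(node):
    """Recursively label one AST node, looking only at its direct children."""
    if isinstance(node, ast.Expression):
        return curvature(node.body)
    if isinstance(node, ast.Constant):
        return "constant"
    if isinstance(node, ast.Name):
        return "affine"  # a bare variable such as x
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return negate(curvature(node.operand))
    if isinstance(node, ast.BinOp) and isinstance(node.op, (ast.Add, ast.Sub)):
        right = curvature(node.right)
        if isinstance(node.op, ast.Sub):
            right = negate(right)
        return add(curvature(node.left), right)
    if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
            and node.func.id in ATOMS and len(node.args) == 1):
        return compose(node.func.id, curvature(node.args[0]))
    return "unknown"  # anything outside the supported fragment


def classify(expr):
    """Parse an expression string and return its certified curvature label."""
    return curvature(ast.parse(expr, mode="eval"))


print(classify("exp(exp(x)) + abs(x - 1)"))
print(classify("log(sqrt(x)) - exp(x)"))
print(classify("exp(log(x))"))
```

A label of "unknown" means only that these rules cannot certify the expression, not that it is non-convex: `exp(log(x))` equals `x` on its domain, yet the composition rule alone cannot establish that. Each recursive call sees only one node and the labels of its children, which mirrors the "focused context" the abstract describes.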

Related papers

$\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$\nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop. $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
Tool Building as a Path to "Superintelligence" [7.762021543059531]
Diligent Learner framework suggests LLMs can achieve superintelligence via test-time search. We design a benchmark to measure $$ on logical out-of-distribution inference. We find that successful reasoning at scale is contingent upon precise tool calls.
arXiv Detail & Related papers (2026-02-24T16:22:10Z)
Hierarchical Evaluation of Software Design Capabilities of Large Language Models of Code [7.897548449569687]
Large language models (LLMs) are increasingly adopted in the software engineering domain, yet the robustness of their grasp on core design concepts remains unclear. We generate poorly designed software fragments under various levels of guidance. Reasoning about coupling proves brittle; performance collapses in noisy, open-ended scenarios. Reasoning-trace analysis confirms these failure modes, revealing cognitive shortcutting for coupling versus a more exhaustive (yet still failing) analysis for cohesion.
arXiv Detail & Related papers (2025-11-25T23:50:00Z)
Don't Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls [83.89771461061903]
Recent advancements in tree search algorithms guided by verifiers have significantly enhanced the reasoning capabilities of large language models (LLMs). We identify two key challenges contributing to this inefficiency: over-exploration due to redundant states with semantically equivalent content, and under-exploration caused by high variance in verifier scoring. We propose FETCH, a flexible, plug-and-play system compatible with various tree search algorithms.
arXiv Detail & Related papers (2025-02-16T16:12:01Z)
FLARE: Faithful Logic-Aided Reasoning and Exploration [47.46564769245296]
We introduce a novel approach for traversing the problem space using task decompositions. We use Large Language Models to plan a solution and soft-formalise the query into facts and predicates using logic programming code. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z)
Enumerating Minimal Unsatisfiable Cores of LTLf formulas [8.650929640364593]
Linear Temporal Logic over finite traces ($\text{LTL}_f$) is a widely used formalism with applications in AI, process mining, model checking, and more. This paper introduces a novel technique for enumerating minimal unsatisfiable cores (MUCs) of an $\text{LTL}_f$ specification.
arXiv Detail & Related papers (2024-09-14T17:15:30Z)
Can Large Language Models Play Games? A Case Study of A Self-Play Approach [61.15761840203145]
Large Language Models (LLMs) harness extensive data from the Internet, storing a broad spectrum of prior knowledge. Monte-Carlo Tree Search (MCTS) is a search algorithm that provides reliable decision-making solutions. This work introduces an innovative approach that bolsters LLMs with MCTS self-play to efficiently resolve turn-based zero-sum games.
arXiv Detail & Related papers (2024-03-08T19:16:29Z)
There is no Accuracy-Interpretability Tradeoff in Reinforcement Learning for Mazes [64.05903267230467]
Interpretability is an essential building block for trustworthiness in reinforcement learning systems. We show that in certain cases, one can achieve policy interpretability while maintaining its optimality.
arXiv Detail & Related papers (2022-06-09T04:23:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.