Related papers: Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities

URL: http://arxiv.org/abs/2505.12680v2
Date: Mon, 20 Oct 2025 04:36:02 GMT
Title: Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities
Authors: Haoyu Zhao, Yihan Geng, Shange Tang, Yong Lin, Bohan Lyu, Hongzhou Lin, Chi Jin, Sanjeev Arora,
Abstract summary: We study provers' ability to recognize that the given problem simplifies by applying a known inequality such as AM/GM.<n>Although these problems remain easy for humans, we find that most provers -- including Goedel, rewriting, and Kimina-7B -- struggle significantly.<n>Our results expose a persisting gap between the behavior of current AI provers and human intuition.
Score: 46.111273938884295
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question in context of mathematical inequalities -- specifically the prover's ability to recognize that the given problem simplifies by applying a known inequality such as AM/GM. Specifically, we are interested in their ability to do this in a compositional setting where multiple inequalities must be applied as part of a solution. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers -- including Goedel, STP, and Kimina-7B -- struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness, but still suffers a 20% performance drop (pass@32). Even for DeepSeek-Prover-V2-671B model, the gap between compositional variants and seed problems exists, implying that simply scaling up the model size alone does not fully solve the compositional weakness. Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition. All data and evaluation code can be found at https://github.com/haoyuzhao123/LeanIneqComp.

Related papers

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study gap through contextual mathematical reasoning.<n>We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings.<n>Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z)
Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors [11.169118114200307]
Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution.<n>We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways.<n>Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
arXiv Detail & Related papers (2025-10-02T09:26:26Z)
Performative Thinking? The Brittle Correlation Between CoT Length and Problem Complexity [23.225139930889522]
This work critically examines whether intermediate token sequence length reflects or correlates with problem difficulty.<n>We train transformer models from scratch on derivational traces of the A* search algorithm.<n>We find that even for the simplest tasks, they often produce excessively long reasoning traces and sometimes fail to generate a solution.
arXiv Detail & Related papers (2025-09-09T02:31:16Z)
AgentCoMa: A Compositional Benchmark Mixing Commonsense and Mathematical Reasoning in Real-World Scenarios [21.065210731722246]
We introduce an Agentic Commonsense and Math benchmark (AgentCoMa)<n>Each compositional task requires a commonsense reasoning step and a math reasoning step.<n>We find that LLMs can usually solve both steps in isolation, yet their accuracy drops by 30% on average when the two are combined.
arXiv Detail & Related papers (2025-08-27T15:47:19Z)
Solving Inequality Proofs with Large Language Models [46.71658812761115]
Inequality proving is crucial across diverse scientific and mathematical fields.<n>This makes it a demanding frontier for large language models (LLMs)<n>We release IneqMath, an expert-curated dataset of Olympiad-level inequalities.
arXiv Detail & Related papers (2025-06-09T16:43:38Z)
Enhancing Mathematical Reasoning in Large Language Models with Self-Consistency-Based Hallucination Detection [0.0]
We introduce a structured self-consistency framework designed to enhance the reliability of mathematical reasoning.<n>Our method enforces self-consistency across intermediate steps and final outputs, reducing logical inconsistencies and hallucinations.<n> Experimental results demonstrate that SC significantly improves proof validity, symbolic reasoning accuracy, and numerical stability.
arXiv Detail & Related papers (2025-04-13T05:47:52Z)
Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length? [72.70486097967124]
We formalize a framework using deterministic finite automata (DFAs)<n>We show that there exists an optimal amount of reasoning tokens such that the probability of producing a correct solution is maximized.<n>We then demonstrate an implication of these findings: being able to predict the optimal number of reasoning tokens for new problems and filtering out non-optimal length answers results in consistent accuracy improvements.
arXiv Detail & Related papers (2025-04-02T17:45:58Z)
PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models [59.920971312822736]
We introduce PromptCoT, a novel approach for automatically generating high-quality Olympiad-level math problems.<n>The proposed method synthesizes complex problems based on mathematical concepts and the rationale behind problem construction.<n>Our method is evaluated on standard benchmarks including GSM8K, MATH-500, and AIME2024, where it consistently outperforms existing problem generation methods.
arXiv Detail & Related papers (2025-03-04T06:32:30Z)
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.<n>We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs [80.96119560172224]
MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure.<n>Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider.
arXiv Detail & Related papers (2024-10-17T12:48:14Z)
Artifical intelligence and inherent mathematical difficulty [0.0]
We first present an updated version of a traditional argument that limitative results from computability and complexity theory show that proof discovery is an inherently difficult problem. We then illustrate how several recent applications of artificial intelligence-inspired methods do indeed raise novel questions about the nature of mathematical proof.
arXiv Detail & Related papers (2024-08-01T20:08:31Z)
MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula [33.5782208232163]
We propose Math CAMPS: a method to synthesize high-quality mathematical problems at scale. We encode each standard in a formal grammar, allowing us to sample diverse symbolic problems and their answers. We derive follow-up questions from symbolic structures and convert them into follow-up word problems.
arXiv Detail & Related papers (2024-07-01T01:56:28Z)
A Hybrid System for Systematic Generalization in Simple Arithmetic Problems [70.91780996370326]
We propose a hybrid system capable of solving arithmetic problems that require compositional and systematic reasoning over sequences of symbols. We show that the proposed system can accurately solve nested arithmetical expressions even when trained only on a subset including the simplest cases.
arXiv Detail & Related papers (2023-06-29T18:35:41Z)
Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks. These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer. Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.