MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
- URL: http://arxiv.org/abs/2410.13502v2
- Date: Mon, 21 Oct 2024 15:58:30 GMT
- Title: MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
- Authors: Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
- Abstract summary: Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to problems that are more complex than the ones on which they have been trained.
We present a framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP.
- Score: 80.96119560172224
- Abstract: Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to problems that are more complex than the ones on which they have been trained. Empirical investigations of such questions are impeded by two major flaws of current evaluations: (i) much of the evaluation data is contaminated, in the sense that it has already been seen during training, and (ii) benchmark datasets do not capture how problem proofs may be arbitrarily complex in various ways. As a step towards addressing these issues, we present a framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problems that follow fixed proof specifications -- along with chain-of-thought reasoning annotations -- enabling systematic studies on generalization with respect to arithmetic proof complexity. We apply MathGAP to analyze how in-context learning interacts with generalization to problems that have more complex proofs. We find that among the models tested, most show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for GPT-4o. Surprisingly, providing in-context examples from the same distribution as the test set is not always beneficial for performance. In particular, zero-shot prompting as well as demonstrating a diverse range of examples that are less complex than the test data sometimes yield similar or higher accuracies.
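To make the idea of generating problems from fixed proof specifications more concrete, below is a minimal illustrative sketch, not MathGAP's actual implementation: all names (Fact, generate_problem, the apple-transfer template) are hypothetical. It produces an arithmetic word problem whose solution follows a linear proof chain of configurable depth, together with a matching chain-of-thought annotation and the ground-truth answer.

```python
# Toy sketch of proof-specification-driven problem generation.
# This is NOT the MathGAP codebase; names and templates are hypothetical.
import random
from dataclasses import dataclass


@dataclass
class Fact:
    """One premise of the proof: a transfer of `amount` apples involving `person`."""
    person: str
    amount: int  # positive = main character receives, negative = gives away


def generate_problem(depth: int, seed: int = 0):
    """Build a word problem whose proof is a linear chain of `depth` inference steps."""
    rng = random.Random(seed)
    people = [f"Person{i}" for i in range(depth + 1)]

    start = rng.randint(5, 20)
    facts = [Fact(people[0], start)]
    total = start
    for p in people[1:]:
        delta = rng.randint(1, 5) * rng.choice([1, -1])
        if total + delta < 0:  # keep the running total non-negative
            delta = -delta
        facts.append(Fact(p, delta))
        total += delta

    # Render the natural-language problem from the proof premises.
    lines = [f"{facts[0].person} has {facts[0].amount} apples."]
    for f in facts[1:]:
        if f.amount < 0:
            lines.append(f"{facts[0].person} gives {abs(f.amount)} apples to {f.person}.")
        else:
            lines.append(f"{facts[0].person} receives {f.amount} apples from {f.person}.")
    question = f"How many apples does {facts[0].person} have now?"

    # Chain-of-thought annotation: one arithmetic line per proof step.
    cot, running = [], facts[0].amount
    for f in facts[1:]:
        new_total = running + f.amount
        op = "+" if f.amount >= 0 else "-"
        cot.append(f"{running} {op} {abs(f.amount)} = {new_total}")
        running = new_total

    return " ".join(lines + [question]), cot, total


if __name__ == "__main__":
    problem, cot, answer = generate_problem(depth=3, seed=42)
    print(problem)
    print("Reasoning:", "; ".join(cot))
    print("Answer:", answer)
```

Increasing `depth` lengthens the proof chain; MathGAP itself additionally controls proof width and nonlinear (tree-shaped) proof structures, which this linear toy sketch does not capture.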
Related papers
- MathConstruct: Challenging LLM Reasoning with Constructive Proofs [0.9320657506524149]
MathConstruct is a new benchmark of 126 challenging problems sourced from various math competitions.
MathConstruct is suitable for evaluating large language models, as solution correctness can be easily verified.
arXiv Detail & Related papers (2025-02-14T14:44:22Z) - MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization, where models blindly apply learned problem-solving skills without checking whether they fit the modified problem.
arXiv Detail & Related papers (2025-02-10T13:31:46Z) - GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? [37.399561533852506]
We develop a grade school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control.
We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend.
arXiv Detail & Related papers (2025-02-07T17:05:25Z) - Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures [3.181878085746691]
Large language models (LLMs) have demonstrated remarkable mathematical capabilities, largely driven by chain-of-thought (CoT) prompting.
We propose that LLMs learn arithmetic by capturing algebraic structures, such as commutativity and identity properties.
Our findings indicate that leveraging algebraic structures can enhance LLMs' arithmetic capabilities, offering insights into improving their performance.
arXiv Detail & Related papers (2024-11-25T10:23:11Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z) - CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts and hints.
This benchmark is difficult, with the best model only scoring 58.1% in standard settings.
We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z) - Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions but struggle with complex ones.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for systematically evaluating LLMs' ability to follow complex instructions.
arXiv Detail & Related papers (2023-09-17T04:18:39Z) - Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z) - Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples [36.63316546586304]
Large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts.
We test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations.
Experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs.
arXiv Detail & Related papers (2023-05-24T15:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.