MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
- URL: http://arxiv.org/abs/2410.13502v3
- Date: Fri, 14 Feb 2025 18:15:01 GMT
- Title: MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
- Authors: Andreas Opedal, Haruki Shirakami, Bernhard Schölkopf, Abulhair Saparov, Mrinmaya Sachan
- Abstract summary: MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure.
Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider.
- Score: 80.96119560172224
- License:
- Abstract: Large language models (LLMs) can solve arithmetic word problems with high accuracy, but little is known about how well they generalize to more complex problems. This is difficult to study, as (i) much of the available evaluation data has already been seen by the most capable models during training, and (ii) existing benchmarks do not capture how problem proofs may be arbitrarily complex in various ways. In this paper, we present a data-generation framework for evaluating LLMs on problems with arbitrarily complex arithmetic proofs, called MathGAP. MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure, enabling systematic studies on easy-to-hard generalization with respect to complexity of proof trees. Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider. This effect is more pronounced in complex, nonlinear proof structures, which are challenging even for the most capable models. The models are also sensitive to simple changes in sentence ordering. However, they remain capable of solving some complex problems, suggesting that reasoning generalization is noisy.
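The abstract describes generating problems from specifications of their arithmetic proof structure, where "depth" corresponds to the length of the inference chain. A minimal sketch of this idea is below; the agent names and sentence templates are illustrative inventions, not MathGAP's actual grammar or logical forms.

```python
import random

def generate_linear_problem(depth, seed=0):
    """Build an arithmetic word problem whose solution requires a
    linear chain of `depth` comparison inferences, loosely in the
    spirit of MathGAP's proof-tree specifications. Agent names and
    sentence templates here are illustrative, not the paper's grammar."""
    rng = random.Random(seed)
    agents = [f"Agent{i}" for i in range(depth + 1)]
    start = rng.randint(5, 15)
    sentences = [f"{agents[0]} has {start} apples."]
    value = start
    for i in range(1, depth + 1):
        delta = rng.randint(1, 10)
        more = rng.choice([True, False])
        if not more and delta > value:  # keep item counts nonnegative
            more = True
        rel = "more" if more else "fewer"
        sentences.append(f"{agents[i]} has {delta} {rel} apples than {agents[i-1]}.")
        value = value + delta if more else value - delta
    sentences.append(f"How many apples does {agents[depth]} have?")
    # Each comparison sentence adds one inference step, so the proof
    # of the final answer is a linear tree of the requested depth.
    return " ".join(sentences), value

problem, answer = generate_linear_problem(depth=3, seed=42)
```

Because problem text and ground-truth answer are produced together from the same specification, accuracy can be measured at any depth without relying on data the model may have seen in training.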
Related papers
- MathConstruct: Challenging LLM Reasoning with Constructive Proofs [0.9320657506524149]
MathConstruct is a new benchmark of 126 challenging problems sourced from various math competitions.
MathConstruct is suitable for evaluating Large Language Models, as solution correctness can be easily verified.
arXiv Detail & Related papers (2025-02-14T14:44:22Z)
- MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations [90.07275414500154]
We observe significant performance drops on MATH-P-Hard across various models.
We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills.
arXiv Detail & Related papers (2025-02-10T13:31:46Z)
- GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? [37.399561533852506]
We develop a grade school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control.
We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend.
arXiv Detail & Related papers (2025-02-07T17:05:25Z)
- Unraveling Arithmetic in Large Language Models: The Role of Algebraic Structures [3.181878085746691]
Large language models (LLMs) have demonstrated remarkable mathematical capabilities, largely driven by chain-of-thought (CoT) prompting.
We propose that LLMs learn arithmetic by capturing algebraic structures, such as the Commutativity and Identity properties.
Our findings indicate that leveraging algebraic structures can enhance the LLMs' arithmetic capabilities, offering insights into improving their arithmetic performance.
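As a concrete reading of the claim above, the two properties named in the abstract can be stated as executable checks over integer arithmetic. This is a plain illustration of the definitions, not the paper's probing method:

```python
import operator

def is_commutative(op, pairs):
    # Commutativity: op(a, b) == op(b, a) for every sampled pair
    return all(op(a, b) == op(b, a) for a, b in pairs)

def has_identity(op, e, values):
    # Identity: op(a, e) == a for every sampled value
    return all(op(a, e) == a for a in values)

pairs = [(3, 7), (0, 5), (12, 12), (-4, 9)]
values = [v for pair in pairs for v in pair]

assert is_commutative(operator.add, pairs)    # a + b = b + a
assert is_commutative(operator.mul, pairs)    # a * b = b * a
assert has_identity(operator.add, 0, values)  # a + 0 = a
assert has_identity(operator.mul, 1, values)  # a * 1 = a
```

A model that has internalized these structures can, for example, treat "3 + 17" and "17 + 3" as the same computation rather than two memorized facts.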
arXiv Detail & Related papers (2024-11-25T10:23:11Z)
- GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently observed piece of evidence is that when math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z)
- MUSTARD: Mastering Uniform Synthesis of Theorem and Proof Data [85.50740598523818]
MUSTARD is a framework that masters uniform synthesis of theorem and proof data of high quality and diversity.
We present a theorem-and-proof benchmark MUSTARDSAUCE with 5,866 valid data points.
We perform extensive analysis and demonstrate that MUSTARD generates validated high-quality step-by-step data.
arXiv Detail & Related papers (2024-02-14T05:57:58Z)
- CHAMP: A Competition-level Dataset for Fine-Grained Analyses of LLMs' Mathematical Reasoning Capabilities [25.857946070979576]
Concept and Hint-Annotated Math Problems (CHAMP) consists of high school math competition problems annotated with concepts.
This benchmark is difficult, with the best model only scoring 58.1% in standard settings.
We find that models often arrive at the correct final answer through wrong reasoning steps.
arXiv Detail & Related papers (2024-01-13T03:18:16Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z)
- Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples [36.63316546586304]
Large language models (LLMs) possess some abstract deductive reasoning ability given chain-of-thought prompts.
We test on a broad set of deduction rules and measure their ability to generalize to more complex proofs from simpler demonstrations.
Experiments on four LLMs of various sizes and training objectives show that they are able to generalize to compositional proofs.
arXiv Detail & Related papers (2023-05-24T15:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.