DAG-Math: Graph-Guided Mathematical Reasoning in LLMs
- URL: http://arxiv.org/abs/2510.19842v1
- Date: Sun, 19 Oct 2025 21:05:17 GMT
- Title: DAG-Math: Graph-Guided Mathematical Reasoning in LLMs
- Authors: Yuanhe Zhang, Ilja Kuzborskij, Jason D. Lee, Chenlei Leng, Fanghui Liu,
- Abstract summary: Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT)<n>We propose modeling CoT as a certain rule-based process over directed acyclic graphs (DAGs)<n>We introduce logical closeness, a metric that quantifies how well a model's CoT trajectory adheres to the DAG structure.
- Score: 54.231935013127206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) demonstrate strong performance on mathematical problems when prompted with Chain-of-Thought (CoT), yet it remains unclear whether this success stems from search, rote procedures, or rule-consistent reasoning. To address this, we propose modeling CoT as a certain rule-based stochastic process over directed acyclic graphs (DAGs), where nodes represent intermediate derivation states and edges encode rule applications. Within this framework, we introduce logical closeness, a metric that quantifies how well a model's CoT trajectory (i.e., the LLM's final output) adheres to the DAG structure, providing evaluation beyond classical PASS@k metrics. Building on this, we introduce the DAG-MATH CoT format and construct a benchmark that guides LLMs to generate CoT trajectories in this format, thereby enabling the evaluation of their reasoning ability under our framework. Across standard mathematical reasoning datasets, our analysis uncovers statistically significant differences in reasoning fidelity among representative LLM families-even when PASS@k is comparable-highlighting gaps between final-answer accuracy and rule-consistent derivation. Our framework provides a balance between free-form CoT and formal proofs systems, offering actionable diagnostics for LLMs reasoning evaluation. Our benchmark and code are available at: https://github.com/YuanheZ/DAG-MATH-Formatted-CoT.
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL)<n>We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference.<n>Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z) - Linear-LLM-SCM: Benchmarking LLMs for Coefficient Elicitation in Linear-Gaussian Causal Models [28.281361951823765]
We introduce Linear-LLM-SCM, a plug-and-play benchmarking framework for evaluating large language models (LLMs)<n>We show challenges in such benchmarking tasks, namely, strongity in the results in some of the models and susceptibility to DAG misspecification via spurious edges in the continuous domains.<n>We also open-source the benchmarking framework so that researchers can utilize their DAGs and any off-the-shelf LLMs plug-and-play for evaluation in their domains effortlessly.
arXiv Detail & Related papers (2026-02-10T20:49:01Z) - Concise Geometric Description as a Bridge: Unleashing the Potential of LLM for Plane Geometry Problem Solving [50.05273675575345]
PlaneThought Problem Solving (PGPS) aims to solve a plane geometric problem based on a geometric diagram and problem textual descriptions.<n>Large Language Models (LLMs) possess strong reasoning skills, their direct application to PGPS is hindered by their inability to process visual diagrams.<n>We train a MLLM Interpreter to generate geometric descriptions for the visual diagram, and an off-the-shelf LLM is utilized to perform reasoning.
arXiv Detail & Related papers (2026-01-29T02:03:33Z) - KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? [4.473915603131591]
Chain-of-thought traces have been shown to improve performance of large language models in a plethora of reasoning tasks.<n>We introduce Causal CoT Graphs (CCGs), which are directed acyclic graphs automatically extracted from reasoning traces.
arXiv Detail & Related papers (2025-07-15T15:28:37Z) - Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM [11.181783720439563]
Large Language Models (LLMs) display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation.<n>RLMs often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting.<n>We introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs.
arXiv Detail & Related papers (2025-05-20T03:54:57Z) - Understanding Chain-of-Thought in LLMs through Information Theory [16.78730663293352]
We formalize Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) through an information-theoretic lens.<n>Specifically, our framework quantifies the information-gain' at each reasoning step, enabling the identification of failure modes.<n>We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K and PRM800k datasets.
arXiv Detail & Related papers (2024-11-18T19:14:36Z) - To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.52872152909785]
Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs)<n>We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks.
arXiv Detail & Related papers (2024-09-18T17:55:00Z) - On the Diagram of Thought [20.805936414171892]
Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi-step reasoning.<n>We introduce the Diagram of Thought (DoT), a new framework that enables a single LLM to build and navigate a mental map of its reasoning.
arXiv Detail & Related papers (2024-09-16T07:01:41Z) - DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z) - MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making.<n>We present a process-based benchmark MR-Ben that demands a meta-reasoning skill.<n>Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z) - Linear Temporal Logic Modulo Theories over Finite Traces (Extended
Version) [72.38188258853155]
This paper studies Linear Temporal Logic over Finite Traces (LTLf)
proposition letters are replaced with first-order formulas interpreted over arbitrary theories.
The resulting logic, called Satisfiability Modulo Theories (LTLfMT), is semi-decidable.
arXiv Detail & Related papers (2022-04-28T17:57:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.