Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors
- URL: http://arxiv.org/abs/2510.01831v1
- Date: Thu, 02 Oct 2025 09:26:26 GMT
- Title: Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors
- Authors: Dane Williamson, Yangfeng Ji, Matthew Dwyer,
- Abstract summary: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution.<n>We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways.<n>Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
- Score: 11.169118114200307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution. We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways. These errors are not due to gaps in mathematical competence, but rather reflect a brittle coupling between surface form and internal representation. To test this, we rephrase incorrectly answered questions using syntactic templates drawn from correct examples. These rephrasings, which preserve semantics while reducing structural complexity, often lead to correct answers. We quantify syntactic complexity using a metric based on Dependency Locality Theory (DLT), and show that higher DLT scores are associated with increased failure rates across multiple datasets. Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
Related papers
- From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics [79.81905350372067]
We study gap through contextual mathematical reasoning.<n>We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings.<n>Open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20.
arXiv Detail & Related papers (2026-01-30T14:56:04Z) - Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks [8.210112631285666]
Large language models (LLMs) have demonstrated strong performance on formal language tasks.<n>We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages.<n>We show that models achieve perfect accuracy on factual questions and 84-90% on seen tasks, but accuracy drops sharply on unseen problems.
arXiv Detail & Related papers (2026-01-19T21:00:31Z) - Less is More Tokens: Efficient Math Reasoning via Difficulty-Aware Chain-of-Thought Distillation [82.2288581878096]
We present a framework for difficulty-aware reasoning that teaches models to dynamically adjust reasoning depth based on problem complexity.<n>We show that models can be endowed with such dynamic inference pathways without any architectural modifications.
arXiv Detail & Related papers (2025-09-05T16:40:13Z) - Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities [45.8704193793732]
LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery.<n>Do these systems truly understand mathematical structure as humans do?<n>We investigate this question through the lens of mathematical inequalities.
arXiv Detail & Related papers (2025-05-19T03:56:05Z) - Large Language and Reasoning Models are Shallow Disjunctive Reasoners [15.56445409535547]
Large Language Models (LLMs) have been found to struggle with systematic reasoning.<n>This paper focuses on tasks that require systematic relational composition for qualitative spatial and temporal reasoning.<n>We find that, zero-shot LRMs generally outperform their LLM counterparts in single-path reasoning tasks but struggle in the multi-path setting.
arXiv Detail & Related papers (2025-03-30T15:41:55Z) - Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems.<n>We rigorously analyze both final answers and solution steps to identify reasoning failures.<n>We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z) - MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs [80.96119560172224]
MathGAP generates problem statements and chain-of-thought reasoning traces according to specifications about their arithmetic proof structure.<n>Using MathGAP, we find that LLMs show a significant decrease in performance as proofs get deeper and wider.
arXiv Detail & Related papers (2024-10-17T12:48:14Z) - Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE)<n>RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation.<n>Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z) - Large Language Models Struggle with Unreasonability in Math Problems [41.970853209666224]
Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks.<n>We observe that they often struggle when faced with unreasonable math problems.<n>We propose the textbfUnreasonable Math Problems (UMP) benchmark, designed to evaluate LLMs' ability to detect and respond to unreasonable math problem statements.
arXiv Detail & Related papers (2024-03-28T12:04:28Z) - GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers [68.77382332826167]
Large language models (LLMs) have achieved impressive performance across various mathematical reasoning benchmarks.
One essential and frequently occurring evidence is that when the math questions are slightly changed, LLMs can behave incorrectly.
This motivates us to evaluate the robustness of LLMs' math reasoning capability by testing a wide range of question variations.
arXiv Detail & Related papers (2024-02-29T15:26:14Z) - Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models.
We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models.
We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
arXiv Detail & Related papers (2023-06-01T17:44:35Z) - Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.