Related papers: Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks

Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks

URL: http://arxiv.org/abs/2601.13392v1
Date: Mon, 19 Jan 2026 21:00:31 GMT
Title: Beyond Memorization: Testing LLM Reasoning on Unseen Theory of Computation Tasks
Authors: Shlok Shelat, Jay Raval, Souvik Roy, Manas Gaur,
Abstract summary: Large language models (LLMs) have demonstrated strong performance on formal language tasks.<n>We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages.<n>We show that models achieve perfect accuracy on factual questions and 84-90% on seen tasks, but accuracy drops sharply on unseen problems.
Score: 8.210112631285666
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated strong performance on formal language tasks, yet whether this reflects genuine symbolic reasoning or pattern matching on familiar constructions remains unclear. We introduce a benchmark for deterministic finite automata (DFA) construction from regular languages, comprising factual knowledge questions, seen construction problems from public sources, and two types of unseen problems: hand-crafted instances with multiple interacting constraints and systematically generated problems via Arden's theorem. Models achieve perfect accuracy on factual questions and 84-90% on seen tasks. However, accuracy drops sharply on unseen problems (by 30-64%), with failures stemming from systematic misinterpretation of language constraints, incorrect handling of Kleene-star semantics, and a failure to preserve global consistency. We evaluate a three-stage hint protocol that enables correction of shallow errors but does not reliably resolve globally inconsistent or structurally flawed automata. Our analysis across multiple prompting strategies (direct, Chain-of-Thought, Tree-of-Thought) reveals that errors persist regardless of prompting approach, exposing a fundamental gap between LLMs' ability to generate syntactically plausible DFAs and their capacity for semantically correct formal reasoning.

Related papers

ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization [6.572539312871392]
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk.<n>We introduce ReLoop, addressing silent failures from two complementary directions.
arXiv Detail & Related papers (2026-02-17T20:20:33Z)
LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse [0.18268488712787334]
Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale.<n>We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions.<n>We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning.
arXiv Detail & Related papers (2026-02-10T14:38:13Z)
Decompose-and-Formalise: Recursively Verifiable Natural Language Inference [34.505373718498014]
Large language models (LLMs) with theorem provers (TPs) in neuro-symbolic pipelines helps with entailment verification and proof-guided refinement of explanations for natural language inference (NLI)<n>We propose a decompose-and-formalise framework that decomposes premise-hypothesis pairs into an entailment tree of atomic steps.<n>We also introduce $$-substitution in an event-based logical form to enforce consistent argument-role bindings.
arXiv Detail & Related papers (2026-01-27T13:43:30Z)
Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems [51.62477754641947]
We propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability.<n>Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology.<n>Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty.
arXiv Detail & Related papers (2025-12-01T13:32:59Z)
Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency.<n>We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z)
ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization [73.0780809974414]
We propose a Reflective Autoformalization method that integrates semantic consistency evaluation into the autoformalization process.<n>This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors.<n>Experiments show that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines.
arXiv Detail & Related papers (2025-10-28T16:22:54Z)
Syntactic Blind Spots: How Misalignment Leads to LLMs Mathematical Errors [11.169118114200307]
Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities but frequently fail on problems that deviate syntactically from their training distribution.<n>We identify a systematic failure mode, syntactic blind spots, in which models misapply familiar reasoning strategies to problems that are semantically straightforward but phrased in unfamiliar ways.<n>Our findings suggest that many reasoning errors stem from structural misalignment rather than conceptual difficulty, and that syntax-aware interventions can reveal and mitigate these inductive failures.
arXiv Detail & Related papers (2025-10-02T09:26:26Z)
Chain-of-Code Collapse: Reasoning Failures in LLMs via Adversarial Prompting in Code Generation [0.3495246564946556]
Large Language Models (LLMs) have achieved remarkable success in tasks requiring complex reasoning.<n>Do these models truly reason, or do they merely exploit shallow statistical patterns?<n>We introduce Chain-of-Code Collapse, where we investigate the robustness of reasoning LLMs by introducing a suite of semantically faithful yet adversarially structured prompt perturbations.
arXiv Detail & Related papers (2025-06-08T02:43:46Z)
Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models [11.379764847748378]
Large language models (LLMs) often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs.<n>This emphasizes the significance of possessing the textbfPremise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises.<n>We introduce the textbfPremise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics.
arXiv Detail & Related papers (2025-05-29T17:49:44Z)
Localizing Factual Inconsistencies in Attributable Text Generation [74.11403803488643]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation.<n>We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
arXiv Detail & Related papers (2024-10-09T22:53:48Z)
Uncertainty Quantification for In-Context Learning of Large Language Models [52.891205009620364]
In-context learning has emerged as a groundbreaking ability of Large Language Models (LLMs) We propose a novel formulation and corresponding estimation method to quantify both types of uncertainties. The proposed method offers an unsupervised way to understand the prediction of in-context learning in a plug-and-play fashion.
arXiv Detail & Related papers (2024-02-15T18:46:24Z)
WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models [35.088946378980914]
We run our benchmark on three state-of-the-art chat-LLMs (GPT3.5, GPT4 and Llama2-chat) We show that these models make errors even with as few as three objects. Errors persist even with chain-of-thought prompting and in-context learning.
arXiv Detail & Related papers (2023-11-27T15:38:17Z)
Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability. In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling. Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial. We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.