Related papers: RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems

URL: http://arxiv.org/abs/2510.09227v1
Date: Fri, 10 Oct 2025 10:13:47 GMT
Title: RegexPSPACE: A Benchmark for Evaluating LLM Reasoning on PSPACE-complete Regex Problems
Authors: Hyundong Jin, Joonghyuk Hahn, Yo-Sub Han,
Abstract summary: Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming.<n>We introduce a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin)<n>We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition.
Score: 9.63813674229442
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Large language models (LLMs) show strong performance across natural language processing (NLP), mathematical reasoning, and programming, and recent large reasoning models (LRMs) further emphasize explicit reasoning. Yet their computational limits, particularly spatial complexity constrained by finite context windows, remain poorly understood. While recent works often focus on problems within the NP complexity class, we push the boundary by introducing a novel benchmark grounded in two PSPACE-complete regular expression (regex) problems: equivalence decision (RegexEQ) and minimization (RegexMin). PSPACE-complete problems serve as a more rigorous standard for assessing computational capacity, as their solutions require massive search space exploration. We perform a double-exponential space exploration to construct a labeled dataset of over a million regex instances with a sound filtering process to build the benchmark. We conduct extensive evaluations on 6 LLMs and 5 LRMs of varying scales, revealing common failure patterns such as verbosity and repetition. With its well-defined structure and quantitative evaluation metrics, this work presents the first empirical investigation into the spatial computational limitations of LLMs and LRMs, offering a new framework for evaluating their advanced reasoning capabilities. Our code is available at https://github.com/hyundong98/RegexPSPACE .

Related papers

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings [5.30570508258782]
RE2-Bench is a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects.<n>A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks using RE2-Bench reveals a significant performance drop from Easy to Hard problems.
arXiv Detail & Related papers (2025-12-16T21:12:53Z)
Syzygy of Thoughts: Improving LLM CoT with the Minimal Free Resolution [59.39066657300045]
Chain-of-Thought (CoT) prompting enhances the reasoning of large language models (LLMs) by decomposing problems into sequential steps.<n>We propose Syzygy of Thoughts (SoT)-a novel framework that extends CoT by introducing auxiliary, interrelated reasoning paths.<n>SoT captures deeper logical dependencies, enabling more robust and structured problem-solving.
arXiv Detail & Related papers (2025-04-13T13:35:41Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
evaluating constraint on every token can be prohibitively expensive.<n> LCD can distort the global distribution over strings, sampling tokens based only on local information.<n>We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
Challenging the Boundaries of Reasoning: An Olympiad-Level Math Benchmark for Large Language Models [86.45058529521258]
OlymMATH is a novel Olympiad-level mathematical benchmark designed to rigorously test the complex reasoning capabilities of LLMs.<n>OlymMATH features 200 meticulously curated problems, each manually verified and available in parallel English and Chinese versions.
arXiv Detail & Related papers (2025-03-27T11:20:17Z)
GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity? [37.399561533852506]
We develop a grade school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control.<n>We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend.
arXiv Detail & Related papers (2025-02-07T17:05:25Z)
Tractable Offline Learning of Regular Decision Processes [50.11277112628193]
This work studies offline Reinforcement Learning (RL) in a class of non-Markovian environments called Regular Decision Processes (RDPs) Ins, the unknown dependency of future observations and rewards from the past interactions can be captured experimentally. Many algorithms first reconstruct this unknown dependency using automata learning techniques.
arXiv Detail & Related papers (2024-09-04T14:26:58Z)
Reliable Reasoning Beyond Natural Language [0.047888359248129786]
Large Language models (LLMs) often exhibit limitations in their ability to reason reliably and flexibly. We propose a neurosymbolic approach that prompts LLMs to extract and encode all relevant information from a problem statement as logical code statements. We then use a logic programming language (Prolog) to conduct the iterative computations of explicit deductive reasoning.
arXiv Detail & Related papers (2024-07-16T04:34:18Z)
PECC: Problem Extraction and Coding Challenges [3.287942619833188]
We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code. Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
arXiv Detail & Related papers (2024-04-29T15:02:14Z)
MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative. This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm. Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z)
Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions. Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions. We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.