Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures
- URL: http://arxiv.org/abs/2505.24069v2
- Date: Tue, 14 Oct 2025 01:24:23 GMT
- Title: Can LLMs Reason Structurally? An Evaluation via the Lens of Data Structures
- Authors: Yu He, Yingxi Li, Colin White, Ellen Vitercik,
- Abstract summary: We introduce DSR-Bench, the first benchmark to systematically evaluate large language models' structural reasoning.<n>The benchmark spans 20 data structures, 35 operations, and 4,140 synthetically generated problem instances with minimal contamination.
- Score: 21.390740746718947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) take on increasingly complex tasks, understanding their algorithmic reasoning abilities has become essential. However, existing evaluations focus on distinct and isolated tasks. We propose a unified diagnostic lens: structural reasoning--understanding and manipulating relationships like order, hierarchy, and connectivity. We introduce DSR-Bench, the first benchmark to systematically evaluate LLM structural reasoning through canonical data structures, which serve as interpretable, algorithmically meaningful abstractions. DSR-Bench spans 20 data structures, 35 operations, and 4,140 synthetically generated problem instances with minimal contamination. The benchmark's hierarchical design pinpoints specific failure modes, while its fully automated evaluation ensures objective and consistent assessment. Benchmarking ten state-of-the-art LLMs reveals critical limitations: the top-performing model scores only 0.498 out of 1 on challenging instances. Three additional evaluation suites reveal further weaknesses: models perform poorly on spatial data and natural language scenarios, and fail to reason over their own generated code. DSR-Bench offers a principled diagnostic tool for structural reasoning, helping expose reasoning bottlenecks and guide the development of more capable and reliable LLMs.
Related papers
- Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs [20.82580343824728]
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks.<n>This saturation stems from the dominance of template-based computation and shallow arithmetic decomposition.<n>We introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning.
arXiv Detail & Related papers (2026-01-31T07:09:17Z) - Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity [59.27594125465172]
We introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples.<n>We then introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data.
arXiv Detail & Related papers (2025-09-29T14:20:04Z) - CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks [14.408364047538578]
Large language models (LLMs) have been widely adopted across diverse domains of software engineering.<n>This work presents CORE, a benchmark designed to evaluate LLMs on fundamental static analysis tasks.
arXiv Detail & Related papers (2025-07-03T01:35:58Z) - StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns [7.60350050736492]
Long-term memory is essential for large language models to achieve autonomous intelligence.<n>Existing benchmarks face challenges in evaluating knowledge retention and dynamic sequential reasoning.<n>We propose a novel benchmark framework based on interactive fiction games.
arXiv Detail & Related papers (2025-06-16T10:54:31Z) - LLM-Symbolic Integration for Robust Temporal Tabular Reasoning [69.27153114778748]
We introduce TempTabQA-C, a synthetic dataset designed for systematic and controlled evaluations.<n>This structured approach allows Large Language Models (LLMs) to generate and executesql queries, enhancing generalization and mitigating biases.
arXiv Detail & Related papers (2025-06-06T05:14:04Z) - A Controllable Examination for Long-Context Language Models [62.845852724511964]
This study introduces $textbfLongBioBench, a benchmark for evaluating long-context language models.<n>We show that most models still exhibit deficiencies in semantic understanding and elementary reasoning over retrieved results.<n>Our further analysis indicates some design choices employed by existing synthetic benchmarks, such as contextual non-coherence.
arXiv Detail & Related papers (2025-06-03T14:23:06Z) - IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios.<n>Agent performance is judged by comparing its final numerical output to the human-derived baseline.<n>Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z) - Structured Prompting and Feedback-Guided Reasoning with LLMs for Data Interpretation [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and task generalization.<n>This paper introduces the STROT Framework, a method for structured prompting and feedback-driven transformation logic generation.
arXiv Detail & Related papers (2025-05-03T00:05:01Z) - HiBench: Benchmarking LLMs Capability on Hierarchical Structure Reasoning [25.088407009353162]
Existing benchmarks for structure reasoning mainly focus on horizontal and coordinate structures.<n>HiBench is the first framework spanning from initial structure generation to final proficiency assessment.<n>It consists of 30 tasks with varying hierarchical complexity, totaling 39,519 queries.
arXiv Detail & Related papers (2025-03-02T14:25:37Z) - FRIDA to the Rescue! Analyzing Synthetic Data Effectiveness in Object-Based Common Sense Reasoning for Disaster Response [19.744969357182665]
We introduce a dataset and pipeline to create Field Reasoning and Instruction Decoding Agent (FRIDA) models.<n>In our pipeline, domain experts and linguists combine their knowledge to make high-quality few-shot prompts.<n>We fine-tune several small instruction-tuned models and find that ablated FRIDA models only trained on objects' physical state and function data.
arXiv Detail & Related papers (2025-02-25T18:51:06Z) - Large Language Models Meet Symbolic Provers for Logical Reasoning Evaluation [24.081573908824353]
First-order logic (FOL) reasoning is pivotal for intelligent systems.<n>Existing benchmarks often rely on extensive human annotation or handcrafted templates.<n>We propose a novel framework called ProverGen that synergizes the generative strengths of Large Language Models with the rigor and precision of symbolic provers.
arXiv Detail & Related papers (2025-02-10T15:31:54Z) - TARGA: Targeted Synthetic Data Generation for Practical Reasoning over Structured Data [9.390415313514762]
TARGA is a framework that generates high-relevance synthetic data without manual annotation.<n>It substantially outperforms existing non-fine-tuned methods that utilize close-sourced model.<n>It exhibits superior sample efficiency, robustness, and generalization capabilities under non-I.I.D. settings.
arXiv Detail & Related papers (2024-12-27T09:16:39Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs.<n> Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets.<n>We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks.<n>Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales.<n>We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question.
To further improve the reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA)
Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z) - BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data [61.936320820180875]
Large language models (LLMs) have become increasingly pivotal across various domains.
BabelBench is an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution.
Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement.
arXiv Detail & Related papers (2024-10-01T15:11:24Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z) - When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities.
Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity.
Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.