Related papers: LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs

LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs

URL: http://arxiv.org/abs/2506.10527v1
Date: Thu, 12 Jun 2025 09:47:02 GMT
Title: LogiPlan: A Structured Benchmark for Logical Planning and Relational Reasoning in LLMs
Authors: Yanan Cai, Ahmed Salem, Besmira Nushi, Mark Russinovich,
Abstract summary: LogiPlan is a benchmark designed to evaluate the capabilities of large language models (LLMs) in logical planning and reasoning over complex relational structures.<n>We evaluate state-of-the-art models including DeepSeek R1, Gemini 2.0 Pro, Gemini 2 Flash Thinking, GPT-4.5, GPT-4o, Llama 3.1 405B, O3-mini, O1, and Claude 3.7 Sonnet across three tasks.
Score: 7.012555483275226
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We introduce LogiPlan, a novel benchmark designed to evaluate the capabilities of large language models (LLMs) in logical planning and reasoning over complex relational structures. Logical relational reasoning is important for applications that may rely on LLMs to generate and query structured graphs of relations such as network infrastructure, knowledge bases, or business process schema. Our framework allows for dynamic variation of task complexity by controlling the number of objects, relations, and the minimum depth of relational chains, providing a fine-grained assessment of model performance across difficulty levels. LogiPlan encompasses three complementary tasks: (1) Plan Generation, where models must construct valid directed relational graphs meeting specified structural constraints; (2) Consistency Detection, testing models' ability to identify inconsistencies in relational structures; and (3) Comparison Question, evaluating models' capacity to determine the validity of queried relationships within a given graph. Additionally, we assess models' self-correction capabilities by prompting them to verify and refine their initial solutions. We evaluate state-of-the-art models including DeepSeek R1, Gemini 2.0 Pro, Gemini 2 Flash Thinking, GPT-4.5, GPT-4o, Llama 3.1 405B, O3-mini, O1, and Claude 3.7 Sonnet across these tasks, revealing significant performance gaps that correlate with model scale and architecture. Our analysis demonstrates that while recent reasoning-enhanced models show promising results on simpler instances, they struggle with more complex configurations requiring deeper logical planning.

Related papers

Beyond Natural Language Plans: Structure-Aware Planning for Query-Focused Table Summarization [21.1381898110636]
We introduce a new structured plan, TaSoF, inspired by formalism in traditional multi-agent systems, and a framework, SPaGe, that formalizes the reasoning process in three phases.<n> Experiments on three public benchmarks show that SPaGe consistently outperforms prior models in both single- and multi-table settings.
arXiv Detail & Related papers (2025-07-30T16:42:19Z)
Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation [72.44384066166147]
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains.<n>Existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures.<n>We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch.
arXiv Detail & Related papers (2025-07-24T09:17:41Z)
Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design [20.31388126105889]
We propose M-DESIGN, a curated model knowledge base (MKB) pipeline for mastering neural network refinement.<n>First, we propose a knowledge weaving engine that reframes model refinement as an adaptive query problem over task metadata.<n>Given a user's task query, M-DESIGN quickly matches and iteratively refines candidate models by leveraging a graph-relational knowledge schema.
arXiv Detail & Related papers (2025-07-21T07:49:19Z)
A Pre-training Framework for Relational Data with Information-theoretic Principles [57.93973948947743]
We introduce Task Vector Estimation (TVE), a novel pre-training framework that constructs supervisory signals via set-based aggregation over relational graphs.<n>TVE consistently outperforms traditional pre-training baselines.<n>Our findings advocate for pre-training objectives that encode task heterogeneity and temporal structure as design principles for predictive modeling on relational databases.
arXiv Detail & Related papers (2025-07-14T00:17:21Z)
PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving [66.42260489147617]
We introduce PLAN-TUNING, a framework that distills synthetic task decompositions from large-scale language models.<n>Plan-TUNING fine-tunes smaller models via supervised and reinforcement-learning objectives to improve complex reasoning.<n>Our analysis demonstrates how planning trajectories improves complex reasoning capabilities.
arXiv Detail & Related papers (2025-07-10T07:30:44Z)
Relational Deep Learning: Challenges, Foundations and Next-Generation Architectures [50.46688111973999]
Graph machine learning has led to a significant increase in the capabilities of models that learn on arbitrary graph-structured data.<n>We present a new blueprint that enables end-to-end representation of'relational entity graphs' without traditional engineering feature.<n>We discuss key challenges including large-scale multi-table integration and the complexities of modeling temporal dynamics and heterogeneous data.
arXiv Detail & Related papers (2025-06-19T23:51:38Z)
Comparative Analysis of AI Agent Architectures for Entity Relationship Classification [1.6887793771613606]
In this study, we conduct a comparative analysis of three distinct AI agent architectures to perform relation classification.<n>The agentic architectures explored include (1) reflective self-evaluation, (2) hierarchical task decomposition, and (3) a novel multi-agent dynamic example generation mechanism.<n>Our experiments demonstrate that multi-agent coordination consistently outperforms standard few-shot prompting.
arXiv Detail & Related papers (2025-06-03T04:19:47Z)
DSR-Bench: Evaluating the Structural Reasoning Abilities of LLMs via Data Structures [20.596558700597644]
Large language models (LLMs) are increasingly deployed for real-world tasks that fundamentally involve data manipulation.<n>A core requirement is the ability to perform structural reasoning--that is, to understand and reason about data relationships.<n>We introduce DSR-Bench, a novel benchmark evaluating LLMs' structural reasoning capabilities through data structures.
arXiv Detail & Related papers (2025-05-29T23:24:53Z)
KG-QAGen: A Knowledge-Graph-Based Framework for Systematic Question Generation and Long-Context LLM Evaluation [3.618621510356872]
KG-QAGen is a framework that extracts QA pairs at multiple complexity levels.<n>We construct a dataset of 20,139 QA pairs and open-source a part of it.<n>We evaluate 13 proprietary and open-source LLMs and observe that even the best-performing models are struggling with set-based comparisons.
arXiv Detail & Related papers (2025-05-18T16:46:39Z)
Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs [57.66267515456075]
Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through textual representations.<n>We propose a zero-shot fully automatic framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z)
Matchmaker: Self-Improving Large Language Model Programs for Schema Matching [60.23571456538149]
We propose a compositional language model program for schema matching, comprised of candidate generation, refinement and confidence scoring. Matchmaker self-improves in a zero-shot manner without the need for labeled demonstrations. Empirically, we demonstrate on real-world medical schema matching benchmarks that Matchmaker outperforms previous ML-based approaches.
arXiv Detail & Related papers (2024-10-31T16:34:03Z)
On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability [59.72892401927283]
We evaluate the planning capabilities of OpenAI's o1 models across a variety of benchmark tasks. Our results reveal that o1-preview outperforms GPT-4 in adhering to task constraints.
arXiv Detail & Related papers (2024-09-30T03:58:43Z)
Guiding Language Model Reasoning with Planning Tokens [122.43639723387516]
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks. We propose a hierarchical generation scheme to encourage a more structural generation of chain-of-thought steps. Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme.
arXiv Detail & Related papers (2023-10-09T13:29:37Z)
Query Structure Modeling for Inductive Logical Reasoning Over Knowledge Graphs [67.043747188954]
We propose a structure-modeled textual encoding framework for inductive logical reasoning over KGs. It encodes linearized query structures and entities using pre-trained language models to find answers. We conduct experiments on two inductive logical reasoning datasets and three transductive datasets.
arXiv Detail & Related papers (2023-05-23T01:25:29Z)

This list is automatically generated from the titles and abstracts of the papers in this site.