ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems
- URL: http://arxiv.org/abs/2601.01982v1
- Date: Mon, 05 Jan 2026 10:36:40 GMT
- Title: ChaosBench-Logic: A Benchmark for Logical and Symbolic Reasoning on Chaotic Dynamical Systems
- Authors: Noel Thomas
- Abstract summary: Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) excel at natural language tasks but remain brittle in domains requiring precise logical and symbolic reasoning. Chaotic dynamical systems provide an especially demanding test because chaos is deterministic yet often misinterpreted as randomness or complexity. We introduce ChaosBench-Logic, a benchmark that evaluates LLM reasoning across 30 diverse dynamical systems using a unified first-order logic (FOL) ontology. Each system is annotated with truth assignments for 11 semantic predicates, and 621 questions are generated across seven reasoning categories, including multi-hop implications, cross-system analogies, counterfactual reasoning, bias probes, and multi-turn dialogues. We define metrics for logical accuracy, implication consistency, dialogue coherence, and contradiction, and we release an open-source evaluation pipeline. Initial experiments show that frontier LLMs such as GPT-4, Claude 3.5 Sonnet, Gemini 2.5 Flash, and the open-source LLaMA-3 70B achieve 91-94% per-item accuracy, yet still score 0% on compositional items and exhibit fragile global coherence. Dialogue-level accuracy ranges from 53.1% (GPT-4 CoT) to 75.5% (LLaMA-3 zero-shot). ChaosBench-Logic provides a rigorous testbed for diagnosing such failures and a foundation for developing neuro-symbolic approaches that improve scientific reasoning in LLMs.
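
As a concrete illustration of the benchmark's structure, the sketch below shows how a per-system predicate annotation and an implication-consistency metric might look. The predicate names, rule set, and schema here are hypothetical assumptions for illustration only; the paper's actual ontology and metrics live in its open-source evaluation pipeline.

```python
# Hypothetical sketch of a ChaosBench-Logic-style annotation and an
# implication-consistency metric. Predicate names, the rule set, and
# the schema are illustrative assumptions, not the paper's ontology.
from dataclasses import dataclass

# FOL-style implication rules over semantic predicates; each pair
# (A, B) reads "A(x) -> B(x)".
RULES = [
    ("Chaotic", "Deterministic"),
    ("Chaotic", "SensitiveToInitialConditions"),
    ("Periodic", "Bounded"),
]

@dataclass
class SystemAnnotation:
    name: str
    predicates: dict[str, bool]  # gold truth assignment per predicate

def implication_consistency(answers: dict[str, bool]) -> float:
    """Fraction of applicable rules a model's answers respect.

    A rule A -> B is violated when the model asserts A but denies B.
    """
    applicable = [(a, b) for a, b in RULES if a in answers and b in answers]
    if not applicable:
        return 1.0
    ok = sum(1 for a, b in applicable if not (answers[a] and not answers[b]))
    return ok / len(applicable)

# Gold annotation for one system (illustrative truth values).
lorenz = SystemAnnotation(
    name="Lorenz",
    predicates={"Chaotic": True, "Deterministic": True,
                "SensitiveToInitialConditions": True},
)

# A model that calls the Lorenz system chaotic but non-deterministic
# violates Chaotic -> Deterministic, so consistency drops to 0.5.
model_answers = {"Chaotic": True, "Deterministic": False,
                 "SensitiveToInitialConditions": True}
print(implication_consistency(model_answers))  # 0.5
```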
Related papers
- Training LLMs with LogicReward for Faithful and Rigorous Reasoning [75.30425553246177]
We propose LogicReward, a reward system that guides model training by enforcing step-level logical correctness with a theorem prover. An 8B model trained on data constructed with LogicReward surpasses GPT-4o and o4-mini by 11.6% and 2% on natural language inference and logical reasoning tasks.
arXiv Detail & Related papers (2025-12-20T03:43:02Z) - MuSLR: Multimodal Symbolic Logical Reasoning [133.85551954182105]
Multimodal symbolic logical reasoning is critical in high-stakes applications such as autonomous driving and medical diagnosis. We introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. We propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1's Chain-of-Thought performance by 14.13%.
arXiv Detail & Related papers (2025-09-30T06:42:20Z) - From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning [16.381034926435074]
LogicAgent is a semiotic-square-guided framework designed to jointly address logical complexity and semantic complexity. To overcome the semantic simplicity and low logical complexity of existing datasets, we introduce RepublicQA, a benchmark that reaches college-level difficulty. Experiments demonstrate that LogicAgent achieves state-of-the-art performance on RepublicQA, with a 6.25% average gain over strong baselines.
arXiv Detail & Related papers (2025-09-29T13:31:22Z) - Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models [58.456656119178064]
Vision-Language Models (VLMs) have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored. We introduce LogicBench, a benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios. We propose LogicCLIP, a training framework designed to boost VLMs' logical sensitivity.
arXiv Detail & Related papers (2025-08-15T08:40:13Z) - CALM: Contextual Analog Logic with Multimodality [9.763339269757227]
We introduce Contextual Analog Logic with Multimodality (CALM). CALM unites symbolic reasoning with neural generation. It enables systems to make context-sensitive decisions grounded in real-world multi-modal data.
arXiv Detail & Related papers (2025-06-17T19:40:32Z) - LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models [9.339988760379915]
LogicTree is an inference-time modular framework employing algorithm-guided search to automate structured proof exploration. We introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.
arXiv Detail & Related papers (2025-04-18T22:10:02Z) - SCoRE: Benchmarking Long-Chain Reasoning in Commonsense Scenarios [33.72114830484246]
We introduce SCoRE (Scenario-based Commonsense Reasoning Evaluation), a benchmark that synthesizes multi-hop questions from scenario schemas of entities, relations, and logical rules. SCoRE contains 100k bilingual (Chinese-English) multiple-choice questions whose reasoning chains span 2-11 hops and are grouped into various difficulty levels.
arXiv Detail & Related papers (2025-03-08T13:40:10Z) - LogicGame: Benchmarking Rule-Based Reasoning Abilities of Large Language Models [87.49676980090555]
Large Language Models (LLMs) have demonstrated notable capabilities across various tasks, showcasing complex problem-solving abilities.
We introduce LogicGame, a novel benchmark designed to evaluate the comprehensive rule understanding, execution, and planning capabilities of LLMs.
arXiv Detail & Related papers (2024-08-28T13:16:41Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But, can they really "reason" over the natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
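
To make the LINC pattern concrete, here is a minimal sketch of the modular neurosymbolic pipeline the abstract describes: a language model translates natural-language statements into first-order logic, and an external symbolic prover performs the deduction. The function names and interfaces below are hypothetical placeholders, not APIs from the LINC codebase.

```python
# Minimal sketch of a LINC-style neurosymbolic pipeline (hypothetical
# interfaces; not the actual LINC API). The LLM acts only as a semantic
# parser; all deduction is delegated to a symbolic FOL prover.
from typing import Callable, List, Literal

Verdict = Literal["True", "False", "Uncertain"]

def linc_pipeline(
    premises: List[str],
    conclusion: str,
    llm_translate_to_fol: Callable[[str], str],  # placeholder LLM call
    prove: Callable[[List[str], str], Verdict],  # placeholder prover call
) -> Verdict:
    # Step 1: translate each natural-language statement into FOL.
    fol_premises = [llm_translate_to_fol(p) for p in premises]
    fol_conclusion = llm_translate_to_fol(conclusion)
    # Step 2: ask the symbolic prover whether the conclusion follows.
    return prove(fol_premises, fol_conclusion)
```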
arXiv Detail & Related papers (2023-10-23T17:58:40Z) - Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning [101.26814728062065]
Large Language Models (LLMs) have shown human-like reasoning abilities but still struggle with complex logical problems.
This paper introduces a novel framework, Logic-LM, which integrates LLMs with symbolic solvers to improve logical problem-solving.
arXiv Detail & Related papers (2023-05-20T22:25:38Z)