LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
- URL: http://arxiv.org/abs/2601.16504v2
- Date: Tue, 27 Jan 2026 18:33:20 GMT
- Title: LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
- Authors: Obed Junias, Maria Leonor Pacheco,
- Abstract summary: LOGICAL-COMMONSENSEQA re-frames commonsense reasoning as logical composition over pairs of atomic statements.<n>We find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions.
- Score: 6.8658995041250455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.
Related papers
- OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning [51.58235452818926]
We introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning.<n>A large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction.
arXiv Detail & Related papers (2026-01-30T15:31:58Z) - Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency.<n>We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z) - DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models [58.439517684779936]
This paper proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way.<n>To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in Large Language Models.
arXiv Detail & Related papers (2025-09-19T04:40:46Z) - Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding [4.9301587184653295]
Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models.<n>Existing benchmarks often treat negation as a side case within broader tasks like natural language inference.<n>We introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs.
arXiv Detail & Related papers (2025-06-17T10:51:39Z) - When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs [19.354141845315276]
Chain-of-thought reasoning can significantly degrade instruction-following accuracy.<n>This is the first work to systematically expose reasoning-induced failures in instruction-following.
arXiv Detail & Related papers (2025-05-16T16:36:00Z) - Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions [0.36868085124383626]
This paper proposes a benchmark that corresponds to various defeasible rule-based reasoning patterns.
We modified an existing benchmark for defeasible logic reasoners by translating defeasible rules into text suitable for Large Language Models.
We conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared it with reasoning patterns defined by defeasible logic.
arXiv Detail & Related papers (2024-10-16T12:36:23Z) - Modeling Hierarchical Reasoning Chains by Linking Discourse Units and
Key Phrases for Reading Comprehension [80.99865844249106]
We propose a holistic graph network (HGN) which deals with context at both discourse level and word level, as the basis for logical reasoning.
Specifically, node-level and type-level relations, which can be interpreted as bridges in the reasoning process, are modeled by a hierarchical interaction mechanism.
arXiv Detail & Related papers (2023-06-21T07:34:27Z) - Rationale-Augmented Ensembles in Language Models [53.45015291520658]
We reconsider rationale-augmented prompting for few-shot in-context learning.
We identify rationale sampling in the output space as the key component to robustly improve performance.
We demonstrate that rationale-augmented ensembles achieve more accurate and interpretable results than existing prompting approaches.
arXiv Detail & Related papers (2022-07-02T06:20:57Z) - Logical Satisfiability of Counterfactuals for Faithful Explanations in
NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals.
It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation.
It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.