Related papers: LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning

URL: http://arxiv.org/abs/2601.16504v2
Date: Tue, 27 Jan 2026 18:33:20 GMT
Title: LOGICAL-COMMONSENSEQA: A Benchmark for Logical Commonsense Reasoning
Authors: Obed Junias, Maria Leonor Pacheco,
Abstract summary: LOGICAL-COMMONSENSEQA re-frames commonsense reasoning as logical composition over pairs of atomic statements.<n>We find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions.
Score: 6.8658995041250455
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Commonsense reasoning often involves evaluating multiple plausible interpretations rather than selecting a single atomic answer, yet most benchmarks rely on single-label evaluation, obscuring whether statements are jointly plausible, mutually exclusive, or jointly implausible. We introduce LOGICAL-COMMONSENSEQA, a benchmark that re-frames commonsense reasoning as logical composition over pairs of atomic statements using plausibility-level operators (AND, OR, NEITHER/NOR). Evaluating instruction-tuned, reasoning-specialized, and fine-tuned models under zero-shot, few-shot, and chain-of-thought prompting, we find that while models perform reasonably on conjunctive and moderately on disjunctive reasoning, performance degrades sharply on negation-based questions. LOGICAL-COMMONSENSEQA exposes fundamental reasoning limitations and provides a controlled framework for advancing compositional commonsense reasoning.

Related papers

OrLog: Resolving Complex Queries with LLMs and Probabilistic Reasoning [51.58235452818926]
We introduce OrLog, a neuro-symbolic retrieval framework that decouples predicate-level plausibility estimation from logical reasoning.<n>A large language model (LLM) provides plausibility scores for atomic predicates in one decoding-free forward pass, from which a probabilistic reasoning engine derives the posterior probability of query satisfaction.
arXiv Detail & Related papers (2026-01-30T15:31:58Z)
Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency.<n>We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z)
DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models [58.439517684779936]
This paper proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way.<n>To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in Large Language Models.
arXiv Detail & Related papers (2025-09-19T04:40:46Z)
Thunder-NUBench: A Benchmark for LLMs' Sentence-Level Negation Understanding [4.9301587184653295]
Negation is a fundamental linguistic phenomenon that poses persistent challenges for Large Language Models.<n>Existing benchmarks often treat negation as a side case within broader tasks like natural language inference.<n>We introduce Thunder-NUBench, a novel benchmark explicitly designed to assess sentence-level negation understanding in LLMs.
arXiv Detail & Related papers (2025-06-17T10:51:39Z)
When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs [19.354141845315276]
Chain-of-thought reasoning can significantly degrade instruction-following accuracy.<n>This is the first work to systematically expose reasoning-induced failures in instruction-following.
arXiv Detail & Related papers (2025-05-16T16:36:00Z)
Benchmarking Defeasible Reasoning with Large Language Models -- Initial Experiments and Future Directions [0.36868085124383626]
This paper proposes a benchmark that corresponds to various defeasible rule-based reasoning patterns. We modified an existing benchmark for defeasible logic reasoners by translating defeasible rules into text suitable for Large Language Models. We conducted preliminary experiments on nonmonotonic rule-based reasoning using ChatGPT and compared it with reasoning patterns defined by defeasible logic.
arXiv Detail & Related papers (2024-10-16T12:36:23Z)
Modeling Hierarchical Reasoning Chains by Linking Discourse Units and Key Phrases for Reading Comprehension [80.99865844249106]
We propose a holistic graph network (HGN) which deals with context at both discourse level and word level, as the basis for logical reasoning. Specifically, node-level and type-level relations, which can be interpreted as bridges in the reasoning process, are modeled by a hierarchical interaction mechanism.
arXiv Detail & Related papers (2023-06-21T07:34:27Z)
Rationale-Augmented Ensembles in Language Models [53.45015291520658]
We reconsider rationale-augmented prompting for few-shot in-context learning. We identify rationale sampling in the output space as the key component to robustly improve performance. We demonstrate that rationale-augmented ensembles achieve more accurate and interpretable results than existing prompting approaches.
arXiv Detail & Related papers (2022-07-02T06:20:57Z)
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals. It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation. It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.