BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
- URL: http://arxiv.org/abs/2506.06955v4
- Date: Mon, 14 Jul 2025 02:22:42 GMT
- Title: BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
- Authors: Ha-Thanh Nguyen, Chaoran Liu, Qianying Liu, Hideyuki Tachibana, Su Myat Noe, Yusuke Miyao, Koichi Takeda, Sadao Kurohashi
- Abstract summary: We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora.
- Score: 36.88497412912196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
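For illustration, the sketch below shows how a belief-inconsistent syllogism item of the kind described above might be represented and scored: the model is asked whether the conclusion follows logically from the premises, and accuracy counts agreement with logical validity rather than with real-world plausibility. The field names, the example item, and the `judge` callable are assumptions made for this sketch, not the dataset's actual schema or the authors' evaluation harness.

```python
# Minimal sketch of a belief-inconsistent syllogism item and an accuracy metric.
# Field names, the example item, and the judge callable are illustrative
# assumptions, not the BIS Reasoning 1.0 schema or the authors' evaluation code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyllogismItem:
    premises: List[str]   # e.g. ["All mammals lay eggs.", "Whales are mammals."]
    conclusion: str       # e.g. "Therefore, whales lay eggs."
    gold_label: str       # logical validity of the conclusion: "valid" or "invalid"


def accuracy(items: List[SyllogismItem], judge: Callable[[SyllogismItem], str]) -> float:
    """Fraction of items where the model's verdict matches logical validity,
    regardless of whether the conclusion agrees with real-world beliefs."""
    correct = sum(judge(item) == item.gold_label for item in items)
    return correct / len(items)


# Toy usage: a dummy judge that always answers "valid" scores 1.0 on this
# belief-inconsistent but logically valid item; a belief-biased model would
# instead tend to answer "invalid" and lose accuracy.
items = [
    SyllogismItem(
        premises=["All mammals lay eggs.", "Whales are mammals."],
        conclusion="Therefore, whales lay eggs.",
        gold_label="valid",
    ),
]
print(accuracy(items, judge=lambda item: "valid"))  # 1.0
```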
Related papers
- FairReason: Balancing Reasoning and Social Bias in MLLMs [50.618158642714505]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities. Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z)
- MMLU-Reason: Benchmarking Multi-Task Multi-modal Language Understanding and Reasoning [40.55833679660528]
We introduce MMLU-Reason, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. MMLU-Reason comprises a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands. Overall, MMLU-Reason offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.
arXiv Detail & Related papers (2025-05-22T09:41:55Z)
- Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection. Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z)
- Belief in the Machine: Investigating Epistemological Blind Spots of Language Models [51.63547465454027]
Language models (LMs) are essential for reliable decision-making in fields like healthcare, law, and journalism.
This study systematically evaluates the capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE.
Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios.
Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data.
arXiv Detail & Related papers (2024-10-28T16:38:20Z)
- Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models [31.558429029429863]
Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems. This work examines logical preference consistency as a foundational requirement for building more dependable LLM systems. We show that improving consistency leads to better performance in LLM-driven logic-based algorithms.
arXiv Detail & Related papers (2024-10-03T04:34:04Z)
- A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences [5.141416267381492]
We consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology.
We investigate the effects of chain-of-thought reasoning, in-context learning, and supervised fine-tuning on syllogistic reasoning.
Our results suggest that the behavior of pre-trained LLMs can be explained by cognitive science.
arXiv Detail & Related papers (2024-06-17T08:59:04Z)
- LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs).
Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models.
We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
- Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training.
We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion.
The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)