RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises
- URL: http://arxiv.org/abs/2502.13125v1
- Date: Tue, 18 Feb 2025 18:47:11 GMT
- Title: RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises
- Authors: Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
- Abstract summary: We present RuozhiBench, a dataset containing 677 carefully curated questions that contain various forms of deceptive reasoning. We evaluate 17 large language models (LLMs) from 5 series over RuozhiBench using both open-ended and two-choice formats. LLMs showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human accuracy of more than 90%.
- Score: 41.39610589639382
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have shown that they can answer questions requiring complex reasoning. However, their ability to identify and respond to text containing logical fallacies or deliberately misleading premises remains less studied. To address this gap, we introduce RuozhiBench, a bilingual dataset comprising 677 carefully curated questions that contain various forms of deceptive reasoning, meticulously crafted through extensive human effort and expert review. In a comprehensive evaluation of 17 LLMs from 5 series over RuozhiBench using both open-ended and two-choice formats, we conduct extensive analyses of evaluation protocols and result patterns. Despite their high scores on conventional benchmarks, these models showed limited ability to detect and reason correctly about logical fallacies, with even the best-performing model, Claude-3-haiku, achieving only 62% accuracy compared to human accuracy of more than 90%.
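As a concrete illustration of the two-choice protocol mentioned in the abstract, the following is a minimal sketch, not the authors' released evaluation harness: the field names ("question", "choices", "answer"), the prompt template, and the query_model stub are assumptions for illustration only.

```python
# Minimal sketch of a two-choice evaluation loop for a RuozhiBench-style
# dataset. The dataset schema and query_model stub are assumptions, not the
# authors' released tooling.
import random
from typing import Callable, Dict, List


def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call (API or local model); answers randomly."""
    return random.choice(["A", "B"])


def two_choice_accuracy(items: List[Dict], ask: Callable[[str], str]) -> float:
    """Score a model on questions that offer exactly two answer options (A/B)."""
    correct = 0
    for item in items:
        prompt = (
            f"{item['question']}\n"
            f"A. {item['choices'][0]}\n"
            f"B. {item['choices'][1]}\n"
            "Answer with A or B only."
        )
        reply = ask(prompt).strip().upper()
        predicted = reply[0] if reply and reply[0] in "AB" else ""
        if predicted == item["answer"]:
            correct += 1
    return correct / len(items)


if __name__ == "__main__":
    # Toy item with a misleading premise, in the spirit of the benchmark.
    data = [
        {
            "question": "My watch never loses a second once it has stopped, "
                        "so is it the most accurate watch in the world?",
            "choices": [
                "Yes, since it never loses any time.",
                "No, the premise is misleading: a stopped watch is wrong almost all the time.",
            ],
            "answer": "B",
        }
    ]
    print(f"two-choice accuracy = {two_choice_accuracy(data, query_model):.2%}")
```

In practice, query_model would be replaced by a call to the model under test and the toy item by the full benchmark; open-ended responses would instead require a separate judging step.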
Related papers
- Can AI Master Construction Management (CM)? Benchmarking State-of-the-Art Large Language Models on CM Certification Exams [2.897171041611256]
This study introduces CMExamSet, a benchmarking dataset comprising 689 authentic multiple-choice questions from four nationally accredited CM certification exams.
Results indicate that GPT-4o and Claude 3.7 surpass typical human pass thresholds (70%), with average accuracies of 82% and 83%, respectively.
Conceptual misunderstandings are the most common errors, underscoring the need for enhanced domain-specific reasoning models.
arXiv Detail & Related papers (2025-04-04T18:13:45Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.
These findings highlight the reliance on recall over rigorous logical inference.
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities [39.68147391225923]
We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs).
This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents.
We introduce an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis.
arXiv Detail & Related papers (2025-02-25T03:29:53Z) - None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks [0.9831489366502301]
We introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts (a minimal sketch of this option-substitution idea appears after this list). Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish. Results show that all models experience remarkable accuracy drops, with an average loss of 57% on MMLU and 50% on UNED-Access 2024.
arXiv Detail & Related papers (2025-02-18T14:32:44Z) - Large Language Models and Mathematical Reasoning Failures [1.6114012813668932]
This paper investigates the mathematical reasoning capabilities of large language models (LLMs) using 50 newly constructed high-school-level word problems. We rigorously analyze both final answers and solution steps to identify reasoning failures. We find that while newer models (e.g., o3-mini, deepseek-r1) achieve higher accuracy, all models exhibit errors in spatial reasoning, strategic planning, and arithmetic.
arXiv Detail & Related papers (2025-02-17T09:07:32Z) - Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation [65.92001420372007]
This paper systematically evaluates state-of-the-art MLLMs across diverse benchmarks.
We introduce the first benchmark GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments.
arXiv Detail & Related papers (2025-01-31T10:37:48Z) - JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models.
JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures.
Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z) - LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages [8.754506364968394]
LingOly is a novel benchmark for advanced reasoning abilities in large language models.
We evaluate capabilities for in-context identification and generalisation of linguistic patterns in very low-resource or extinct languages.
We assess performance with both direct accuracy and comparison to a no-context baseline to penalise memorisation.
arXiv Detail & Related papers (2024-06-10T11:50:29Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, counts of objects, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work examines the factual consistency of LLMs in dialogue comprehension via the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)
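For the "None of the Others" entry above, the following is a minimal sketch, under our reading of that abstract, of the option-substitution idea: the correct option's text is replaced by a generic "None of the other answers" choice, so the gold answer no longer shares tokens with anything the model may have memorised. The item schema and function name are assumptions, not the paper's released code.

```python
# Hedged sketch of an option-substitution variant for a multiple-choice item,
# in the spirit of "None of the Others". The mcq dict layout is an assumption.
import copy
from typing import Dict


def none_of_the_others_variant(mcq: Dict) -> Dict:
    """Return a copy of a multiple-choice item whose correct option's text is
    replaced by 'None of the other answers.' The gold index is unchanged, but
    its text no longer reveals the answer."""
    variant = copy.deepcopy(mcq)
    variant["options"][variant["answer_index"]] = "None of the other answers."
    return variant


if __name__ == "__main__":
    item = {
        "question": "Which planet is known as the Red Planet?",
        "options": ["Venus", "Mars", "Jupiter", "Mercury"],
        "answer_index": 1,
    }
    print(none_of_the_others_variant(item))
```

Scoring then proceeds as for any multiple-choice item; a model that relies on memorised surface forms should now tend to pick a distractor, while a model that reasons by elimination can still recover the gold answer.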