Related papers: A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

URL: http://arxiv.org/abs/2406.11050v2
Date: Fri, 04 Oct 2024 04:40:03 GMT
Title: A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
Authors: Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J. Su, Camillo J. Taylor, Dan Roth,
Abstract summary: This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities. We develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems.
Score: 58.15511660018742
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities. Codes and data are open-sourced at https://github.com/bowen-upenn/llm_token_bias.

Related papers

BiasCause: Evaluate Socially Biased Causal Reasoning of Large Language Models [8.368810576899556]
Large language models (LLMs) generate content including social bias against certain sensitive groups. This paper goes one step further to evaluate the causal reasoning process of LLMs when they answer questions eliciting social biases.
arXiv Detail & Related papers (2025-04-08T20:00:14Z)
Evaluating Social Biases in LLM Reasoning [19.824838766883534]
This paper evaluated the 8B and 32B variants of DeepSeek-R1 against their instruction tuned counterparts on the BBQ dataset. To the best of our knowledge, this empirical study is the first to assess bias issues in LLM reasoning.
arXiv Detail & Related papers (2025-02-21T10:16:07Z)
JustLogic: A Comprehensive Benchmark for Evaluating Deductive Reasoning in Large Language Models [51.99046112135311]
We introduce JustLogic, a synthetically generated deductive reasoning benchmark for rigorous evaluation of Large Language Models. JustLogic is highly complex, capable of generating a diverse range of linguistic patterns, vocabulary, and argument structures. Our experimental results reveal that most state-of-the-art (SOTA) LLMs perform significantly worse than the human average.
arXiv Detail & Related papers (2025-01-24T15:49:10Z)
Critical-Questions-of-Thought: Steering LLM reasoning with Argumentative Querying [0.3659498819753633]
State-of-the-art Large Language models (LLMs) continue to struggle when performing logical and mathematical reasoning. This paper makes use of the notion of critical questions from the literature on argumentation theory, focusing in particular on Toulmin's model of argumentation. We show that employing these critical questions can improve the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-19T18:51:30Z)
Rulebreakers Challenge: Revealing a Blind Spot in Large Language Models' Reasoning with Formal Logic [3.0648414540406703]
This study introduces the concept of "rulebreakers", which refers to instances where logical entailment diverges from factually acceptable inference. We present RULEBREAKERS, a novel dataset for evaluating Large Language Models' ability to distinguish between rulebreakers and non-rulebreakers.
arXiv Detail & Related papers (2024-10-21T20:48:16Z)
Are LLMs Good Zero-Shot Fallacy Classifiers? [24.3005882003251]
We focus on leveraging Large Language Models (LLMs) for zero-shot fallacy classification. With comprehensive experiments on benchmark datasets, we suggest that LLMs could be potential zero-shot fallacy classifiers. Our novel multi-round prompting schemes can effectively bring about more improvements, especially for small LLMs.
arXiv Detail & Related papers (2024-10-19T09:38:55Z)
Evaluating Consistency and Reasoning Capabilities of Large Language Models [0.0]
Large Language Models (LLMs) are extensively used today across various sectors, including academia, research, business, and finance. Despite their widespread adoption, these models often produce incorrect and misleading information, exhibiting a tendency to hallucinate. This paper aims to evaluate and compare the consistency and reasoning capabilities of both public and proprietary LLMs.
arXiv Detail & Related papers (2024-04-25T10:03:14Z)
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks. But, can they really "reason" over the natural language? This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z)
LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
Task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities. If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information. To address this issue, we suggest to use RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z)
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs) Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
Large Language Models are In-Context Semantic Reasoners rather than Symbolic Reasoners [75.85554779782048]
Large Language Models (LLMs) have excited the natural language and machine learning community over recent years. Despite of numerous successful applications, the underlying mechanism of such in-context capabilities still remains unclear. In this work, we hypothesize that the learned textitsemantics of language tokens do the most heavy lifting during the reasoning process.
arXiv Detail & Related papers (2023-05-24T07:33:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.