Related papers: More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning

More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning

URL: http://arxiv.org/abs/2506.03923v1
Date: Wed, 04 Jun 2025 13:15:01 GMT
Title: More or Less Wrong: A Benchmark for Directional Bias in LLM Comparative Reasoning
Authors: Mohammadamin Shafiei, Hamidreza Saffari, Nafise Sadat Moosavi,
Abstract summary: We study the mechanisms by which semantic cues shape reasoning in large language models.<n>We introduce MathComp, a benchmark of 300 comparison scenarios.<n>We find that model errors frequently reflect linguistic steering, systematic shifts toward the comparative term present in the prompt.
Score: 10.301985230669684
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large language models (LLMs) are known to be sensitive to input phrasing, but the mechanisms by which semantic cues shape reasoning remain poorly understood. We investigate this phenomenon in the context of comparative math problems with objective ground truth, revealing a consistent and directional framing bias: logically equivalent questions containing the words ``more'', ``less'', or ``equal'' systematically steer predictions in the direction of the framing term. To study this effect, we introduce MathComp, a controlled benchmark of 300 comparison scenarios, each evaluated under 14 prompt variants across three LLM families. We find that model errors frequently reflect linguistic steering, systematic shifts toward the comparative term present in the prompt. Chain-of-thought prompting reduces these biases, but its effectiveness varies: free-form reasoning is more robust, while structured formats may preserve or reintroduce directional drift. Finally, we show that including demographic identity terms (e.g., ``a woman'', ``a Black person'') in input scenarios amplifies directional drift, despite identical underlying quantities, highlighting the interplay between semantic framing and social referents. These findings expose critical blind spots in standard evaluation and motivate framing-aware benchmarks for diagnosing reasoning robustness and fairness in LLMs.

Related papers

CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models.<n>CaulFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations.<n>We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
arXiv Detail & Related papers (2026-02-23T18:06:15Z)
LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse [0.18268488712787334]
Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale.<n>We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions.<n>We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning.
arXiv Detail & Related papers (2026-02-10T14:38:13Z)
DeFrame: Debiasing Large Language Models Against Framing Effects [12.839436067299188]
Large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial.<n>Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside evaluation settings.<n>We identify framing -- differences in how semantically equivalent prompts are expressed -- as an underexplored contributor to this gap.
arXiv Detail & Related papers (2026-02-04T08:15:51Z)
The Paradox of Robustness: Decoupling Rule-Based Logic from Affective Noise in High-Stakes Decision-Making [1.0671844383558033]
Large Language Models (LLMs) are widely documented to be sensitive to minor prompt perturbations and prone to sycophantic alignment with user biases.<n>We quantify a robustness gap where LLMs demonstrate 110-300 times greater resistance to narrative manipulation than human subjects.<n>We show that while LLMs may be "brittle" to how a query is formatted, they are remarkably "stable" against why a decision should be biased.
arXiv Detail & Related papers (2026-01-29T09:17:05Z)
Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence.<n>We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs.<n>Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z)
Are Language Models Efficient Reasoners? A Perspective from Logic Programming [109.47572890883248]
Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency.<n>We propose a framework for assessing LM reasoning efficiency through the lens of logic programming.
arXiv Detail & Related papers (2025-10-29T15:30:31Z)
Framework for Machine Evaluation of Reasoning Completeness in Large Language Models For Classification Tasks [0.0]
This paper introduces RACE-Reasoning Alignment for Completeness of Explanations.<n>We analyze four widely used text classification datasets-WIKI ONTOLOGY, AG NEWS, IMDB, and GOEMOTIONS.<n>We show that correct predictions exhibit higher coverage of supporting features, while incorrect predictions are associated with elevated coverage of contradicting features.
arXiv Detail & Related papers (2025-10-23T20:22:22Z)
DivLogicEval: A Framework for Benchmarking Logical Reasoning Evaluation in Large Language Models [58.439517684779936]
This paper proposes a new classical logic benchmark DivLogicEval, consisting of natural sentences composed of diverse statements in a counterintuitive way.<n>To ensure a more reliable evaluation, we also introduce a new evaluation metric that mitigates the influence of bias and randomness inherent in Large Language Models.
arXiv Detail & Related papers (2025-09-19T04:40:46Z)
Quantifying Fairness in LLMs Beyond Tokens: A Semantic and Statistical Perspective [13.739343897204568]
Large Language Models (LLMs) often generate responses with inherent biases, undermining their reliability in real-world applications.<n>Existing evaluation methods often overlook biases in long-form responses and the intrinsic variability of LLM outputs.<n>We propose FiSco, a novel statistical framework to evaluate group-level fairness in LLMs by detecting subtle semantic differences in long-form responses across demographic groups.
arXiv Detail & Related papers (2025-06-23T18:31:22Z)
Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study [34.29839553042609]
We propose FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions.<n>We conduct a study on the effects of supervision format during fine-tuning.<n>Our findings reveal that natural language supervision yields strong generalization even on out-of-distribution and long-context tasks.
arXiv Detail & Related papers (2025-06-05T09:34:12Z)
A Closer Look at Bias and Chain-of-Thought Faithfulness of Large (Vision) Language Models [53.18562650350898]
Chain-of-thought (CoT) reasoning enhances performance of large language models.<n>We present the first comprehensive study of CoT faithfulness in large vision-language models.
arXiv Detail & Related papers (2025-05-29T18:55:05Z)
Relative Bias: A Comparative Framework for Quantifying Bias in LLMs [29.112649816695203]
Relative Bias is a method designed to assess how an LLM's behavior deviates from other LLMs within a specified target domain.<n>We introduce two complementary methodologies: (1) Embedding Transformation analysis, which captures relative bias patterns through sentence representations over the embedding space, and (2) LLM-as-a-Judge, which employs a language model to evaluate outputs comparatively.<n>Applying our framework to several case studies on bias and alignment scenarios following by statistical tests for validation, we find strong alignment between the two scoring methods.
arXiv Detail & Related papers (2025-05-22T01:59:54Z)
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [49.58786377307728]
This paper adopts an exploratory approach by introducing a controlled evaluation environment for analogical reasoning.<n>We analyze the comparative dynamics of inductive, abductive, and deductive inference pipelines.<n>We investigate advanced paradigms such as hypothesis selection, verification, and refinement, revealing their potential to scale up logical inference.
arXiv Detail & Related papers (2025-02-16T15:54:53Z)
Covert Bias: The Severity of Social Views' Unalignment in Language Models Towards Implicit and Explicit Opinion [0.40964539027092917]
We evaluate the severity of bias toward a view by using a biased model in edge cases of excessive bias scenarios. Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances. The direct, incautious responses of the unaligned models suggest a need for further refinement of decisiveness.
arXiv Detail & Related papers (2024-08-15T15:23:00Z)
Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales. We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism [62.571419297164645]
This paper provides a systematic overview of prior works on the logical reasoning ability of large language models for analyzing categorical syllogisms.<n>We first investigate all the possible variations for the categorical syllogisms from a purely logical perspective.<n>We then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets.
arXiv Detail & Related papers (2024-06-26T21:17:20Z)
Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial. We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)
Logical Satisfiability of Counterfactuals for Faithful Explanations in NLI [60.142926537264714]
We introduce the methodology of Faithfulness-through-Counterfactuals. It generates a counterfactual hypothesis based on the logical predicates expressed in the explanation. It then evaluates if the model's prediction on the counterfactual is consistent with that expressed logic.
arXiv Detail & Related papers (2022-05-25T03:40:59Z)

This list is automatically generated from the titles and abstracts of the papers in this site.