Related papers: CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse

URL: http://arxiv.org/abs/2602.08939v1
Date: Mon, 09 Feb 2026 17:36:56 GMT
Title: CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse
Authors: Longling Geng, Andy Ouyang, Theodore Wu, Daphne Barretto, Matthew John Hayes, Rachael Cooper, Yuqiao Zeng, Sameer Vijay, Gia Ancone, Ankit Rai, Matthew Wolfman, Patrick Flanagan, Edward Y. Chang
Abstract summary: CausalT5K is a diagnostic benchmark of over 5,000 cases across 10 domains. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail.
Score: 1.4608214000864057
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well-documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify missing information when evidence is underdetermined. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes invisible to aggregate accuracy. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape where static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: https://github.com/genglongling/CausalT5kBench
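
The abstract's split of performance into Utility (sensitivity) and Safety (specificity) can be illustrated with a minimal sketch. This is not the benchmark's scoring code: the `Case` record, its field names, and the convention that a "positive" case is one whose causal claim is valid are all assumptions made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Case:
    # Hypothetical record; field names are assumptions, not the benchmark's schema.
    claim_is_valid: bool   # gold label: the causal claim is actually supported
    model_endorsed: bool   # model decision: it endorsed the claim


def utility_safety(cases):
    """Return sensitivity (Utility), specificity (Safety), and plain accuracy."""
    tp = sum(c.claim_is_valid and c.model_endorsed for c in cases)
    fn = sum(c.claim_is_valid and not c.model_endorsed for c in cases)
    tn = sum(not c.claim_is_valid and not c.model_endorsed for c in cases)
    fp = sum(not c.claim_is_valid and c.model_endorsed for c in cases)
    utility = tp / (tp + fn) if tp + fn else float("nan")
    safety = tn / (tn + fp) if tn + fp else float("nan")
    accuracy = (tp + tn) / len(cases) if cases else float("nan")
    return {"utility": utility, "safety": safety, "accuracy": accuracy}


# A model that endorses every claim, on 8 valid cases and 2 trap cases.
always_yes = [Case(True, True)] * 8 + [Case(False, True)] * 2
# A more cautious model: misses 2 valid claims but rejects both traps.
cautious = [Case(True, True)] * 6 + [Case(True, False)] * 2 + [Case(False, False)] * 2

print(utility_safety(always_yes))
print(utility_safety(cautious))
```

Both toy models reach the same accuracy, yet the first has zero Safety and the second has reduced Utility, which is the kind of difference an aggregate accuracy figure hides.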

Related papers

Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection [85.29900916231655]
Reason-IAD is a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Experiments demonstrate that Reason-IAD consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2026-02-10T14:54:17Z)
LogicGaze: Benchmarking Causal Consistency in Visual Narratives via Counterfactual Verification [41.99844472131922]
We introduce LogicGaze, a novel benchmark framework designed to rigorously interrogate whether Vision-Language Models can validate sequential causal chains against visual inputs. Our tripartite evaluation protocol exposes significant vulnerabilities in state-of-the-art VLMs such as Qwen2.5-VL-72B. LogicGaze advocates for robust, trustworthy multimodal reasoning, with all resources publicly available in an anonymized repository.
arXiv Detail & Related papers (2026-01-30T20:28:01Z)
RAudit: A Blind Auditing Protocol for Large Language Model Reasoning [0.8594140167290097]
Inference-time scaling can amplify reasoning pathologies: sycophancy, rung collapse, and premature certainty. We present RAudit, a diagnostic protocol for auditing LLM reasoning without ground truth access.
arXiv Detail & Related papers (2026-01-30T16:22:45Z)
Think Locally, Explain Globally: Graph-Guided LLM Investigations via Local Reasoning and Belief Propagation [5.191980417814362]
LLM agents excel when environments are mostly static and the needed information fits in a model's context window. ReAct-style agents are especially brittle in this regime. We propose EoG, a framework in which an LLM performs bounded local evidence mining and labeling (cause vs. symptom) while a deterministic controller manages state and belief propagation to compute a minimal explanatory frontier.
arXiv Detail & Related papers (2026-01-25T17:27:19Z)
CoG: Controllable Graph Reasoning via Relational Blueprints and Failure-Aware Refinement over Knowledge Graphs [53.199517625701475]
CoG is a training-free framework inspired by Dual-Process Theory that mimics the interplay between intuition and deliberation. CoG significantly outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2026-01-16T07:27:40Z)
SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z)
Compressed Causal Reasoning: Quantization and GraphRAG Effects on Interventional and Counterfactual Accuracy [0.0]
This study systematically evaluates quantization effects across all three levels of Pearl's Causal Ladder. We find that rung-level accuracy in Llama 3 8B remains broadly stable under quantization, with NF4 showing less than one percent overall degradation. Experiments on the CRASS benchmark show near-identical performance across precisions, indicating that existing commonsense counterfactual datasets lack the structural sensitivity needed to reveal quantization-induced reasoning drift.
arXiv Detail & Related papers (2025-12-13T17:54:15Z)
FaithCoT-Bench: Benchmarking Instance-Level Faithfulness of Chain-of-Thought Reasoning [62.452350134196934]
FaithCoT-Bench is a unified benchmark for instance-level CoT unfaithfulness detection. Our framework formulates unfaithfulness detection as a discriminative decision problem. FaithCoT-Bench sets a solid basis for future research toward more interpretable and trustworthy reasoning in LLMs.
arXiv Detail & Related papers (2025-10-05T05:16:54Z)
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
CSnake: Detecting Self-Sustaining Cascading Failure via Causal Stitching of Fault Propagations [7.708183748221455]
This paper presents CSnake, a fault injection framework to expose self-sustaining cascading failures in distributed systems. CSnake uses the novel idea of causal stitching, which causally links multiple single-fault injections in different tests to simulate complex fault propagation chains. CSnake detected 15 bugs that cause self-sustaining cascading failures in five systems; five of these bugs have been confirmed and two have been fixed.
arXiv Detail & Related papers (2025-09-30T17:04:31Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.