Related papers: D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models

URL: http://arxiv.org/abs/2509.17938v1
Date: Mon, 22 Sep 2025 15:59:40 GMT
Title: D-REX: A Benchmark for Detecting Deceptive Reasoning in Large Language Models
Authors: Satyapriya Krishna, Andy Zou, Rahul Gupta, Eliot Krzysztof Jones, Nick Winter, Dan Hendrycks, J. Zico Kolter, Matt Fredrikson, Spyros Matsoukas,
Abstract summary: Deceptive Reasoning Exposure Suite (D-REX) is a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output.<n>Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought.<n>We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms.
Score: 62.83226685925107
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The safety and alignment of Large Language Models (LLMs) are critical for their responsible deployment. Current evaluation methods predominantly focus on identifying and preventing overtly harmful outputs. However, they often fail to address a more insidious failure mode: models that produce benign-appearing outputs while operating on malicious or deceptive internal reasoning. This vulnerability, often triggered by sophisticated system prompt injections, allows models to bypass conventional safety filters, posing a significant, underexplored risk. To address this gap, we introduce the Deceptive Reasoning Exposure Suite (D-REX), a novel dataset designed to evaluate the discrepancy between a model's internal reasoning process and its final output. D-REX was constructed through a competitive red-teaming exercise where participants crafted adversarial system prompts to induce such deceptive behaviors. Each sample in D-REX contains the adversarial system prompt, an end-user's test query, the model's seemingly innocuous response, and, crucially, the model's internal chain-of-thought, which reveals the underlying malicious intent. Our benchmark facilitates a new, essential evaluation task: the detection of deceptive alignment. We demonstrate that D-REX presents a significant challenge for existing models and safety mechanisms, highlighting the urgent need for new techniques that scrutinize the internal processes of LLMs, not just their final outputs.

Related papers

The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation [11.984098021215878]
We introduce the Semantic-Preserving Adrial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP)<n>These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree.<n>Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3.
arXiv Detail & Related papers (2026-01-29T07:40:58Z)
CODE: A Contradiction-Based Deliberation Extension Framework for Overthinking Attacks on Retrieval-Augmented Generation [43.85448261466922]
We propose an end-to-end attack framework named Contradiction-Based Deliberation Extension (CODE)<n> CODE develops a multi-agent architecture to construct poisoning samples that are injected into the knowledge base.<n>Experiments show that CODE causes a 5.32x-24.72x increase in reasoning token consumption, without degrading task performance.
arXiv Detail & Related papers (2026-01-19T14:52:31Z)
Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models [72.4149653187766]
We propose a Reasoner-Verifier framework named Adrialversa Reasoning RAG (ARR)<n>The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other's logic while being guided by process-aware advantage.<n> Experiments on multiple benchmarks demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2026-01-08T06:57:03Z)
NeXT-IMDL: Build Benchmark for NeXT-Generation Image Manipulation Detection & Localization [67.84497768987023]
NeXT-IMDL is a large-scale diagnostic benchmark designed to probe the boundaries of current detectors.<n>NeXT-IMDL categorizes AIGC-based manipulations along four fundamental axes: editing models, manipulation types, content semantics, and forgery.<n>Our experiments on 11 representative models reveal a critical insight: while these models perform well in their original settings, they exhibit systemic failures and significant performance degradation.
arXiv Detail & Related papers (2025-12-29T11:09:35Z)
Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs [10.896368527058714]
Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries.<n>We introduce two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB)<n>Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios.
arXiv Detail & Related papers (2025-10-09T12:38:16Z)
DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
AdvChain: Adversarial Chain-of-Thought Tuning for Robust Safety Alignment of Large Reasoning Models [62.70575022567081]
We propose AdvChain, an alignment paradigm that teaches models dynamic self-correction through adversarial CoT tuning.<n>Our work establishes a new direction for building more robust and reliable reasoning models.
arXiv Detail & Related papers (2025-09-29T04:27:23Z)
AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models [6.059681491089391]
AURA provides comprehensive, step level evaluations across logical coherence and safety-awareness.<n>Our framework seamlessly combines introspective self-critique, fine-grained PRM assessments, and adaptive safety-aware decoding.<n>This research represents a pivotal step toward safer, more responsible, and contextually aware AI, setting a new benchmark for alignment-sensitive applications.
arXiv Detail & Related papers (2025-08-08T08:43:24Z)
Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models [0.0]
Reasoning Language Models (RLMs) have gained traction for their ability to perform complex, multi-step reasoning tasks.<n>While these capabilities promise improved reliability, their impact on robustness to social biases remains unclear.<n>We leverage the CLEAR-Bias benchmark to investigate the adversarial robustness of RLMs to bias elicitation.
arXiv Detail & Related papers (2025-07-03T17:01:53Z)
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors [2.07180164747172]
High-risk industries like nuclear and aviation use real-time monitoring to detect dangerous system conditions.<n>We propose a real-time framework to predict harmful AI outputs before they occur by using an unsupervised approach.
arXiv Detail & Related papers (2025-05-20T12:49:58Z)
AlignRAG: Leveraging Critique Learning for Evidence-Sensitive Retrieval-Augmented Reasoning [61.28113271728859]
RAG has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>Standard RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>In this work, we reinterpret RAG as Retrieval-Augmented Reasoning and identify a central but underexplored problem: textitReasoning Misalignment.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
Feature-Aware Malicious Output Detection and Mitigation [8.378272216429954]
We propose a feature-aware method for harmful response rejection (FMM)<n>FMM detects the presence of malicious features within the model's feature space and adaptively adjusts the model's rejection mechanism.<n> Experimental results demonstrate the effectiveness of our approach across multiple language models and diverse attack techniques.
arXiv Detail & Related papers (2025-04-12T12:12:51Z)
Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [0.0]
Large Language Models (LLMs) have revolutionized artificial intelligence, driving advancements in machine translation, summarization, and conversational agents.<n>Recent studies indicate that LLMs remain vulnerable to adversarial attacks designed to elicit biased responses.<n>This work proposes a scalable benchmarking framework to evaluate LLM robustness against adversarial bias elicitation.
arXiv Detail & Related papers (2025-04-10T16:00:59Z)
Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance. Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification. We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations. Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
arXiv Detail & Related papers (2024-02-07T21:58:40Z)
On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction. Inspired by these findings, we propose a method for safety prompt optimization, namely DRO. Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
Diffusion Denoising Process for Perceptron Bias in Out-of-distribution Detection [67.49587673594276]
We introduce a new perceptron bias assumption that suggests discriminator models are more sensitive to certain features of the input, leading to the overconfidence problem. We demonstrate that the diffusion denoising process (DDP) of DMs serves as a novel form of asymmetric, which is well-suited to enhance the input and mitigate the overconfidence problem. Our experiments on CIFAR10, CIFAR100, and ImageNet show that our method outperforms SOTA approaches.
arXiv Detail & Related papers (2022-11-21T08:45:08Z)

This list is automatically generated from the titles and abstracts of the papers in this site.