Related papers: Reasoning Hijacking: Subverting LLM Classification via Decision-Criteria Injection

Reasoning Hijacking: Subverting LLM Classification via Decision-Criteria Injection

URL: http://arxiv.org/abs/2601.10294v1
Date: Thu, 15 Jan 2026 11:12:08 GMT
Title: Reasoning Hijacking: Subverting LLM Classification via Decision-Criteria Injection
Authors: Yuansen Liu, Yixuan Tang, Anthony Kum Hoe Tun,
Abstract summary: We propose a new adversarial paradigm: Reasoning Hijacking and instantiate it with Criteria Attack.<n>Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking accepts the high-level goal but manipulates the model's decision-making logic.<n>Because the model's "intent" remains aligned with the user's instructions, these attacks can bypass defenses designed to detect goal deviation.
Score: 4.682489563620585
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We propose a new adversarial paradigm: Reasoning Hijacking and instantiate it with Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking accepts the high-level goal but manipulates the model's decision-making logic by injecting spurious reasoning shortcut. Though extensive experiments on three different tasks (toxic comment, negative review, and spam detection), we demonstrate that even newest models are prone to prioritize injected heuristic shortcuts over rigorous semantic analysis. The results are consistent over different backbones. Crucially, because the model's "intent" remains aligned with the user's instructions, these attacks can bypass defenses designed to detect goal deviation (e.g., SecAlign, StruQ), exposing a fundamental blind spot in the current safety landscape. Data and code are available at https://github.com/Yuan-Hou/criteria_attack

Related papers

Defenses Against Prompt Attacks Learn Surface Heuristics [40.392588465939106]
Large language models (LLMs) are increasingly deployed in security-sensitive applications.<n>LLMs may override intended logic when adversarial instructions appear in user queries or retrieved content.<n>Recent defenses rely on supervised fine-tuning with benign and malicious labels.
arXiv Detail & Related papers (2026-01-12T04:12:48Z)
Aligning Deep Implicit Preferences by Learning to Reason Defensively [22.548051297731416]
We propose Critique-Driven Reasoning Alignment (CDRA) to bridge the preference inference gap.<n>CDRA reframes alignment from a scalar reward-matching task into a structured reasoning process.<n> Experiments demonstrate that CDRA excels at discovering and aligning with users' true preferences while executing robust reasoning.
arXiv Detail & Related papers (2025-10-13T09:26:47Z)
One Token Embedding Is Enough to Deadlock Your Large Reasoning Model [91.48868589442837]
We present the Deadlock Attack, a resource exhaustion method that hijacks an LRM's generative control flow.<n>Our method achieves a 100% attack success rate across four advanced LRMs.
arXiv Detail & Related papers (2025-10-12T07:42:57Z)
Refusal Falls off a Cliff: How Safety Alignment Fails in Reasoning? [68.82210578851442]
We investigate why safety alignment fails in reasoning models through a mechanistic interpretability lens.<n>Using a linear probing approach to trace refusal intentions across token positions, we discover a phenomenon termed as textbfrefusal cliff<n>We propose textbfCliff-as-a-Judge, a novel data selection method that identifies training examples exhibiting the largest refusal cliff to efficiently repair reasoning models' safety alignment.
arXiv Detail & Related papers (2025-10-07T15:32:59Z)
Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data.<n>Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix.<n>We observe an interesting resistance phenomenon where the LLM initially appears to agree but subsequently refuses to answer.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
Backdooring Outlier Detection Methods: A Novel Attack Approach [2.19238269573727]
Outlier detection is crucial for deploying classifiers in critical real-world applications.<n>We propose BATOD, a novel Backdoor Attack targeting the Outlier Detection task.
arXiv Detail & Related papers (2024-12-06T13:03:22Z)
Pseudo-Conversation Injection for LLM Goal Hijacking [3.574664325523221]
In goal hijacking, an attacker typically appends a carefully crafted malicious suffix to the user's prompt. We introduce a novel goal hijacking attack method called Pseudo-Conversation Injection. We propose three Pseudo-Conversation construction strategies: Targeted Pseudo-Conversation, Universal Pseudo-Conversation, and Robust Pseudo-Conversation.
arXiv Detail & Related papers (2024-10-31T06:58:34Z)
You Know What I'm Saying: Jailbreak Attack via Implicit Reference [22.520950422702757]
This study identifies a previously overlooked vulnerability, which we term Attack via Implicit Reference (AIR) AIR decomposes a malicious objective into permissible objectives and links them through implicit references within the context. Our experiments demonstrate AIR's effectiveness across state-of-the-art LLMs, achieving an attack success rate (ASR) exceeding 90% on most models.
arXiv Detail & Related papers (2024-10-04T18:42:57Z)
AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries. We show that ACPT can detect 7 state-of-the-art query-based attacks with $>99%$ detection rate within 5 shots. We also show that ACPT is robust to 3 types of adaptive attacks.
arXiv Detail & Related papers (2024-08-04T09:53:50Z)
Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries. Our findings reveal the presence of shortcuts within models, leading to an over-attention of harmful words like 'kill' and prompts emphasizing safety will exacerbate overkill. We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z)
Object-fabrication Targeted Attack for Object Detection [54.10697546734503]
adversarial attack for object detection contains targeted attack and untargeted attack. New object-fabrication targeted attack mode can mislead detectors tofabricate extra false objects with specific target labels.
arXiv Detail & Related papers (2022-12-13T08:42:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.