Related papers: TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention

URL: http://arxiv.org/abs/2601.21900v2
Date: Sat, 31 Jan 2026 01:26:24 GMT
Title: TraceRouter: Robust Safety for Large Foundation Models via Path-Level Intervention
Authors: Chuancheng Shi, Shangze Li, Wenjun Lu, Wenhua Wu, Cong Wang, Zifeng Cheng, Fei Shen, Tat-Seng Chua
Abstract summary: Harmful semantics act as distributed, cross-layer circuits, rendering localized interventions brittle and detrimental to utility. We propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility.
Score: 44.64827167753535
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite their capabilities, large foundation models (LFMs) remain susceptible to adversarial manipulation. Current defenses predominantly rely on the "locality hypothesis", suppressing isolated neurons or features. However, harmful semantics act as distributed, cross-layer circuits, rendering such localized interventions brittle and detrimental to utility. To bridge this gap, we propose TraceRouter, a path-level framework that traces and disconnects the causal propagation circuits of illicit semantics. TraceRouter operates in three stages: (1) it pinpoints a sensitive onset layer by analyzing attention divergence; (2) it leverages sparse autoencoders (SAEs) and differential activation analysis to disentangle and isolate malicious features; and (3) it maps these features to downstream causal pathways via feature influence scores (FIS) derived from zero-out interventions. By selectively suppressing these causal chains, TraceRouter physically severs the flow of harmful information while leaving orthogonal computation routes intact. Extensive experiments demonstrate that TraceRouter significantly outperforms state-of-the-art baselines, achieving a superior trade-off between adversarial robustness and general utility. Our code will be publicly released. WARNING: This paper contains unsafe model responses.
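
Illustrative sketch (not the authors' code): the abstract does not define how feature influence scores are computed, so the following is only a generic zero-out ablation on a toy linear layer. The function name `feature_influence_scores`, the L2-norm scoring rule, and the matrix `W` are assumptions made for illustration.

```python
import numpy as np

def feature_influence_scores(features, downstream):
    """Score each feature by how much zeroing it changes the downstream output.

    Assumption: influence is measured as the L2 norm of the output change.
    The paper's actual FIS definition may differ.
    """
    baseline = downstream(features)
    scores = np.zeros(features.shape[0])
    for i in range(features.shape[0]):
        ablated = features.copy()
        ablated[i] = 0.0  # zero-out intervention on feature i
        scores[i] = np.linalg.norm(baseline - downstream(ablated))
    return scores

# Hypothetical downstream layer: feature 1 has no outgoing weight.
W = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
features = np.array([1.0, 5.0, 3.0])

scores = feature_influence_scores(features, lambda z: W @ z)
print(scores)
```

In this toy setting, a feature with a large activation but no downstream weight receives a score of zero, which is the kind of distinction a path-level analysis would rely on when deciding which routes to suppress.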

Related papers

Beyond Input Guardrails: Reconstructing Cross-Agent Semantic Flows for Execution-Aware Attack Detection [32.301679396929536]
We propose SysName, a framework that shifts the defensive paradigm from static input filtering to execution-aware analysis. SysName synthesizes fragmented operational primitives into contiguous behavioral trajectories, enabling a holistic view of system activity. Empirical evaluations demonstrate that SysName effectively detects over ten distinct compound attack vectors, achieving F1-scores of 85.3% and 66.7% for node-level and path-level end-to-end attack detection, respectively.
arXiv Detail & Related papers (2026-03-04T01:59:16Z)
TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls. Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy. We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z)
TopoCurate: Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically relies on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks. We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology. TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z)
Generalizing GNNs with Tokenized Mixture of Experts [75.8310720413187]
We show that improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. We propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.
arXiv Detail & Related papers (2026-02-09T22:48:30Z)
RedVisor: Reasoning-Aware Prompt Injection Defense via Zero-Copy KV Cache Reuse [47.85771791033142]
We propose RedVisor, a framework that synthesizes the explainability of detection systems with the seamless integration of prevention strategies. RedVisor is the first approach to leverage fine-grained reasoning paths to simultaneously detect attacks and guide the model's safe response. Experiments demonstrate that RedVisor outperforms state-of-the-art defenses in detection accuracy and throughput while incurring negligible utility loss.
arXiv Detail & Related papers (2026-02-02T08:26:51Z)
The Outline of Deception: Physical Adversarial Attacks on Traffic Signs Using Edge Patches [6.836569632189732]
This study proposes TESP-Attack, a novel stealth-aware adversarial patch method for traffic sign classification. Based on the observation that human visual attention primarily focuses on the central regions of traffic signs, we employ instance segmentation to generate edge-aligned masks. A U-Net generator is utilized to craft adversarial patches, which are then optimized through color and texture constraints.
arXiv Detail & Related papers (2025-11-30T07:26:07Z)
Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety [40.92620214527198]
Reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.
arXiv Detail & Related papers (2025-10-11T04:39:50Z)
Lateral Movement Detection via Time-aware Subgraph Classification on Authentication Logs [4.893077353126799]
Lateral movement is a crucial component of advanced persistent threat (APT) attacks in networks. We propose a multi-scale lateral movement detection framework called LMDetect.
arXiv Detail & Related papers (2024-11-15T15:35:56Z)
Evaluating the Robustness of Off-Road Autonomous Driving Segmentation against Adversarial Attacks: A Dataset-Centric analysis [1.6538732383658392]
This study investigates the vulnerability of semantic segmentation models to adversarial input perturbations. We compare the effects of adversarial attacks on different segmentation network architectures. This work contributes to the safe navigation of autonomous robot Unimog U5023 in rough off-road unstructured environments.
arXiv Detail & Related papers (2024-02-03T13:48:57Z)
Fuzzy Attention Neural Network to Tackle Discontinuity in Airway Segmentation [67.19443246236048]
Airway segmentation is crucial for the examination, diagnosis, and prognosis of lung diseases. Some small-sized airway branches (e.g., bronchus and terminal bronchioles) significantly aggravate the difficulty of automatic segmentation. This paper presents an efficient method for airway segmentation, comprising a novel fuzzy attention neural network and a comprehensive loss function.
arXiv Detail & Related papers (2022-09-05T16:38:13Z)
Road Network Guided Fine-Grained Urban Traffic Flow Inference [108.64631590347352]
Accurate inference of fine-grained traffic flow from coarse-grained data is an emerging yet crucial problem. We propose a novel Road-Aware Traffic Flow Magnifier (RATFM) that exploits the prior knowledge of road networks. Our method can generate high-quality fine-grained traffic flow maps.
arXiv Detail & Related papers (2021-09-29T07:51:49Z)
Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space. This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which poses a security concern in this field; and (ii) although commonly used self-supervision tasks (e.g., rotation and jigsaw) benefit image tasks such as classification and recognition, they fail to provide the critical supervision signals needed to learn discriminative representations for segmentation tasks.
arXiv Detail & Related papers (2021-05-23T01:50:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.