Assessing Representation Stability for Transformer Models
- URL: http://arxiv.org/abs/2508.11667v1
- Date: Wed, 06 Aug 2025 21:07:49 GMT
- Title: Assessing Representation Stability for Transformer Models
- Authors: Bryan E. Tuck, Rakesh M. Verma
- Abstract summary: Adversarial text attacks remain a persistent threat to transformer models. We introduce Representation Stability (RS), a model-agnostic detection framework. RS measures how embedding representations change when important words are masked.
- Score: 2.41710192205034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adversarial text attacks remain a persistent threat to transformer models, yet existing defenses are typically attack-specific or require costly model retraining. We introduce Representation Stability (RS), a model-agnostic detection framework that identifies adversarial examples by measuring how embedding representations change when important words are masked. RS first ranks words using importance heuristics, then measures embedding sensitivity to masking top-k critical words, and processes the resulting patterns with a BiLSTM detector. Experiments show that adversarially perturbed words exhibit disproportionately high masking sensitivity compared to naturally important words. Across three datasets, three attack types, and two victim models, RS achieves over 88% detection accuracy and demonstrates competitive performance compared to existing state-of-the-art methods, often at lower computational cost. Using Normalized Discounted Cumulative Gain (NDCG) to measure perturbation identification quality, we reveal that gradient-based ranking outperforms attention and random selection approaches, with identification quality correlating with detection performance for word-level attacks. RS also generalizes well to unseen datasets, attacks, and models without retraining, providing a practical solution for adversarial text detection.
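To make the pipeline concrete, below is a minimal sketch of the masking-sensitivity signal that RS feeds to its BiLSTM detector, assuming a HuggingFace encoder. The deletion-based importance proxy and all names here are illustrative; the paper ranks words with gradient- and attention-based heuristics and uses its own feature construction.

```python
# A minimal sketch of RS's masking-sensitivity signal (illustrative, not the
# authors' implementation).
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """[CLS] representation of a text."""
    out = enc(**tok(text, return_tensors="pt", truncation=True))
    return out.last_hidden_state[0, 0]

def sensitivity_profile(text: str, k: int = 5) -> torch.Tensor:
    """Embedding shift observed when each of the top-k words is masked."""
    words = text.split()
    base = embed(text)
    # Importance proxy: how much the embedding moves when a word is deleted
    # (RS instead uses gradient/attention ranking heuristics).
    importance = [
        1.0 - float(F.cosine_similarity(
            base, embed(" ".join(words[:i] + words[i + 1:])), dim=0))
        for i in range(len(words))
    ]
    top_k = sorted(range(len(words)), key=lambda i: -importance[i])[:k]
    # Sensitivity features: shift when each selected word is [MASK]-ed.
    feats = [
        1.0 - F.cosine_similarity(
            base,
            embed(" ".join(tok.mask_token if j == i else w
                           for j, w in enumerate(words))),
            dim=0)
        for i in top_k
    ]
    return torch.stack(feats)  # RS feeds such patterns to a BiLSTM detector
```

Adversarially perturbed words should dominate the top of the ranking and produce disproportionately large shifts, which is the separation the BiLSTM detector exploits.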
Related papers
- Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection [105.14032334647932]
Machine-generated texts (MGTs) pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. We propose a Markov-informed score calibration strategy that models two relationships of context detection scores that may aid calibration.
arXiv Detail & Related papers (2026-02-08T16:06:12Z)
- GradID: Adversarial Detection via Intrinsic Dimensionality of Gradients [0.1019561860229868]
In this paper, we investigate the geometric properties of a model's input loss landscape. We reveal a distinct and consistent difference in the intrinsic dimensionality (ID) for natural and adversarial data, which forms the basis of our proposed detection method. Our detector significantly surpasses existing methods against a wide array of attacks, including CW and AutoAttack, achieving detection rates consistently above 92% on CIFAR-10.
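For intuition, a common way to estimate intrinsic dimensionality is the TwoNN estimator of Facco et al. (2017), applied here to per-sample gradient vectors. This is a generic, hedged sketch, not GradID's exact estimator:

```python
# TwoNN intrinsic-dimension estimate; GradID's actual estimator, features,
# and decision thresholds may differ.
import numpy as np

def two_nn_id(X: np.ndarray) -> float:
    """TwoNN ID estimate from ratios of 2nd- to 1st-nearest-neighbour distances."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                # ignore self-distances
    r = np.sort(d, axis=1)
    mu = r[:, 1] / r[:, 0]                     # mu_i >= 1 for every sample
    return len(mu) / np.log(mu).sum()          # MLE under the TwoNN model

# Usage idea: compute two_nn_id over per-sample loss-gradient vectors and
# compare against a baseline ID estimated on known-natural inputs.
```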
arXiv Detail & Related papers (2025-12-14T20:16:03Z)
- Sensitivity of Small Language Models to Fine-tuning Data Contamination [0.0]
Small Language Models (SLMs) are increasingly being deployed in resource-constrained environments. We measure susceptibility to syntactic and semantic transformation types during instruction tuning. Character reversal produces near-complete failure across all models regardless of size or family. Semantic transformations demonstrate distinct threshold behaviors and greater resilience in core linguistic capabilities.
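As a concrete illustration, one plausible reading of the character-reversal contamination named above is flipping each word's characters (the paper may define the transformation differently):

```python
def reverse_characters(text: str) -> str:
    # One plausible reading of "character reversal": flip each word's
    # characters in place while keeping word order.
    return " ".join(word[::-1] for word in text.split())

print(reverse_characters("tune the model"))  # -> "enut eht ledom"
```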
arXiv Detail & Related papers (2025-11-10T06:44:29Z)
- RHINO: Guided Reasoning for Mapping Network Logs to Adversarial Tactics and Techniques with Large Language Models [9.065322387043546]
We introduce RHINO, a framework that decomposes LLM-based log analysis into three interpretable phases mirroring human reasoning. RHINO bridges the semantic gap between low-level observations and adversarial intent while improving output reliability through structured reasoning. Our results demonstrate that RHINO significantly enhances the interpretability and scalability of threat analysis, offering a blueprint for deploying LLMs in operational security settings.
arXiv Detail & Related papers (2025-10-16T02:25:46Z)
- Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations [2.7620215077666557]
Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique. This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training. We then introduce a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering.
arXiv Detail & Related papers (2025-09-22T13:03:53Z)
- Crafting Imperceptible On-Manifold Adversarial Attacks for Tabular Data [41.69043684367127]
Adversarial attacks on tabular data present fundamental challenges distinct from image or text domains. Traditional gradient-based methods prioritise $\ell_p$-norm constraints, which do not guarantee imperceptible adversarial examples in the tabular domain. We propose a latent space perturbation framework using a mixed-input Variational Autoencoder (VAE) to generate imperceptible adversarial examples.
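Schematically, on-manifold attacks of this kind optimise a perturbation in the VAE's latent space rather than in input space. The sketch below assumes a trained `vae` exposing `encode`/`decode` methods and a victim classifier `clf`; all names are hypothetical stand-ins for the paper's mixed-input VAE and victim model.

```python
import torch
import torch.nn.functional as F

def latent_space_attack(x, vae, clf, steps: int = 50, lr: float = 0.05):
    """Optimise a latent offset so the decoded sample flips clf's prediction."""
    z = vae.encode(x).detach()
    delta = torch.zeros_like(z, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    target = clf(x).argmax(dim=-1)                   # label to move away from
    for _ in range(steps):
        x_adv = vae.decode(z + delta)
        loss = -F.cross_entropy(clf(x_adv), target)  # maximise CE on original label
        opt.zero_grad()
        loss.backward()
        opt.step()
    return vae.decode(z + delta).detach()            # stays near the data manifold
```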
arXiv Detail & Related papers (2025-07-15T05:34:44Z)
- Exploring Gradient-Guided Masked Language Model to Detect Textual Adversarial Attacks [50.53590930588431]
Adversarial examples pose serious threats to natural language processing systems. Recent studies suggest that adversarial texts deviate from the underlying manifold of normal texts, whereas masked language models can approximate the manifold of normal data. We first introduce Masked Language Model-based Detection (MLMD), leveraging the mask-and-unmask operations of the masked language modeling (MLM) objective.
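A rough sketch of the mask/unmask intuition, assuming a HuggingFace masked LM: mask each token in turn and check whether the MLM restores it; texts far from the normal-data manifold should be restored less reliably. MLMD's actual scoring and decision rule are more involved.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def reconstruction_rate(text: str) -> float:
    """Fraction of tokens the MLM restores when each is masked in turn."""
    ids = tok(text, return_tensors="pt", truncation=True)["input_ids"][0]
    hits = 0
    for i in range(1, len(ids) - 1):           # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        logits = mlm(masked.unsqueeze(0)).logits[0, i]
        hits += int(logits.argmax().item() == ids[i].item())
    return hits / max(len(ids) - 2, 1)
```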
arXiv Detail & Related papers (2025-04-08T14:10:57Z)
- AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries.
We show that ACPT can detect 7 state-of-the-art query-based attacks with a >99% detection rate within 5 shots.
We also show that ACPT is robust to 3 types of adaptive attacks.
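The detection intuition can be sketched independently of the prompt-tuning step: query-based attacks issue sequences of near-duplicate images, so consecutive queries with unusually similar embeddings are suspicious. The plain (un-tuned) encoder and fixed threshold below are illustrative stand-ins for the ACPT-tuned CLIP encoder.

```python
import torch
import torch.nn.functional as F

class QueryAttackDetector:
    """Flag a new query whose embedding is suspiciously close to a recent one."""

    def __init__(self, encoder, threshold: float = 0.95, memory: int = 1000):
        self.encoder = encoder          # e.g., a CLIP image encoder
        self.threshold = threshold
        self.memory = memory
        self.history: list[torch.Tensor] = []

    @torch.no_grad()
    def is_suspicious(self, image: torch.Tensor) -> bool:
        emb = F.normalize(self.encoder(image).flatten(), dim=0)
        hit = any(float(emb @ past) > self.threshold for past in self.history)
        self.history.append(emb)
        self.history = self.history[-self.memory:]   # bound the buffer
        return hit
```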
arXiv Detail & Related papers (2024-08-04T09:53:50Z)
- Invariance-powered Trustworthy Defense via Remove Then Restore [7.785824663793149]
Adversarial attacks pose a challenge to the deployment of deep neural networks (DNNs). A key finding is that the salient attack component in an adversarial sample dominates the attacking process. A Pixel Surgery and Semantic Regeneration model following this targeted-therapy mechanism is developed.
arXiv Detail & Related papers (2024-02-01T03:34:48Z)
- ODDR: Outlier Detection & Dimension Reduction Based Defense Against Adversarial Patches [4.4100683691177816]
Adversarial attacks present a significant challenge to the dependable deployment of machine learning models.
We propose Outlier Detection and Dimension Reduction (ODDR), a comprehensive defense strategy to counteract patch-based adversarial attacks.
Our approach is based on the observation that input features corresponding to adversarial patches can be identified as outliers.
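An illustrative sketch of the outlier-detection plus dimension-reduction idea: reduce feature dimensionality, then mark outlier rows as candidate patch regions. ODDR's concrete feature extraction, reduction method, and thresholds may differ.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

def flag_patch_outliers(features: np.ndarray) -> np.ndarray:
    """Return +1 for inlier rows and -1 for outliers (candidate patch regions)."""
    reduced = PCA(n_components=min(8, features.shape[1])).fit_transform(features)
    return IsolationForest(random_state=0).fit_predict(reduced)
```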
arXiv Detail & Related papers (2023-11-20T11:08:06Z)
- Counterfactual Image Generation for adversarially robust and interpretable Classifiers [1.3859669037499769]
We propose a unified framework leveraging image-to-image translation Generative Adversarial Networks (GANs) to produce counterfactual samples. This is achieved by combining the classifier and discriminator into a single model that attributes real images to their respective classes and flags generated images as "fake". We show that the model exhibits improved robustness to adversarial attacks and that the discriminator's "fakeness" value serves as an uncertainty measure for its predictions.
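One standard way to realise "classifier and discriminator in a single model" is a (K+1)-way head whose extra logit marks images as fake, as in semi-supervised GANs; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn

class JointClassifierDiscriminator(nn.Module):
    """K real classes plus one "fake" class in a single classification head."""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, num_classes + 1)  # index K = "fake"

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))  # logits over K real classes + fake
```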
arXiv Detail & Related papers (2023-10-01T18:50:29Z)
- How adversarial attacks can disrupt seemingly stable accurate classifiers [76.95145661711514]
Adversarial attacks dramatically change the output of an otherwise accurate learning system using a seemingly inconsequential modification to a piece of input data.
Here, we show that this may be seen as a fundamental feature of classifiers working with high dimensional input data.
We introduce a simple, generic, and generalisable framework for which key behaviours observed in practical systems arise with high probability.
arXiv Detail & Related papers (2023-09-07T12:02:00Z)
- Semantic Image Attack for Visual Model Diagnosis [80.36063332820568]
In practice, metric analysis on a specific train and test dataset does not guarantee reliable or fair ML models.
This paper proposes Semantic Image Attack (SIA), an adversarial-attack-based method that provides semantic adversarial images.
arXiv Detail & Related papers (2023-03-23T03:13:04Z)
- Improving Adversarial Robustness to Sensitivity and Invariance Attacks with Deep Metric Learning [80.21709045433096]
A standard approach to adversarial robustness assumes a framework that defends against samples crafted by minimally perturbing a clean input. We use metric learning to frame adversarial regularization as an optimal transport problem. Our preliminary results indicate that regularizing over invariant perturbations in our framework improves defense against both invariance and sensitivity attacks.
arXiv Detail & Related papers (2022-11-04T13:54:02Z)