SilentStriker: Toward Stealthy Bit-Flip Attacks on Large Language Models
- URL: http://arxiv.org/abs/2509.17371v2
- Date: Tue, 23 Sep 2025 03:08:31 GMT
- Title: SilentStriker: Toward Stealthy Bit-Flip Attacks on Large Language Models
- Authors: Haotian Xu, Qingsong Peng, Jie Shi, Huadi Zheng, Yu Li, Cheng Zhuo
- Abstract summary: Bit-Flip Attacks (BFAs) exploit hardware vulnerabilities to corrupt model parameters and cause severe performance degradation. Existing BFA methods fail to balance performance degradation and output naturalness, making them prone to discovery. SilentStriker is the first stealthy bit-flip attack against LLMs that effectively degrades task performance while maintaining output naturalness.
- Score: 13.200372347541142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid adoption of large language models (LLMs) in critical domains has spurred extensive research into their security issues. While input manipulation attacks (e.g., prompt injection) have been well studied, Bit-Flip Attacks (BFAs) -- which exploit hardware vulnerabilities to corrupt model parameters and cause severe performance degradation -- have received far less attention. Existing BFA methods suffer from key limitations: they fail to balance performance degradation and output naturalness, making them prone to discovery. In this paper, we introduce SilentStriker, the first stealthy bit-flip attack against LLMs that effectively degrades task performance while maintaining output naturalness. Our core contribution lies in addressing the challenge of designing effective loss functions for LLMs with variable output length and the vast output space. Unlike prior approaches that rely on output perplexity for attack loss formulation, which inevitably degrade output naturalness, we reformulate the attack objective by leveraging key output tokens as targets for suppression, enabling effective joint optimization of attack effectiveness and stealthiness. Additionally, we employ an iterative, progressive search strategy to maximize attack efficacy. Experiments show that SilentStriker significantly outperforms existing baselines, achieving successful attacks without compromising the naturalness of generated text.
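The abstract stops short of giving the loss itself, so the following is a minimal sketch of what a key-token suppression objective could look like for a HuggingFace-style causal LM; the function and the `key_mask` argument are illustrative names, not the paper's API.

```python
# Hypothetical sketch of a key-token suppression loss; the actual
# SilentStriker objective may differ in detail.
import torch
import torch.nn.functional as F

def key_token_suppression_loss(logits: torch.Tensor, labels: torch.Tensor,
                               key_mask: torch.Tensor) -> torch.Tensor:
    """logits: (B, T, V) LM outputs; labels: (B, T) reference output tokens;
    key_mask: (B, T) bool, True at task-critical answer tokens only.

    Minimizing this value drives down the probability of just the key
    answer tokens, leaving the rest of the distribution -- and hence the
    fluency of sampled text -- largely untouched, unlike a perplexity-
    maximizing loss that degrades every position."""
    logp = F.log_softmax(logits, dim=-1)                          # (B, T, V)
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return tok_logp[key_mask].mean()
```

The iterative, progressive search mentioned in the abstract would then rank candidate weight bits by how strongly flipping them decreases this quantity, flip the most promising ones, and repeat.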
Related papers
- Dashed Line Defense: Plug-And-Play Defense Against Adaptive Score-Based Query Attacks [3.206339985805037]
Dashed Line Defense (DLD) is a plug-and-play post-processing method specifically designed to withstand adaptive query strategies. By introducing ambiguity in how the observed loss reflects the true adversarial strength of candidate examples, DLD prevents attackers from reliably analyzing and adapting their queries (a toy sketch of this idea follows the entry). We provide theoretical guarantees of DLD's defense capability and validate its effectiveness through experiments on ImageNet.
arXiv Detail & Related papers (2026-02-09T14:02:32Z)
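The DLD construction itself is not reproduced in this summary, so the toy sketch below only illustrates the general family it belongs to: post-processing returned scores so that the loss an attacker observes no longer tracks true adversarial strength. All names are hypothetical, and this noise mechanism carries none of the paper's guarantees.

```python
# Toy score obfuscation against score-based query attacks (not the DLD method).
import numpy as np

def obfuscate_scores(probs: np.ndarray, sigma: float = 0.1,
                     rng: np.random.Generator | None = None) -> np.ndarray:
    """Keep the predicted label but perturb the reported probabilities, so
    small input changes no longer yield a reliable loss signal."""
    rng = rng or np.random.default_rng()
    noisy = np.log(probs + 1e-12) + rng.normal(0.0, sigma, probs.shape)
    out = np.exp(noisy - noisy.max())
    out /= out.sum()
    if out.argmax() != probs.argmax():      # preserve benign top-1 accuracy
        out[probs.argmax()] = out.max() + 1e-6
        out /= out.sum()
    return out
```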
- DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z)
- TopicAttack: An Indirect Prompt Injection Attack via Topic Transition [71.81906608221038]
Large language models (LLMs) are vulnerable to indirect prompt injection attacks. We propose TopicAttack, which prompts the LLM to generate a fabricated transition prompt that gradually shifts the topic toward the injected instruction. We find that a higher injected-to-original attention ratio leads to a greater success probability, and our method achieves a much higher ratio than the baseline methods (one reading of this ratio is sketched after the entry).
arXiv Detail & Related papers (2025-07-18T06:23:31Z)
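The summary does not define the injected-to-original attention ratio precisely; one plausible reading, sketched below, compares the attention mass flowing from the generated tokens to the injected instruction against the mass on the original instruction. The function and argument names are assumptions.

```python
import torch

def attention_ratio(attn: torch.Tensor, injected_idx: list[int],
                    original_idx: list[int], gen_idx: list[int]) -> float:
    """attn: (layers, heads, T, T) attention weights from one forward pass;
    *_idx: token positions of the injected instruction, the original
    instruction, and the generated span, respectively."""
    a = attn.mean(dim=(0, 1))                  # (T, T), averaged over layers/heads
    inj = a[gen_idx][:, injected_idx].sum()    # mass: generated -> injected
    org = a[gen_idx][:, original_idx].sum()    # mass: generated -> original
    return (inj / (org + 1e-12)).item()
```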
- Sampling-aware Adversarial Attacks Against Large Language Models [52.30089653615172]
Existing adversarial attacks typically target harmful responses in single-point greedy generations. We show that, for the goal of eliciting harmful responses, it pays to repeatedly sample model outputs during attack prompt optimization (see the sketch after this entry). Integrating sampling into existing attacks boosts success rates by up to 37% and improves efficiency by up to two orders of magnitude.
arXiv Detail & Related papers (2025-07-06T16:13:33Z)
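A minimal sketch of the sampling idea, assuming a HuggingFace-style `generate` API and a caller-supplied `is_harmful` judge (both placeholders rather than the paper's interface): each candidate prompt is scored by its success rate over several sampled generations instead of a single greedy decode.

```python
import torch

@torch.no_grad()
def sampled_attack_score(model, tokenizer, prompt: str, is_harmful,
                         n_samples: int = 8, temperature: float = 0.7) -> float:
    """Estimate attack success probability by sampling, not greedy decoding."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outs = model.generate(**inputs, do_sample=True, temperature=temperature,
                          num_return_sequences=n_samples, max_new_tokens=128)
    texts = tokenizer.batch_decode(outs[:, inputs["input_ids"].shape[1]:],
                                   skip_special_tokens=True)
    # The prompt optimizer keeps whichever candidate maximizes this estimate.
    return sum(map(is_harmful, texts)) / n_samples
```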
- Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs [54.90315421117162]
We propose a novel poisoning method via completely harmless data. Inspired by the causal reasoning in auto-regressive LLMs, we aim to establish robust associations between triggers and an affirmative response prefix. Our method effectively injects backdoors into various LLMs for harmful content generation, even under the detection of powerful guardrail models.
arXiv Detail & Related papers (2025-05-23T08:13:59Z)
- BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models [19.856128742435814]
This paper introduces a new type of inference cost attack (dubbed 'bit-flip inference cost attack') that targets the victim model itself rather than its inputs. Specifically, we design a simple yet effective method (dubbed 'BitHydra') to flip critical bits of model parameters (the bit-level primitive is sketched after this entry). With just 4 search samples and as few as 3 bit flips, BitHydra can force 100% of test prompts to reach the maximum generation length.
arXiv Detail & Related papers (2025-05-22T13:36:00Z)
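BitHydra's bit-search procedure is its own contribution; the snippet below only demonstrates the primitive that all such attacks share: a single flipped exponent bit in an IEEE-754 float32 weight changes its magnitude by dozens of orders of magnitude.

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip bit `bit` of the float32 encoding of x
    (0-22: mantissa, 23-30: exponent, 31: sign)."""
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return y

print(flip_bit(0.01, 30))   # 0.01 -> ~3.4e+36 from one exponent-bit flip
```

In the inference-cost setting, the flipped bits would presumably be chosen so that the corrupted weights suppress the end-of-sequence token, keeping every generation at the length cap.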
- Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in CLIP [51.04452017089568]
Class-wise Backdoor Prompt Tuning (CBPT) is an efficient and effective defense mechanism that operates on text prompts to indirectly purify CLIP. CBPT significantly mitigates backdoor threats while preserving model utility.
arXiv Detail & Related papers (2025-02-26T16:25:15Z)
- ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models [55.93380086403591]
Generative large language models are vulnerable to backdoor attacks. ELBA-Bench allows attackers to inject backdoors through parameter-efficient fine-tuning, and provides over 1,300 experiments.
arXiv Detail & Related papers (2025-02-22T12:55:28Z)
- Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models [8.348993615202138]
Backdoor attacks on large language models (LLMs) typically set a fixed trigger in the input instance and specific responses for triggered queries. We present a new poisoning paradigm against LLMs triggered by specifying generation conditions. The poisoned model behaves normally under ordinary generation conditions, while becoming harmful under the targeted generation conditions.
arXiv Detail & Related papers (2024-04-23T07:19:20Z)
- Attacking Large Language Models with Projected Gradient Descent [49.19426387912186]
Projected Gradient Descent (PGD) for adversarial prompts is up to one order of magnitude faster than state-of-the-art discrete optimization while achieving the same devastating attack results (a schematic follows this entry).
arXiv Detail & Related papers (2024-02-14T13:13:26Z)
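A schematic of PGD over a continuously relaxed prompt: each position holds a distribution over the vocabulary, updated by a gradient step and projected back onto the probability simplex. A random toy scorer stands in for the LLM forward pass, so this illustrates the optimization pattern, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def project_simplex(v: torch.Tensor) -> torch.Tensor:
    """Row-wise Euclidean projection onto the probability simplex
    (sort-based algorithm of Duchi et al., 2008)."""
    u, _ = torch.sort(v, dim=-1, descending=True)
    css = u.cumsum(-1) - 1.0
    k = torch.arange(1, v.shape[-1] + 1, device=v.device)
    rho = (u - css / k > 0).float().cumsum(-1).argmax(-1)   # last valid index
    theta = css.gather(-1, rho.unsqueeze(-1)) / (rho + 1).unsqueeze(-1).float()
    return torch.clamp(v - theta, min=0.0)

V, T, D = 50, 8, 16                     # toy vocab size, prompt length, embed dim
torch.manual_seed(0)
E = torch.randn(V, D)                   # embedding table
W = torch.randn(D, V)                   # toy "LM head" standing in for the model
target = torch.tensor([7])              # token the attack tries to force

S = torch.full((T, V), 1.0 / V, requires_grad=True)   # relaxed one-hot prompt

for _ in range(200):
    logits = (S @ E).mean(0) @ W        # soft prompt -> toy next-token logits
    loss = F.cross_entropy(logits.unsqueeze(0), target)
    loss.backward()
    with torch.no_grad():
        S -= 0.5 * S.grad               # gradient step
        S.copy_(project_simplex(S))     # projection back onto the simplex
        S.grad.zero_()

hard_prompt = S.argmax(-1)              # discretize for the final attack string
```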
- LEAT: Towards Robust Deepfake Disruption in Real-World Scenarios via Latent Ensemble Attack [11.764601181046496]
Deepfakes, malicious visual contents created by generative models, pose an increasingly harmful threat to society.
To proactively mitigate deepfake damage, recent studies have employed adversarial perturbations to disrupt deepfake model outputs.
We propose a simple yet effective disruption method called Latent Ensemble ATtack (LEAT), which attacks the independent latent encoding process.
arXiv Detail & Related papers (2023-07-04T07:00:37Z)
- Versatile Weight Attack via Flipping Limited Bits [68.45224286690932]
We study a novel attack paradigm, which modifies model parameters in the deployment stage.
Considering the effectiveness and stealthiness goals, we provide a general formulation to perform the bit-flip based weight attack.
We present two cases of the general formulation with different malicious purposes, i.e., single sample attack (SSA) and triggered samples attack (TSA).
arXiv Detail & Related papers (2022-07-25T03:24:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.