TFL: Targeted Bit-Flip Attack on Large Language Model
- URL: http://arxiv.org/abs/2602.17837v1
- Date: Thu, 19 Feb 2026 20:59:47 GMT
- Title: TFL: Targeted Bit-Flip Attack on Large Language Model
- Authors: Jingkai Guo, Chaitali Chakrabarti, Deliang Fan
- Abstract summary: Large language models (LLMs) are increasingly deployed in safety- and security-critical applications. We present TFL, a novel targeted bit-flip attack framework. Within our TFL framework, we propose a novel keyword-focused attack loss to promote attacker-specified target tokens in generative outputs.
- Score: 16.379863498328955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly deployed in safety- and security-critical applications, raising concerns about their robustness to model parameter fault injection attacks. Recent studies have shown that bit-flip attacks (BFAs), which exploit computer main memory (i.e., DRAM) vulnerabilities to flip a small number of bits in model weights, can severely disrupt LLM behavior. However, existing BFAs on LLMs largely induce untargeted failures or general performance degradation, offering limited control over manipulating specific or targeted outputs. In this paper, we present TFL, a novel targeted bit-flip attack framework that enables precise manipulation of LLM outputs for selected prompts while causing little or no degradation on unrelated inputs. Within our TFL framework, we propose a novel keyword-focused attack loss to promote attacker-specified target tokens in generative outputs, together with an auxiliary utility score that balances attack effectiveness against collateral performance impact on benign data. We evaluate TFL on multiple LLMs (Qwen, DeepSeek, Llama) and benchmarks (DROP, GSM8K, and TriviaQA). The experiments show that TFL achieves successful targeted LLM output manipulations with fewer than 50 bit flips and significantly reduced effect on unrelated queries compared to prior BFA approaches. This demonstrates the effectiveness of TFL and positions it as a new class of stealthy, targeted LLM attacks.
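The abstract describes two components, a keyword-focused attack loss and an auxiliary utility score, without giving their exact form. The following is a minimal, hypothetical PyTorch sketch of how such a combined objective could score candidate bit flips; it is not the authors' implementation, and every function name, the KL-based utility proxy, and the weighting factor `lam` are illustrative assumptions.

```python
# Hypothetical sketch only: a keyword-focused attack loss that rewards
# attacker-chosen target tokens on selected prompts, combined with a utility
# term that penalizes drift on benign prompts. Names and formulas are
# illustrative, not the TFL paper's actual definitions.
import torch
import torch.nn.functional as F

def keyword_attack_loss(attack_logits, target_token_ids):
    """Encourage generated positions to place probability mass on the
    attacker-specified keyword tokens (assumed formulation)."""
    # attack_logits: (seq_len, vocab_size) for the attacked prompt's continuation
    log_probs = F.log_softmax(attack_logits, dim=-1)
    target_lp = log_probs[:, target_token_ids]        # (seq_len, n_keywords)
    return -target_lp.max(dim=-1).values.mean()       # best keyword per position

def utility_penalty(flipped_logits, clean_logits):
    """KL divergence of the bit-flipped model from the clean model on benign
    prompts, standing in for the paper's auxiliary utility score."""
    log_p_clean = F.log_softmax(clean_logits, dim=-1)
    log_p_flip = F.log_softmax(flipped_logits, dim=-1)
    return F.kl_div(log_p_flip, log_p_clean, log_target=True,
                    reduction="batchmean")

def combined_objective(attack_logits, flipped_benign_logits, clean_benign_logits,
                       target_token_ids, lam=0.1):
    """Score a candidate bit flip: lower is better for the attacker."""
    return (keyword_attack_loss(attack_logits, target_token_ids)
            + lam * utility_penalty(flipped_benign_logits, clean_benign_logits))

if __name__ == "__main__":
    vocab, seq = 32000, 8
    attack_logits = torch.randn(seq, vocab)             # placeholder model outputs
    flipped_benign = torch.randn(seq, vocab)
    clean_benign = torch.randn(seq, vocab)
    keywords = torch.tensor([101, 2045])                 # hypothetical token ids
    print(combined_objective(attack_logits, flipped_benign, clean_benign, keywords))
```

In a full search, this score would be recomputed for each candidate bit flip, and only flips that lower the keyword loss without inflating the utility penalty would be retained, matching the balance between attack effectiveness and collateral impact described in the abstract.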
Related papers
- Friend or Foe: How LLMs' Safety Mind Gets Fooled by Intent Shift Attack [53.34204977366491]
Large language models (LLMs) remain vulnerable to jailbreaking attacks despite their impressive capabilities. In this paper, we introduce ISA (Intent Shift Attack), which obfuscates LLMs about the intent of the attacks. Our approach only needs minimal edits to the original request, and yields natural, human-readable, and seemingly harmless prompts.
arXiv Detail & Related papers (2025-11-01T13:44:42Z) - SBFA: Single Sneaky Bit Flip Attack to Break Large Language Models [16.379863498328955]
Bit-Flip Attacks can severely compromise Deep Neural Networks (DNNs). We propose SBFA (Sneaky Bit-Flip Attack), which collapses LLM performance with only a single bit flip. This is achieved by iteratively searching and ranking weights with our defined parameter sensitivity metric, ImpactScore (a generic sensitivity-ranking sketch is given after this list).
arXiv Detail & Related papers (2025-09-26T04:03:53Z) - SilentStriker: Toward Stealthy Bit-Flip Attacks on Large Language Models [13.200372347541142]
Bit-Flip Attacks (BFAs) exploit hardware vulnerabilities to corrupt model parameters and cause severe performance degradation. Existing BFA methods fail to balance performance degradation and output naturalness, making them prone to discovery. SilentStriker is the first stealthy bit-flip attack against LLMs that effectively degrades task performance while maintaining output naturalness.
arXiv Detail & Related papers (2025-09-22T05:36:18Z) - MEF: A Capability-Aware Multi-Encryption Framework for Evaluating Vulnerabilities in Black-Box Large Language Models [5.645247459469767]
We propose a capability-aware Multi-Encryption Framework (MEF) for evaluating vulnerabilities in black-box LLMs. For models with limited comprehension ability, MEF adopts the Fu+En1 strategy, which integrates layered semantic mutations with an encryption technique. For models with strong comprehension ability, MEF uses a more complex Fu+En1+En2 strategy, in which additional dual-ended encryption techniques are applied to the LLM's responses.
arXiv Detail & Related papers (2025-05-29T12:50:57Z) - Exploring the limits of strong membership inference attacks on large language models [70.49900359876595]
State-of-the-art membership inference attacks (MIAs) typically require training many reference models. We show that strong MIAs can succeed on pre-trained language models, but their effectiveness remains limited.
arXiv Detail & Related papers (2025-05-24T16:23:43Z) - Understanding and Enhancing the Transferability of Jailbreaking Attacks [12.446931518819875]
Jailbreaking attacks can effectively manipulate open-source large language models (LLMs) to produce harmful responses. This work investigates the transferability of jailbreaking attacks by analysing their impact on the model's intent perception. We propose the Perceived-importance Flatten (PiF) method, which uniformly disperses the model's focus across neutral-intent tokens in the original input.
arXiv Detail & Related papers (2025-02-05T10:29:54Z) - From Yes-Men to Truth-Tellers: Addressing Sycophancy in Large Language Models with Pinpoint Tuning [91.79567270986901]
Large Language Models (LLMs) tend to prioritize adherence to user prompts over providing veracious responses. Recent works propose to employ supervised fine-tuning (SFT) to mitigate the sycophancy issue. We propose a novel supervised pinpoint tuning (SPT), where the region-of-interest modules are tuned for a given objective.
arXiv Detail & Related papers (2024-09-03T07:01:37Z) - Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents.
These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks.
This study investigates the impact of such modifications, including fine-tuning and quantization, on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - Coercing LLMs to do and reveal (almost) anything [80.8601180293558]
It has been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
We argue that the spectrum of adversarial attacks on LLMs is much larger than merely jailbreaking.
arXiv Detail & Related papers (2024-02-21T18:59:13Z) - SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks [99.23352758320945]
We propose SmoothLLM, the first algorithm designed to mitigate jailbreaking attacks on large language models (LLMs).
Based on our finding that adversarially-generated prompts are brittle to character-level changes, our defense first randomly perturbs multiple copies of a given input prompt, and then aggregates the corresponding predictions to detect adversarial inputs.
arXiv Detail & Related papers (2023-10-05T17:01:53Z)
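As background for the bit-flip search loops referenced above (e.g., SBFA's ImpactScore ranking), the following is a minimal, hypothetical sketch of a first-order weight-sensitivity ranking. It is not code from any of the listed papers; the |gradient × weight| proxy and all names are assumptions used purely for illustration.

```python
# Hypothetical sketch: rank weights by a first-order estimate of how much
# perturbing them would change the loss, a common proxy used by bit-flip
# attack search loops. This is not SBFA's actual ImpactScore definition.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rank_weights_by_sensitivity(model, loss_fn, batch, top_k=10):
    """Return the top_k (score, parameter name, flat index) triples whose
    perturbation is expected to change the loss the most."""
    model.zero_grad()
    loss_fn(model, batch).backward()
    candidates = []
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        impact = (param.grad * param.detach()).abs().flatten()  # first-order proxy
        vals, idxs = impact.topk(min(top_k, impact.numel()))
        candidates += [(v.item(), name, i.item()) for v, i in zip(vals, idxs)]
    candidates.sort(reverse=True)
    return candidates[:top_k]

if __name__ == "__main__":
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
    loss_fn = lambda m, b: F.cross_entropy(m(b[0]), b[1])
    for score, name, idx in rank_weights_by_sensitivity(model, loss_fn, (x, y)):
        print(f"{name}[{idx}] impact={score:.4f}")
```

An actual attack would additionally map each high-impact weight to the specific bit in its stored representation whose flip moves the value furthest, which is where metrics such as ImpactScore come in.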
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.