Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model
- URL: http://arxiv.org/abs/2602.07878v1
- Date: Sun, 08 Feb 2026 09:05:54 GMT
- Title: Rethinking Latency Denial-of-Service: Attacking the LLM Serving Framework, Not the Model
- Authors: Tianyi Wang, Huawei Fan, Yuanchao Shu, Peng Cheng, Cong Wang
- Abstract summary: Large Language Models face an emerging and critical threat known as latency attacks. Because inference is inherently expensive, even modest slowdowns can translate into substantial operating costs and severe availability risks. We introduce a new Fill and Squeeze attack strategy targeting the state transition of the scheduler.
- Score: 12.046157489400457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models face an emerging and critical threat known as latency attacks. Because LLM inference is inherently expensive, even modest slowdowns can translate into substantial operating costs and severe availability risks. Recently, a growing body of research has focused on algorithmic complexity attacks that craft inputs to trigger worst-case output lengths. However, we report a counter-intuitive finding: these algorithmic latency attacks are largely ineffective against modern LLM serving systems. We reveal that system-level optimizations such as continuous batching provide a logical isolation that mitigates contagious latency impact on co-located users. In this paper, we therefore shift the focus from the algorithm to the system layer and introduce a new Fill and Squeeze attack strategy targeting the state transitions of the scheduler. "Fill" first exhausts the global KV cache to induce Head-of-Line blocking, while "Squeeze" forces the system into repetitive preemption. By manipulating output lengths with methods ranging from simple plain-text prompts to more complex prompt engineering, and by leveraging side-channel probing of memory status, we demonstrate that the attack can be orchestrated in a black-box setting at much lower cost. Extensive evaluations show up to 20-280x average slowdown in Time to First Token and 1.5-4x average slowdown in Time Per Output Token compared to existing attacks, at 30-40% lower attack cost.
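The "Fill" half of the attack can be illustrated with a toy model of a continuous-batching scheduler that admits requests FCFS against a shared KV-cache budget. This is a minimal sketch under assumed, illustrative numbers (block counts, step lengths, class names); it is not the paper's implementation or any real serving framework's scheduler, and the "Squeeze" preemption loop is omitted for brevity.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    name: str
    kv_blocks: int        # KV-cache blocks this request needs to run
    remaining_steps: int  # decode steps until completion
    queued_steps: int = 0 # time spent waiting (a crude TTFT proxy)

class ToyScheduler:
    """FCFS continuous-batching scheduler with a shared KV-cache budget."""

    def __init__(self, total_blocks: int):
        self.free_blocks = total_blocks
        self.waiting: deque = deque()
        self.running: list = []

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> None:
        # Admit waiting requests FCFS while KV-cache blocks remain. A queue
        # "filled" with large requests exhausts the blocks and head-of-line
        # blocks every request behind them, however small.
        while self.waiting and self.waiting[0].kv_blocks <= self.free_blocks:
            req = self.waiting.popleft()
            self.free_blocks -= req.kv_blocks
            self.running.append(req)
        for req in self.waiting:
            req.queued_steps += 1
        # Decode one token for each running request; free blocks on completion.
        for req in list(self.running):
            req.remaining_steps -= 1
            if req.remaining_steps <= 0:
                self.running.remove(req)
                self.free_blocks += req.kv_blocks

def demo():
    sched = ToyScheduler(total_blocks=100)
    # "Fill": attacker requests sized to exhaust the shared KV cache,
    # with outputs long enough to hold it for many decode steps.
    for i in range(4):
        sched.submit(Request(f"attacker-{i}", kv_blocks=25, remaining_steps=50))
    # A small victim request arrives just behind them.
    victim = Request("victim", kv_blocks=5, remaining_steps=3)
    sched.submit(victim)
    for _ in range(60):
        sched.step()
    return victim.queued_steps, sched.free_blocks

if __name__ == "__main__":
    queued, free = demo()
    print(f"victim waited {queued} steps before admission")  # → 50 steps
```

The victim needs only 5 of 100 blocks and 3 decode steps, yet it waits the full 50 steps until the attacker batch drains, which is the Head-of-Line blocking effect the abstract describes.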
Related papers
- DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern [23.834578989358423]
We introduce DualSentinel, a lightweight and unified defense framework. It can accurately and promptly detect the activation of targeted attacks alongside the Large Language Model generation process. It is highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost).
arXiv Detail & Related papers (2026-03-02T08:02:47Z) - Time Is All It Takes: Spike-Retiming Attacks on Event-Driven Spiking Neural Networks [87.16809558673403]
Spiking neural networks (SNNs) compute with discrete spikes and exploit temporal structure. We study a timing-only adversary that retimes existing spikes while preserving spike counts and amplitudes in event-driven SNNs.
arXiv Detail & Related papers (2026-02-03T09:06:53Z) - Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference [29.81657023400426]
Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency. We propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context inference.
arXiv Detail & Related papers (2026-01-19T15:34:29Z) - HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language model (LLM) inference at the edge can facilitate prompt service responsiveness while protecting user privacy. We propose HALO, a novel framework that can boost distributed LLM inference in lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z) - Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs [55.827877498548965]
We propose a lightweight training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts.
arXiv Detail & Related papers (2025-11-28T15:22:52Z) - SecInfer: Preventing Prompt Injection via Inference-time Scaling [54.21558811232143]
We propose SecInfer, a novel defense against prompt injection attacks built on inference-time scaling. We show that SecInfer effectively mitigates both existing and adaptive prompt injection attacks, outperforming state-of-the-art defenses as well as existing inference-time scaling approaches.
arXiv Detail & Related papers (2025-09-29T16:00:41Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, which is an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z) - Attacking Large Language Models with Projected Gradient Descent [49.19426387912186]
Our Projected Gradient Descent (PGD) for adversarial prompts on LLMs is up to one order of magnitude faster than state-of-the-art discrete optimization while achieving the same devastating attack results.
arXiv Detail & Related papers (2024-02-14T13:13:26Z) - No-Skim: Towards Efficiency Robustness Evaluation on Skimming-based Language Models [27.469321590884903]
We propose No-Skim to help the owners of skimming-based LLMs understand and measure the robustness of their acceleration scheme.
Specifically, our framework searches minimal and unnoticeable perturbations at character-level and token-level to generate adversarial inputs that sufficiently increase the remaining token ratio.
In the worst case, the perturbation found by No-Skim substantially increases the running cost of the LLM by over 145% on average.
arXiv Detail & Related papers (2023-12-15T02:42:05Z) - Overload: Latency Attacks on Object Detection for Edge Devices [47.9744734181236]
This paper investigates latency attacks on deep learning applications.
Unlike common adversarial attacks for misclassification, the goal of latency attacks is to increase the inference time.
We use object detection to demonstrate how such attacks work.
arXiv Detail & Related papers (2023-04-11T17:24:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.