Adversarial Contrastive Learning for LLM Quantization Attacks
- URL: http://arxiv.org/abs/2601.02680v1
- Date: Tue, 06 Jan 2026 03:26:11 GMT
- Title: Adversarial Contrastive Learning for LLM Quantization Attacks
- Authors: Dinghong Song, Zhiwei Xu, Hai Wan, Xibin Zhao, Pengfei Su, Dong Li
- Abstract summary: Adversarial Contrastive Learning (ACL) is a gradient-based quantization attack that achieves superior attack effectiveness. ACL formulates the attack objective as a triplet-based contrastive loss and integrates it with a projected gradient descent two-stage distributed fine-tuning strategy. Experiments demonstrate ACL's effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection.
- Score: 28.158356717114845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model quantization is critical for deploying large language models (LLMs) on resource-constrained hardware, yet recent work has revealed a severe security risk: benign LLMs in full precision may exhibit malicious behaviors after quantization. In this paper, we propose Adversarial Contrastive Learning (ACL), a novel gradient-based quantization attack that achieves superior attack effectiveness by explicitly maximizing the gap between the probabilities of benign and harmful responses. ACL formulates the attack objective as a triplet-based contrastive loss, and integrates it with a projected gradient descent two-stage distributed fine-tuning strategy to ensure stable and efficient optimization. Extensive experiments demonstrate ACL's remarkable effectiveness, achieving attack success rates of 86.00% for over-refusal, 97.69% for jailbreak, and 92.40% for advertisement injection, substantially outperforming state-of-the-art methods by up to 44.67%, 18.84%, and 50.80%, respectively.
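The abstract names the two ingredients of the attack, a triplet-style margin loss on the probability gap between the harmful and benign response, and projected gradient descent that keeps the full-precision weights inside the set that quantizes to a fixed (malicious) quantized model, but does not spell out how they fit together. The toy below is an added illustration, not the paper's code, and it makes several assumptions that the page does not state: a linear scorer over a "prompt + response" feature vector, a two-response softmax (so the log-probability gap equals the score gap), a uniform symmetric quantizer with a fixed step (real per-channel scales depend on the weights and must be pinned as well), and a common two-stage construction (a straight-through stage that makes the quantized model harmful, then a PGD stage that restores benign full-precision behaviour inside the quantization preimage). The feature vector `D`, the weights `W0`, and all constants are constructed for the example.

```python
"""Toy quantization-preserving attack: triplet margin loss + projected gradient descent.

Added illustration, not the paper's implementation. One weight vector w scores a
(prompt, response) feature vector phi as w . phi. With two candidate responses
(benign refusal vs harmful compliance), log p(harm) - log p(benign) under a softmax
over the pair equals w . (phi_harm - phi_benign) =: w . D.
"""
import math

STEP = 0.5     # quantizer step; assumed fixed per-tensor scale (simplification)
MARGIN = 0.5   # triplet margin m
LR = 0.01      # gradient step size
EPS = 1e-6     # shrink the preimage box so projected values never sit on a rounding tie

# Constructed difference vector phi_harm - phi_benign for one harmful prompt.
D = [1.0] * 8
# Constructed benign starting model: full-precision gap < 0 and every coordinate quantizes to 0.
W0 = [-0.135, -0.105, -0.075, -0.045, -0.015, 0.015, 0.045, 0.075]


def quantize(w, step=STEP):
    """Uniform symmetric quantizer, round-half-away-from-zero, no clipping (toy)."""
    return [math.copysign(math.floor(abs(x) / step + 0.5) * step, x) for x in w]


def gap(w, d=D):
    """log p(harmful) - log p(benign) for the two-way softmax = score gap w . d."""
    return sum(wi * di for wi, di in zip(w, d))


def triplet_loss(pos_minus_neg, m=MARGIN):
    """Hinge triplet loss: zero once the positive response leads the negative by >= m."""
    return max(0.0, m - pos_minus_neg)


def preimage_box(qw, step=STEP, eps=EPS):
    """Per-coordinate interval of full-precision values that still round to qw[i]."""
    return [(q - step / 2 + eps, q + step / 2 - eps) for q in qw]


def project(w, box):
    """Euclidean projection onto the axis-aligned box (coordinate-wise clamp)."""
    return [min(max(x, lo), hi) for x, (lo, hi) in zip(w, box)]


def stage1_make_quantized_harmful(w, d=D, max_iters=1000):
    """Stage 1: make the *quantized* model prefer the harmful response.
    Straight-through estimator: loss on quantize(w), gradient -d applied to w."""
    w = list(w)
    for it in range(max_iters):
        if triplet_loss(gap(quantize(w), d)) == 0.0:
            return w, it
        w = [wi + LR * di for wi, di in zip(w, d)]  # descent on -(qw . d) via STE
    raise RuntimeError("stage 1 did not converge")


def stage2_restore_full_precision(w, d=D, max_iters=1000):
    """Stage 2: PGD. Make full precision prefer the benign response while every
    coordinate stays inside the preimage box of its (harmful) quantized value."""
    qw = quantize(w)
    box = preimage_box(qw)
    w = project(w, box)
    for it in range(max_iters):
        if triplet_loss(-gap(w, d)) == 0.0:  # positive = benign response now
            assert quantize(w) == qw         # the shipped quantized model is unchanged
            return w, it
        w = project([wi - LR * di for wi, di in zip(w, d)], box)
    raise RuntimeError("stage 2 did not converge: preimage box too narrow for this margin")


def run_attack(w0=W0, d=D):
    w1, n1 = stage1_make_quantized_harmful(w0, d)
    w2, n2 = stage2_restore_full_precision(w1, d)
    return {"w": w2, "fp_gap": gap(w2, d), "q_gap": gap(quantize(w2), d), "steps": (n1, n2)}


def quantization_flips_preference(w, d=D):
    """Defender-side check: does the quantized artifact prefer a different response
    than the full-precision checkpoint that was evaluated? True means it does."""
    return (gap(w, d) > 0) != (gap(quantize(w), d) > 0)


if __name__ == "__main__":
    res = run_attack()
    print("full-precision gap:", round(res["fp_gap"], 3), "(negative = benign)")
    print("quantized gap:     ", round(res["q_gap"], 3), "(positive = harmful)")
    print("steps per stage:   ", res["steps"])
    print("benign start flips after quantization?", quantization_flips_preference(W0))
    print("attacked model flips after quantization?", quantization_flips_preference(res["w"]))
```

The practitioner's takeaway is the last two lines: a checkpoint that passes safety evaluation in full precision says nothing about the artifact actually deployed, so evaluation (and any detection of over-refusal, jailbreak or injected advertisements) has to run on the quantized model with the exact quantizer that will ship. Stage 2 fails with an error when the margin exceeds what the preimage box allows, which is the regime where a narrower quantization grid would make this attack harder.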
Related papers
- SASER: Stego attacks on open-source LLMs [14.7664610166861] (arXiv, 2025-10-12T07:33:56Z)
SASER is the first stego attack on open-source large language models (LLMs). It wields its impact by identifying targeted parameters, embedding payloads, injecting triggers, and executing payloads in sequence. Experiments on LlaMA2-7B and ChatGLM3-6B, without quantization, show that SASER outperforms existing stego attacks by up to 98.1%.
- Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling [30.157082498075315] (arXiv, 2025-09-09T11:42:06Z)
Direct Prompt Injection (DPI) attacks pose a critical security threat to Large Language Models (LLMs) due to their low barrier to execution and high potential damage. To address the impracticality of existing white-box/gray-box methods and the poor transferability of black-box methods, we propose an activation-guided prompt injection attack framework.
- Sampling-aware Adversarial Attacks Against Large Language Models [52.30089653615172] (arXiv, 2025-07-06T16:13:33Z)
Existing adversarial attacks typically target harmful responses in single-point greedy generations. We show that, for the goal of eliciting harmful responses, it helps to sample model outputs repeatedly during attack prompt optimization. We show that integrating sampling into existing attacks boosts success rates by up to 37% and improves efficiency by up to two orders of magnitude.
- ICLShield: Exploring and Mitigating In-Context Learning Backdoor Attacks [61.06621533874629] (arXiv, 2025-07-02T03:09:20Z)
In-context learning (ICL) has demonstrated remarkable success in large language models (LLMs). In this paper, we propose, for the first time, the dual-learning hypothesis, which posits that LLMs simultaneously learn both the task-relevant latent concepts and backdoor latent concepts. Motivated by these findings, we propose ICLShield, a defense mechanism that dynamically adjusts the concept preference ratio.
- Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization [74.78433600288776] (arXiv, 2025-03-14T17:57:42Z)
HKVE (Hierarchical Key-Value Equalization) is an innovative jailbreaking framework that selectively accepts gradient optimization results. We show that HKVE substantially outperforms existing methods, by margins of 20.43%, 21.01% and 26.43% respectively.
- Guiding not Forcing: Enhancing the Transferability of Jailbreaking Attacks on LLMs via Removing Superfluous Constraints [81.14852921721793] (arXiv, 2025-02-25T07:47:41Z)
This study aims to understand and enhance the transferability of gradient-based jailbreaking methods. We introduce a novel conceptual framework to elucidate transferability and identify superfluous constraints. Our method increases the overall Transfer Attack Success Rate (T-ASR) across a set of target models with varying safety levels from 18.4% to 50.3%.
- Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models [92.79804303337522] (arXiv, 2024-11-27T02:40:29Z)
Vision-Language Models (VLMs) may still be vulnerable to safety alignment issues. We introduce MLAI, a novel jailbreak framework that leverages scenario-aware image generation for semantic alignment. Extensive experiments demonstrate MLAI's significant impact, achieving attack success rates of 77.75% on MiniGPT-4 and 82.80% on LLaVA-2.
- Membership Inference Attacks Against In-Context Learning [26.57639819629732] (arXiv, 2024-09-02T17:23:23Z)
We present the first membership inference attack tailored for In-Context Learning (ICL). We propose four attack strategies tailored to various constrained scenarios. We investigate three potential defenses targeting data, instruction, and output.
- Improved Generation of Adversarial Examples Against Safety-aligned LLMs [72.38072942860309] (arXiv, 2024-05-28T06:10:12Z)
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks. We show that 87% of the query-specific adversarial suffixes generated by the developed combination can induce Llama-2-7B-Chat to produce output that exactly matches the target string on AdvBench.
This list is automatically generated from the titles and abstracts of the papers in this site.