Related papers: Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes

Related papers

Analysis of LLMs Against Prompt Injection and Jailbreak Attacks [7.685814179879813]
This work evaluates prompt-injection and jailbreak vulnerability using a large, manually curated dataset.<n>We observe significant behavioural variation across models, including refusal responses and complete silent non-responsiveness triggered by internal safety mechanisms.
arXiv Detail & Related papers (2026-02-24T12:32:11Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing [13.267217024192535]
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs)<n>We introduce GuardVal, a new evaluation protocol that generates and refines jailbreak prompts based on the defender LLM's state.<n>We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains.
arXiv Detail & Related papers (2025-07-10T13:15:20Z)
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement [48.50995874445193]
Large Language Models (LLMs) have shown impressive capabilities across various tasks but remain vulnerable to meticulously crafted jailbreak attacks.<n>We propose SAGE (Self-Aware Guard Enhancement), a training-free defense strategy designed to align LLMs' strong safety discrimination performance with their relatively weaker safety generation ability.
arXiv Detail & Related papers (2025-05-17T15:54:52Z)
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited. We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment. Certain scenarios suffer 25 times higher attack rates. Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
Can Small Language Models Reliably Resist Jailbreak Attacks? A Comprehensive Evaluation [10.987263424166477]
Small language models (SLMs) have emerged as promising alternatives to large language models (LLMs) In this paper, we conduct the first large-scale empirical study of SLMs' vulnerabilities to jailbreak attacks. We identify four key factors: model size, model architecture, training datasets and training techniques.
arXiv Detail & Related papers (2025-03-09T08:47:16Z)
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process.<n>CPO enhances the model's perception of the safety status of given dialogues.<n>Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z)
Adversarial Reasoning at Jailbreaking Time [49.70772424278124]
We develop an adversarial reasoning approach to automatic jailbreaking via test-time computation. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
arXiv Detail & Related papers (2025-02-03T18:59:01Z)
You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense [34.023473699165315]
We study the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies. We find that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously.
arXiv Detail & Related papers (2025-01-21T15:24:29Z)
HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment [1.8843687952462742]
This paper aims to address gaps in the current literature on jailbreaking techniques and the evaluation of LLM vulnerabilities. Our contributions include the creation of a novel dataset designed to assess the harmfulness of model outputs across multiple harm levels. We provide a comprehensive benchmark of state-of-the-art jailbreaking attacks, specifically targeting the Vicuna 13B v1.5 model.
arXiv Detail & Related papers (2024-11-11T10:02:49Z)
Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models [8.024771725860127]
Large Language Models (LLMs) remain vulnerable to jailbreak attacks that bypass their safety mechanisms. We introduce a novel scalable jailbreak attack that preempts the activation of an LLM's safety policies by occupying its computational resources.
arXiv Detail & Related papers (2024-10-05T15:10:01Z)
Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation [20.176424063726277]
Large language models (LLMs) are extensively adapted for downstream applications through a process known as "customization" Recent studies have revealed a vulnerability that tuning LLMs with malicious samples can compromise their robustness and amplify harmful content, an attack known as "jailbreaking"
arXiv Detail & Related papers (2024-10-03T05:24:38Z)
PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach [25.31933913962953]
Large Language Models (LLMs) have gained widespread use, raising concerns about their security. We introduce PathSeeker, a novel black-box jailbreak method, which is inspired by the game of rats escaping a maze. Our method outperforms five state-of-the-art attack techniques when tested across 13 commercial and open-source LLMs.
arXiv Detail & Related papers (2024-09-21T15:36:26Z)
ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2. Models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
Can LLMs be Fooled? Investigating Vulnerabilities in LLMs [4.927763944523323]
The advent of Large Language Models (LLMs) has garnered significant popularity and wielded immense power across various domains within Natural Language Processing (NLP) This paper will synthesize the findings from each vulnerability section and propose new directions of research and development. By understanding the focal points of current vulnerabilities, we can better anticipate and mitigate future risks.
arXiv Detail & Related papers (2024-07-30T04:08:00Z)
A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks. The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage. In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z)
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks [59.46556573924901]
This paper introduces Defensive Prompt Patch (DPP), a novel prompt-based defense mechanism for large language models (LLMs) Unlike previous approaches, DPP is designed to achieve a minimal Attack Success Rate (ASR) while preserving the high utility of LLMs. Empirical results conducted on LLAMA-2-7B-Chat and Mistral-7B-Instruct-v0.2 models demonstrate the robustness and adaptability of DPP.
arXiv Detail & Related papers (2024-05-30T14:40:35Z)
Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker. We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting [54.931241667414184]
We propose textbfAdaptive textbfShield Prompting, which prepends inputs with defense prompts to defend MLLMs against structure-based jailbreak attacks. Our methods can consistently improve MLLMs' robustness against structure-based jailbreak attacks.
arXiv Detail & Related papers (2024-03-14T15:57:13Z)
Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective [43.94115802328438]
Recent surge in jailbreaking attacks has revealed significant vulnerabilities in Large Language Models (LLMs) when exposed to malicious inputs. We suggest that the self-safeguarding capability of LLMs is linked to specific activity patterns within their representation space. Our findings demonstrate that these patterns can be detected with just a few pairs of contrastive queries.
arXiv Detail & Related papers (2024-01-12T00:50:04Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities. Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content. We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.