Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
- URL: http://arxiv.org/abs/2511.14423v1
- Date: Tue, 18 Nov 2025 12:27:51 GMT
- Title: Unified Defense for Large Language Models against Jailbreak and Fine-Tuning Attacks in Education
- Authors: Xin Yi, Yue Li, Dongsheng Shi, Linlin Wang, Xiaoling Wang, Liang He
- Abstract summary: Large Language Models (LLMs) are increasingly integrated into educational applications. LLMs are vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. We propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks.
- Score: 32.70143887942455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly integrated into educational applications. However, they remain vulnerable to jailbreak and fine-tuning attacks, which can compromise safety alignment and lead to harmful outputs. Existing studies mainly focus on general safety evaluations, with limited attention to the unique safety requirements of educational scenarios. To address this gap, we construct EduHarm, a benchmark containing safe-unsafe instruction pairs across five representative educational scenarios, enabling systematic safety evaluation of educational LLMs. Furthermore, we propose a three-stage shield framework (TSSF) for educational LLMs that simultaneously mitigates both jailbreak and fine-tuning attacks. First, safety-aware attention realignment redirects attention toward critical unsafe tokens, thereby restoring the harmfulness feature that discriminates between unsafe and safe inputs. Second, layer-wise safety judgment identifies harmfulness features by aggregating safety cues across multiple layers to detect unsafe instructions. Finally, defense-driven dual routing separates safe and unsafe queries, ensuring normal processing for benign inputs and guarded responses for harmful ones. Extensive experiments across eight jailbreak attack strategies demonstrate that TSSF effectively strengthens safety while preventing over-refusal of benign queries. Evaluations on three fine-tuning attack datasets further show that it consistently achieves robust defense against harmful queries while preserving utility gains from benign fine-tuning.
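The abstract's judgment-and-routing stages can be sketched in miniature. Everything below is an illustrative assumption, not the authors' implementation: the toy per-layer features, the linear probes, the mean aggregation, and the 0.5 threshold are all stand-ins for the paper's learned components.

```python
# Hedged sketch of TSSF stages 2 and 3: layer-wise safety judgment
# followed by defense-driven dual routing. All names, probes, and
# thresholds are illustrative assumptions, not the paper's method.
import numpy as np

N_LAYERS, HIDDEN = 4, 8
rng = np.random.default_rng(0)

def hidden_states(instruction: str) -> np.ndarray:
    # Stand-in for the LLM's per-layer hidden states of an instruction
    # (in the paper these would follow stage 1, safety-aware attention
    # realignment). Deterministic toy features for the sketch.
    seed = sum(ord(c) for c in instruction)
    return np.random.default_rng(seed).normal(size=(N_LAYERS, HIDDEN))

# Stage 2: one linear probe per layer scores harmfulness; the scores
# are aggregated across layers (here, a simple mean -- an assumption).
PROBES = rng.normal(size=(N_LAYERS, HIDDEN))  # assumed pre-trained

def harmfulness_score(instruction: str) -> float:
    h = hidden_states(instruction)
    logits = (PROBES * h).sum(axis=1)          # one logit per layer
    per_layer = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> [0, 1]
    return float(per_layer.mean())

# Stage 3: dual routing on the aggregated score.
THRESHOLD = 0.5  # illustrative cutoff

def route(instruction: str) -> str:
    if harmfulness_score(instruction) > THRESHOLD:
        return "guarded"  # refusal / guarded-response path
    return "normal"       # ordinary generation path
```

The design point the sketch illustrates is that detection (stage 2) and response policy (stage 3) are decoupled: benign queries never touch the guarded path, which is how the framework avoids over-refusal.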
Related papers
- Attributing and Exploiting Safety Vectors through Global Optimization in Large Language Models [50.91504059485288]
We propose a framework that identifies safety-critical attention heads through global optimization over all heads simultaneously. We develop a novel inference-time white-box jailbreak method that exploits the identified safety vectors through activation repatching.
arXiv Detail & Related papers (2026-01-22T09:32:43Z) - A Simple and Efficient Jailbreak Method Exploiting LLMs' Helpfulness [32.47621091096285]
Safety alignment aims to prevent Large Language Models (LLMs) from responding to harmful queries. In this paper, we introduce HILL, a novel jailbreak approach that transforms imperative harmful requests into learning-style questions. Experiments on the AdvBench dataset across a wide range of models demonstrate HILL's strong effectiveness, generalizability, and harmfulness.
arXiv Detail & Related papers (2025-09-17T04:21:20Z) - SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks [29.963044242980345]
Jailbreak attacks pose a serious threat to the safety of Large Language Models. We propose SafeLLM, a novel unlearning-based defense framework. We show that SafeLLM substantially reduces attack success rates while maintaining high general-purpose performance.
arXiv Detail & Related papers (2025-08-21T02:39:14Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - AlphaAlign: Incentivizing Safety Alignment with Extremely Simplified Reinforcement Learning [21.399086197886202]
Large language models (LLMs) possess latent safety understanding from their vast pretraining data. We propose AlphaAlign, a pure reinforcement learning framework with verifiable safety reward. This allows the model to develop proactive safety reasoning capabilities without depending on supervised safety-specific reasoning data.
arXiv Detail & Related papers (2025-07-20T14:47:03Z) - ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning [64.32925552574115]
ARMOR is a large language model that analyzes jailbreak strategies and extracts the core intent. ARMOR achieves state-of-the-art safety performance, with an average harmful rate of 0.002 and an attack success rate of 0.06 against advanced optimization-based jailbreaks.
arXiv Detail & Related papers (2025-07-14T09:05:54Z) - HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model [58.12612140992874]
We introduce a holistic safety dataset and benchmark, HoliSafe, that spans all five safe/unsafe image-text combinations. We also propose a novel modular framework for enhancing VLM safety with a visual guard module (VGM) designed to assess the harmfulness of input images. Experiments show that Safe-VLM with VGM, trained on our HoliSafe, achieves state-of-the-art safety performance across multiple VLM benchmarks.
arXiv Detail & Related papers (2025-06-05T07:26:34Z) - SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning [76.56522719330911]
Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering. LRMs nevertheless face great safety risks from harmful queries and adversarial attacks. We propose SafeKey to better activate the safety aha moment in the key sentence.
arXiv Detail & Related papers (2025-05-22T03:46:03Z) - SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning [51.78514648677898]
We propose SAFER, a framework for Safety Alignment via eFficient Ex-Ante Reasoning. Our approach instantiates structured Ex-Ante reasoning through initial assessment, rule verification, and path calibration. Experiments on multiple open-source LLMs demonstrate that SAFER significantly enhances safety performance while maintaining helpfulness and response efficiency.
arXiv Detail & Related papers (2025-04-03T16:07:38Z) - Improving LLM Safety Alignment with Dual-Objective Optimization [81.98466438000086]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks. We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z) - SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention [14.509085965856643]
We propose SafeIntervention (SafeInt), a novel defense method that shields large language models from jailbreak attacks. Built on our analysis of the representations of jailbreak samples, the core idea of SafeInt is to relocate jailbreak-related representations into the rejection region. We conduct comprehensive experiments covering six jailbreak attacks, two jailbreak datasets, and two utility benchmarks.
arXiv Detail & Related papers (2025-02-21T17:12:35Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process. CPO enhances the model's perception of the safety status of given dialogues. Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - Evaluating Defences against Unsafe Feedback in RLHF [26.872318173182414]
This paper looks at learning from unsafe feedback with reinforcement learning. We find that safety-aligned LLMs easily explore unsafe action spaces via generating harmful text. In order to protect against this vulnerability, we adapt a number of both "implicit" and "explicit" harmful fine-tuning defences.
arXiv Detail & Related papers (2024-09-19T17:10:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.