Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
- URL: http://arxiv.org/abs/2503.20807v1
- Date: Mon, 24 Mar 2025 20:41:57 GMT
- Title: Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models
- Authors: Pin-Yu Chen, Han Shen, Payel Das, Tianyi Chen
- Abstract summary: Fine-tuning Large Language Models (LLMs) on task-specific datasets has been a primary use of LLMs. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies.
- Score: 92.38300626647342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-tuning Large Language Models (LLMs) on task-specific datasets has been a primary use of LLMs. However, it has been empirically observed that this approach to enhancing capability inevitably compromises safety, a phenomenon known as the safety-capability trade-off in LLM fine-tuning. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies, providing new insights into the effects of data similarity, context overlap, and alignment loss landscape. Our theoretical results characterize the fundamental limits of the safety-capability trade-off in LLM fine-tuning, which are also validated by numerical experiments.
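As a rough, hand-wavy illustration of the trade-off described in the abstract (not the paper's actual framework, losses, or experiments), the sketch below fine-tunes a toy model on a weighted sum of a capability (task) loss and a safety (alignment) loss. The quadratic losses, the parameter dimension, and the weight `lam` are all assumptions chosen purely for illustration.

```python
# Toy sketch of safety-aware fine-tuning as a weighted-sum objective:
#   minimize  task_loss(theta) + lam * safety_loss(theta)
# The two quadratic losses have different minimizers, so sweeping `lam`
# traces a trade-off curve: lowering one loss raises the other.
import numpy as np

rng = np.random.default_rng(0)
dim = 8
theta_task = rng.normal(size=dim)   # hypothetical minimizer of the task (capability) loss
theta_safe = rng.normal(size=dim)   # hypothetical minimizer of the safety (alignment) loss

def task_loss(theta):
    return 0.5 * np.sum((theta - theta_task) ** 2)

def safety_loss(theta):
    return 0.5 * np.sum((theta - theta_safe) ** 2)

def fine_tune(lam, steps=500, lr=0.05):
    """Gradient descent on task_loss + lam * safety_loss."""
    theta = np.zeros(dim)
    for _ in range(steps):
        grad = (theta - theta_task) + lam * (theta - theta_safe)
        theta -= lr * grad
    return theta

for lam in [0.0, 0.5, 1.0, 2.0, 8.0]:
    theta = fine_tune(lam)
    print(f"lam={lam:4.1f}  task_loss={task_loss(theta):6.3f}  "
          f"safety_loss={safety_loss(theta):6.3f}")
```

In this toy setting, larger `lam` pulls the parameters toward the safety minimizer at the cost of higher task loss, mirroring the kind of frontier the paper characterizes theoretically.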
Related papers
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.
We propose fine-tuning MLLMs on a small set of benign instruction-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z) - Understanding and Rectifying Safety Perception Distortion in VLMs [19.239094089025095]
Vision-language models (VLMs) become more susceptible to harmful requests and jailbreak attacks after integrating the vision modality. Multimodal inputs introduce a modality-induced activation shift toward a "safer" direction compared to their text-only counterparts. We propose ShiftDC, a training-free method that decomposes and calibrates the modality-induced activation shift to reduce the impact of modality on safety.
arXiv Detail & Related papers (2025-02-18T18:06:48Z) - Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense [44.01174462291761]
Large Language Models (LLMs) have showcased remarkable capabilities across various domains. Activation approximation has emerged as a promising avenue for pursuing inference efficiency. Although these approximations achieve substantial speedups with minimal impact on utility, their safety implications remain unclear.
arXiv Detail & Related papers (2025-02-02T16:25:48Z) - LLM Safety Alignment is Divergence Estimation in Disguise [18.31821426379304]
We show that alignment methods function as divergence estimators between aligned (preferred or safe) and unaligned (less-preferred or harmful) distributions. Inspired by this theoretical result, we find that some alignment methods achieve better separation than others. We advocate for compliance-refusal datasets over preference datasets to enhance safety alignment.
arXiv Detail & Related papers (2025-02-02T04:09:42Z) - Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity [61.48338027901318]
We show that fine-tuning with LLM-generated data improves target task performance and reduces out-of-domain degradation. This is the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense [34.023473699165315]
We study the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies. We find that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously.
arXiv Detail & Related papers (2025-01-21T15:24:29Z) - On the Impact of Fine-Tuning on Chain-of-Thought Reasoning [26.11408084129897]
This study investigates the effect of fine-tuning on the reasoning abilities of large language models.
It addresses questions regarding the impact of task-specific fine-tuning on overall reasoning capabilities, the influence of fine-tuning on Chain-of-Thought (CoT) reasoning performance, and the implications for the faithfulness of CoT reasoning.
arXiv Detail & Related papers (2024-11-22T23:54:37Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in the common belief that the limited instruction-following ability of base LLMs safeguards against misuse.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z) - Empowering Autonomous Driving with Large Language Models: A Safety Perspective [82.90376711290808]
This paper explores the integration of Large Language Models (LLMs) into Autonomous Driving systems.
LLMs serve as intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning.
We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine.
arXiv Detail & Related papers (2023-11-28T03:13:09Z)