Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
- URL: http://arxiv.org/abs/2505.16789v2
- Date: Mon, 28 Jul 2025 05:16:47 GMT
- Title: Accidental Vulnerability: Factors in Fine-Tuning that Shift Model Safeguards
- Authors: Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
- Abstract summary: As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. In this work, we investigate Accidental Vulnerability: unexpected vulnerabilities arising from characteristics of fine-tuning data.
- Score: 13.197807179926428
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As large language models (LLMs) gain popularity, their vulnerability to adversarial attacks emerges as a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can inadvertently introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Vulnerability, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity across multiple experimental datasets. We then evaluate the adversarial robustness of these fine-tuned models, analyzing persona shifts and interpretability traits to understand how dataset factors contribute to attack success rates. Lastly, we explore causal relationships that offer new insights into adversarial defense strategies, highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_vulnerability.
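To make the factor analysis concrete, below is a minimal sketch of the kind of dataset-level correlation study the abstract describes: it scores each fine-tuning dataset on two candidate factors (semantic similarity to harmful probe prompts and mean toxicity) and correlates those scores with the attack success rates measured after fine-tuning. This is not the authors' released pipeline; the embedding model, the `detoxify` toxicity scorer, and all inputs are illustrative assumptions.

```python
# Illustrative sketch (not the released code): relate fine-tuning dataset
# factors to attack success rates (ASR) of models fine-tuned on those datasets.
# Assumptions: sentence-transformers and detoxify are installed; the caller
# supplies dataset texts, harmful probe prompts, and measured ASR values.
import numpy as np
from scipy.stats import pearsonr
from sentence_transformers import SentenceTransformer, util
from detoxify import Detoxify

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # assumed embedding model
tox_model = Detoxify("original")                     # assumed toxicity scorer

def dataset_factors(texts, harmful_probes):
    """Return (mean cosine similarity to harmful probes, mean toxicity) for one dataset."""
    emb_texts = embedder.encode(texts, convert_to_tensor=True)
    emb_probes = embedder.encode(harmful_probes, convert_to_tensor=True)
    sim = util.cos_sim(emb_texts, emb_probes).mean().item()
    tox = float(np.mean(tox_model.predict(texts)["toxicity"]))
    return sim, tox

def correlate_with_asr(datasets, harmful_probes, attack_success_rates):
    """Pearson correlation between each dataset-level factor and the observed ASR."""
    factors = np.array([dataset_factors(texts, harmful_probes) for texts in datasets])
    for name, column in zip(["similarity", "toxicity"], factors.T):
        r, p = pearsonr(column, attack_success_rates)
        print(f"{name}: r={r:.3f}, p={p:.3f}")
```

In this framing, a factor that correlates strongly with post-fine-tuning attack success would be flagged as a dataset property worth controlling during data curation.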
Related papers
- Exploiting Edge Features for Transferable Adversarial Attacks in Distributed Machine Learning [54.26807397329468]
This work explores a previously overlooked vulnerability in distributed deep learning systems. An adversary who intercepts the intermediate features transmitted between the distributed components can still pose a serious threat. We propose an exploitation strategy specifically designed for distributed settings.
arXiv Detail & Related papers (2025-07-09T20:09:00Z)
- Preference Learning for AI Alignment: a Causal Perspective [55.2480439325792]
We frame this problem in a causal paradigm, bringing the rich toolbox of causality to bear on its persistent challenges. Drawing on the causal-inference literature, we identify key assumptions necessary for reliable generalisation. We illustrate failure modes of naive reward models and demonstrate how causally inspired approaches can improve model robustness.
arXiv Detail & Related papers (2025-06-06T10:45:42Z)
- Detecting Instruction Fine-tuning Attack on Language Models with Influence Function [6.760293300577228]
Instruction fine-tuning attacks undermine model alignment and pose security risks in real-world deployment. We present a simple and effective approach to detect and mitigate such attacks using influence functions. We are the first to apply influence functions for detecting language model instruction fine-tuning attacks on large-scale datasets.
arXiv Detail & Related papers (2025-04-12T00:50:28Z)
- The dark deep side of DeepSeek: Fine-tuning attacks against the safety alignment of CoT-enabled models [10.524960491460945]
Fine-tuning attacks can exploit large language models to reveal potentially harmful behaviours. This paper investigates the performance of the Chain-of-Thought-based reasoning model DeepSeek when subjected to fine-tuning attacks. We aim to shed light on the vulnerability of Chain-of-Thought-enabled models to fine-tuning attacks and the implications for their safety and ethical deployment.
arXiv Detail & Related papers (2025-02-03T10:28:26Z)
- Investigating Imperceptibility of Adversarial Attacks on Tabular Data: An Empirical Analysis [1.6693963355435217]
Adversarial attacks are a potential threat to machine learning models.
These attacks cause incorrect predictions through imperceptible perturbations to the input data.
This study proposes a set of key properties and corresponding metrics to assess the imperceptibility of adversarial attacks.
arXiv Detail & Related papers (2024-07-16T07:55:25Z)
- Privacy Backdoors: Enhancing Membership Inference through Poisoning Pre-trained Models [112.48136829374741]
In this paper, we unveil a new vulnerability: the privacy backdoor attack.
When a victim fine-tunes a backdoored model, their training data will be leaked at a significantly higher rate than if they had fine-tuned a typical model.
Our findings highlight a critical privacy concern within the machine learning community and call for a reevaluation of safety protocols in the use of open-source pre-trained models.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
- SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
In contrast to white-box adversarial attacks, transfer attacks are more reflective of real-world scenarios.
We propose a self-augment-based transfer attack method, termed SA-Attack.
arXiv Detail & Related papers (2023-12-08T09:08:50Z)
- Unveiling Safety Vulnerabilities of Large Language Models [4.562678399685183]
This paper introduces a unique dataset containing adversarial examples in the form of questions, which we call AttaQ.
We assess the efficacy of our dataset by analyzing the vulnerabilities of various models when subjected to it.
We introduce a novel automatic approach for identifying and naming vulnerable semantic regions.
arXiv Detail & Related papers (2023-11-07T16:50:33Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
- Clustering Effect of (Linearized) Adversarial Robust Models [60.25668525218051]
We propose a novel understanding of adversarial robustness and apply it to additional tasks, including domain adaptation and robustness boosting.
Experimental evaluations demonstrate the rationality and superiority of our proposed clustering strategy.
arXiv Detail & Related papers (2021-11-25T05:51:03Z)
- Enhancing Model Robustness and Fairness with Causality: A Regularization Approach [15.981724441808147]
Recent work has raised concerns about the risk of spurious correlations and unintended biases in machine learning models.
We propose a simple and intuitive regularization approach to integrate causal knowledge during model training.
We build a predictive model that relies more on causal features and less on non-causal features.
arXiv Detail & Related papers (2021-10-03T02:49:33Z)
- Feature Importance-aware Transferable Adversarial Attacks [46.12026564065764]
Existing transferable attacks tend to craft adversarial examples by indiscriminately distorting features.
We argue that such brute-force degradation would introduce a model-specific local optimum into adversarial examples.
By contrast, we propose the Feature Importance-aware Attack (FIA), which disrupts important object-aware features.
arXiv Detail & Related papers (2021-07-29T17:13:29Z)
- Explainable Adversarial Attacks in Deep Neural Networks Using Activation Profiles [69.9674326582747]
This paper presents a visual framework to investigate neural network models subjected to adversarial examples.
We show how observing these elements can quickly pinpoint exploited areas in a model.
arXiv Detail & Related papers (2021-03-18T13:04:21Z)
- Firearm Detection via Convolutional Neural Networks: Comparing a Semantic Segmentation Model Against End-to-End Solutions [68.8204255655161]
Threat detection of weapons and aggressive behavior from live video can be used for rapid detection and prevention of potentially deadly incidents.
One way for achieving this is through the use of artificial intelligence and, in particular, machine learning for image analysis.
We compare a traditional monolithic end-to-end deep learning model and a previously proposed model based on an ensemble of simpler neural networks detecting fire-weapons via semantic segmentation.
arXiv Detail & Related papers (2020-12-17T15:19:29Z)
- On the Transferability of Adversarial Attacks against Neural Text Classifier [121.6758865857686]
We investigate the transferability of adversarial examples for text classification models.
We propose a genetic algorithm to find an ensemble of models that can induce adversarial examples to fool almost all existing models.
We derive word replacement rules that can be used for model diagnostics from these adversarial examples.
arXiv Detail & Related papers (2020-11-17T10:45:05Z)