Related papers: Merge Now, Regret Later: The Hidden Cost of Model Merging is Adversarial Transferability

URL: http://arxiv.org/abs/2509.23689v1
Date: Sun, 28 Sep 2025 07:01:21 GMT
Title: Merge Now, Regret Later: The Hidden Cost of Model Merging is Adversarial Transferability
Authors: Ankit Gangwal, Aaryan Ajay Sharma
Abstract summary: We study the effect of Model Merging (MM) on the transferability of adversarial examples. We show MM cannot reliably defend against transfer attacks, with over 95% relative transfer attack success rate. Our findings offer critical insights for designing more secure systems employing MM.
Score: 1.2719327447589344
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Model Merging (MM) has emerged as a promising alternative to multi-task learning, where multiple fine-tuned models are combined, without access to tasks' training data, into a single model that maintains performance across tasks. Recent works have explored the impact of MM on adversarial attacks, particularly backdoor attacks. However, none of them have sufficiently explored its impact on transfer attacks using adversarial examples, i.e., a black-box adversarial attack where examples generated for a surrogate model successfully mislead a target model. In this work, we study the effect of MM on the transferability of adversarial examples. We perform comprehensive evaluations and statistical analysis consisting of 8 MM methods, 7 datasets, and 6 attack methods, sweeping over 336 distinct attack settings. Through it, we first challenge the prevailing notion of MM conferring free adversarial robustness, and show MM cannot reliably defend against transfer attacks, with over 95% relative transfer attack success rate. Moreover, we reveal 3 key insights for machine-learning practitioners regarding MM and transferability for a robust system design: (1) stronger MM methods increase vulnerability to transfer attacks; (2) mitigating representation bias increases vulnerability to transfer attacks; and (3) weight averaging, despite being the weakest MM method, is the most vulnerable MM method to transfer attacks. Finally, we analyze the underlying reasons for this increased vulnerability, and provide potential solutions to the problem. Our findings offer critical insights for designing more secure systems employing MM.
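
To make the two ideas in the abstract concrete (merging by weight averaging, and a transfer attack crafted on a surrogate and replayed on a target), here is a minimal sketch. It is an illustration under stated assumptions, not the authors' code: the models are toy linear softmax classifiers with hand-set weights, the attack is one-step FGSM chosen as a stand-in (the abstract does not name its 6 attack methods), and the metric is a plain transfer success rate, since the abstract does not define its "relative" variant. All function and variable names are hypothetical.

```python
import numpy as np

def weight_average(state_dicts):
    """Merge models by uniformly averaging each parameter across models."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

def logits(params, x):
    return x @ params["W"] + params["b"]

def predict(params, x):
    return np.argmax(logits(params, x), axis=1)

def fgsm(params, x, y, eps):
    """One-step FGSM against a linear softmax classifier (the surrogate)."""
    z = logits(params, x)
    z = z - z.max(axis=1, keepdims=True)        # numerically stable softmax
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    onehot = np.eye(p.shape[1])[y]
    grad_x = (p - onehot) @ params["W"].T       # d(cross-entropy)/dx
    return x + eps * np.sign(grad_x)

def transfer_success_rate(target, x, x_adv, y):
    """Share of examples the target gets right when clean but wrong when perturbed."""
    clean_ok = predict(target, x) == y
    if not clean_ok.any():
        return 0.0
    fooled = predict(target, x_adv) != y
    return float(np.mean(fooled[clean_ok]))

# Toy setup (assumed): two "fine-tuned" models that each rely on one input feature.
model_a = {"W": np.array([[2.0, -2.0], [0.0, 0.0]]), "b": np.zeros(2)}
model_b = {"W": np.array([[0.0, 0.0], [2.0, -2.0]]), "b": np.zeros(2)}
merged = weight_average([model_a, model_b])     # the target model

x = np.array([[0.3, 0.1], [-0.3, -0.1]])
y = np.array([0, 1])

# Craft adversarial examples on the surrogate only, then evaluate them on the merged target.
x_adv = fgsm(model_a, x, y, eps=0.5)
rate = transfer_success_rate(merged, x, x_adv, y)
print(f"transfer success rate on merged model: {rate:.2f}")
```

In this toy case the surrogate is one of the merged models, which is only one of many possible threat models; the sketch says nothing about which merging methods are more or less vulnerable in practice.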

Related papers

Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems [51.95643874494937]
Malicious models have a severe impact on multi-LLM systems, especially in the reasoning and safety domains. We propose mitigation strategies that employ external supervisors to alleviate the impact of malicious components.
arXiv Detail & Related papers (2026-02-05T01:15:06Z)
Boosting Adversarial Transferability Against Defenses via Multi-Scale Transformation [0.8388591755871736]
The transferability of adversarial examples poses a significant security challenge for deep neural networks. We propose a new Segmented Gaussian Pyramid (SGP) attack method to enhance the transferability. In contrast to the state-of-the-art methods, SGP significantly enhances attack success rates against black-box defense models.
arXiv Detail & Related papers (2025-07-02T15:16:30Z)
A Simple DropConnect Approach to Transfer-based Targeted Attack [43.039945949426546]
We study the problem of transfer-based black-box attack, where adversarial samples generated using a single surrogate model are directly applied to target models. We propose to Mitigate perturbation Co-adaptation by DropConnect (MCD) to enhance transferability. In the challenging scenario of transferring from a CNN-based model to Transformer-based models, MCD achieves 13% higher average ASRs compared with state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-24T12:29:23Z)
BadMerging: Backdoor Attacks Against Model Merging [17.797688491548175]
We introduce BadMerging, the first backdoor attack specifically designed for Model Merging (MM). BadMerging comprises a two-stage attack mechanism and a novel feature-interpolation-based loss to enhance the robustness of embedded backdoors. Our experiments show that BadMerging achieves remarkable attacks against various MM algorithms.
arXiv Detail & Related papers (2024-08-14T08:19:23Z)
Downstream Transfer Attack: Adversarial Attacks on Downstream Models with Pre-trained Vision Transformers [95.22517830759193]
This paper studies the transferability of such an adversarial vulnerability from a pre-trained ViT model to downstream tasks. We show that DTA achieves an average attack success rate (ASR) exceeding 90%, surpassing existing methods by a huge margin.
arXiv Detail & Related papers (2024-08-03T08:07:03Z)
On the Robustness of Large Multimodal Models Against Image Adversarial Attacks [81.2935966933355]
We study the impact of visual adversarial attacks on Large Multimodal Models (LMMs). We find that in general LMMs are not robust to visual adversarial inputs. We propose a new approach to real-world image classification which we term query decomposition.
arXiv Detail & Related papers (2023-12-06T04:59:56Z)
Avoid Adversarial Adaption in Federated Learning by Multi-Metric Investigations [55.2480439325792]
Federated Learning (FL) facilitates decentralized machine learning model training, preserving data privacy, lowering communication costs, and boosting model performance through diversified data sources. FL faces vulnerabilities such as poisoning attacks, undermining model integrity with both untargeted performance degradation and targeted backdoor attacks. We define a new notion of strong adaptive adversaries, capable of adapting to multiple objectives simultaneously. MESAS is the first defense robust against strong adaptive adversaries, effective in real-world data scenarios, with an average overhead of just 24.37 seconds.
arXiv Detail & Related papers (2023-06-06T11:44:42Z)
Learning to Learn Transferable Attack [77.67399621530052]
Transfer adversarial attack is a non-trivial black-box adversarial attack that aims to craft adversarial perturbations on the surrogate model and then apply such perturbations to the victim model. We propose a Learning to Learn Transferable Attack (LLTA) method, which makes the adversarial perturbations more generalized via learning from both data and model augmentation. Empirical results on the widely-used dataset demonstrate the effectiveness of our attack method with a 12.85% higher success rate of transfer attack compared with the state-of-the-art methods.
arXiv Detail & Related papers (2021-12-10T07:24:21Z)
Training Meta-Surrogate Model for Transferable Adversarial Attack [98.13178217557193]
We consider adversarial attacks on a black-box model when no queries are allowed. In this setting, many methods directly attack surrogate models and transfer the obtained adversarial examples to fool the target model. We show we can obtain a Meta-Surrogate Model (MSM) such that attacks on this model transfer more easily to other models.
arXiv Detail & Related papers (2021-09-05T03:27:46Z)
Direction-Aggregated Attack for Transferable Adversarial Examples [10.208465711975242]
A deep neural network is vulnerable to adversarial examples crafted by imposing imperceptible changes to the inputs. Adversarial examples are most successful in white-box settings where the model and its parameters are available. We propose the Direction-Aggregated adversarial attacks that deliver transferable adversarial examples.
arXiv Detail & Related papers (2021-04-19T09:54:56Z)

This list is automatically generated from the titles and abstracts of the papers on this site.