Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems
- URL: http://arxiv.org/abs/2602.05176v1
- Date: Thu, 05 Feb 2026 01:15:06 GMT
- Title: Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems
- Authors: Ziyuan Yang, Wenxuan Ding, Shangbin Feng, Yulia Tsvetkov
- Abstract summary: Malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains. We propose mitigation strategies to alleviate the impact of malicious components, by employing external supervisors.
- Score: 51.95643874494937
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on multi-LLM systems, especially in the reasoning and safety domains, where performance is lowered by 7.12% and 7.94% on average, respectively. We then propose mitigation strategies that alleviate the impact of malicious components by employing external supervisors, which oversee model collaboration and disable or mask out malicious models to reduce their influence. On average, these strategies recover 95.31% of the initial performance, although making model collaboration systems fully resistant to malicious models remains an open research question.
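The abstract describes external supervisors that disable or mask out malicious models before their outputs influence the collaborative result. The sketch below illustrates that general idea for one simple collaboration scheme, majority voting. It is not the paper's implementation: the function `supervised_vote`, the `trust` table, the toy supervisor, and the 0.5 threshold are all hypothetical names and values chosen for illustration.

```python
from collections import Counter
from typing import Callable, Dict


def supervised_vote(
    answers: Dict[str, str],
    supervisor_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> str:
    """Majority vote over model answers after masking distrusted models.

    `supervisor_score(model_name, answer)` is a hypothetical external
    supervisor returning a trust score in [0, 1]; models scoring below
    `threshold` are masked out of the vote.
    """
    kept = {
        name: answer
        for name, answer in answers.items()
        if supervisor_score(name, answer) >= threshold
    }
    # If the supervisor masks every model, fall back to the unfiltered vote
    # rather than returning nothing.
    pool = kept or answers
    return Counter(pool.values()).most_common(1)[0][0]


# Toy example (assumed data): three colluding models outvote two honest ones.
answers = {
    "model_a": "4",
    "model_b": "4",
    "model_c": "5",
    "model_d": "5",
    "model_e": "5",
}

# Hypothetical trust table standing in for a real supervisor's judgment.
trust = {"model_a": 0.9, "model_b": 0.8, "model_c": 0.1, "model_d": 0.2, "model_e": 0.3}


def toy_supervisor(name: str, answer: str) -> float:
    return trust[name]


unsupervised = supervised_vote(answers, lambda name, answer: 1.0)
supervised = supervised_vote(answers, toy_supervisor)
```

In this toy setting the unsupervised vote follows the colluding majority, while the supervised vote keeps only `model_a` and `model_b`. How a real supervisor would assign trust scores is the hard part and is left open here.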
Related papers
- The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems [55.28554025674495]
We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model.
We propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again.
arXiv Detail & Related papers (2026-02-05T01:20:32Z)
- OpenRT: An Open-Source Red Teaming Framework for Multimodal LLMs [36.57820295876294]
We introduce OpenRT, a unified, modular, and high-throughput red-teaming framework for MLLM safety evaluation.
At its core, OpenRT architects a paradigm shift in automated red-teaming by introducing an adversarial kernel that enables modular separation across five dimensions.
Our framework integrates 37 diverse attack methodologies, spanning white-box gradients, multi-modal perturbations, and sophisticated multi-agent evolutionary strategies.
arXiv Detail & Related papers (2026-01-04T16:41:33Z)
- KCM: KAN-Based Collaboration Models Enhance Pretrained Large Models [62.658961779827145]
We propose a KAN-based Collaborative Model (KCM) as an improved approach to large-small model collaboration.
KAN offers superior visualizability and interpretability while mitigating catastrophic forgetting.
arXiv Detail & Related papers (2025-10-23T07:06:21Z)
- Merge Now, Regret Later: The Hidden Cost of Model Merging is Adversarial Transferability [1.2719327447589344]
We study the effect of Model Merging (MM) on the transferability of adversarial examples.
We show MM cannot reliably defend against transfer attacks, with over 95% relative transfer attack success rate.
Our findings offer critical insights for designing more secure systems employing MM.
arXiv Detail & Related papers (2025-09-28T07:01:21Z)
- LLM4MEA: Data-free Model Extraction Attacks on Sequential Recommenders via Large Language Models [50.794651919028965]
Recent studies have demonstrated the vulnerability of sequential recommender systems to Model Extraction Attacks (MEAs).
Black-box attacks in prior MEAs are ineffective at exposing recommender system vulnerabilities due to random sampling in data selection.
We propose LLM4MEA, a novel model extraction method that leverages Large Language Models (LLMs) as human-like rankers to generate data.
arXiv Detail & Related papers (2025-07-22T19:20:23Z)
- Defending Deep Neural Networks against Backdoor Attacks via Module Switching [15.979018992591032]
An exponential increase in the parameters of Deep Neural Networks (DNNs) has significantly raised the cost of independent training.
Open-source models are more vulnerable to malicious threats, such as backdoor attacks.
We propose a novel module-switching strategy to break such spurious correlations within the model's propagation path.
arXiv Detail & Related papers (2025-04-08T11:01:07Z)
- Towards Adversarially Robust Deep Metric Learning [0.8702432681310401]
Deep neural networks are prone to adversarial attacks and could be easily fooled by adversarial examples.
Existing works fail to thoroughly inspect the robustness of DML models.
We propose a new defense, the Ensemble Adversarial Training (EAT), which exploits ensemble learning and adversarial training.
arXiv Detail & Related papers (2025-01-02T03:15:25Z)
- Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace [15.457992715866995]
We propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities.
Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points.
arXiv Detail & Related papers (2024-10-17T00:13:31Z)
- Here's a Free Lunch: Sanitizing Backdoored Models with Model Merge [17.3048898399324]
Democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies.
backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability.
This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities.
arXiv Detail & Related papers (2024-02-29T16:37:08Z)
- Effective Backdoor Mitigation in Vision-Language Models Depends on the Pre-training Objective [71.39995120597999]
Modern machine learning models are vulnerable to adversarial and backdoor attacks.
Such risks are heightened by the prevalent practice of collecting massive, internet-sourced datasets for training multimodal models.
CleanCLIP is the current state-of-the-art approach to mitigate the effects of backdooring in multimodal models.
arXiv Detail & Related papers (2023-11-25T06:55:13Z)
- ML-Doctor: Holistic Risk Assessment of Inference Attacks Against Machine Learning Models [64.03398193325572]
Inference attacks against Machine Learning (ML) models allow adversaries to learn about training data, model parameters, etc.
We concentrate on four attacks - namely, membership inference, model inversion, attribute inference, and model stealing.
Our analysis relies on a modular re-usable software, ML-Doctor, which enables ML model owners to assess the risks of deploying their models.
arXiv Detail & Related papers (2021-02-04T11:35:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.