Related papers: MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models

URL: http://arxiv.org/abs/2511.17448v1
Date: Fri, 21 Nov 2025 17:46:44 GMT
Title: MMT-ARD: Multimodal Multi-Teacher Adversarial Distillation for Robust Vision-Language Models
Authors: Yuqi Li, Junhao Dong, Chuanguang Yang, Shiping Wen, Piotr Koniusz, Tingwen Huang, Yingli Tian, Yew-Soon Ong
Abstract summary: We propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. Experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5%.
Score: 123.90007730845876
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Vision-Language Models (VLMs) are increasingly deployed in safety-critical applications, making their adversarial robustness a crucial concern. While adversarial knowledge distillation has shown promise in transferring robustness from teacher to student models, traditional single-teacher approaches suffer from limited knowledge diversity, slow convergence, and difficulty in balancing robustness and accuracy. To address these challenges, we propose MMT-ARD: a Multimodal Multi-Teacher Adversarial Robust Distillation framework. Our key innovation is a dual-teacher knowledge fusion architecture that collaboratively optimizes clean feature preservation and robust feature enhancement. To better handle challenging adversarial examples, we introduce a dynamic weight allocation strategy based on teacher confidence, enabling adaptive focus on harder samples. Moreover, to mitigate bias among teachers, we design an adaptive sigmoid-based weighting function that balances the strength of knowledge transfer across modalities. Extensive experiments on ImageNet and zero-shot benchmarks demonstrate that MMT-ARD improves robust accuracy by +4.32% and zero-shot accuracy by +3.5% on the ViT-B-32 model, while achieving a 2.3x increase in training efficiency over traditional single-teacher methods. These results highlight the effectiveness and scalability of MMT-ARD in enhancing the adversarial robustness of multimodal large models. Our codes are available at https://github.com/itsnotacie/MMT-ARD.
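The abstract describes confidence-based dynamic weighting of two teachers and a sigmoid-based balancing function, but gives no formulas. The following is only a minimal sketch of how such a scheme could look, not the authors' implementation: the function names (`teacher_weights`, `dual_teacher_loss`), the `temperature` parameter, and the specific choice of weighting by the sigmoid of the confidence gap are all assumptions made for illustration. The official code is at the repository linked in the abstract.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for two discrete distributions of equal length.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def teacher_weights(clean_logits, robust_logits, label, temperature=1.0):
    # ASSUMPTION: each teacher's confidence is its softmax probability on the
    # true label, and the robust teacher's weight is the sigmoid of the
    # confidence gap. The paper's actual weighting function may differ.
    conf_clean = softmax(clean_logits)[label]
    conf_robust = softmax(robust_logits)[label]
    w_robust = sigmoid((conf_robust - conf_clean) / temperature)
    return 1.0 - w_robust, w_robust

def dual_teacher_loss(student_logits, clean_logits, robust_logits, label,
                      temperature=1.0):
    # Weighted sum of KL terms pulling the student toward each teacher.
    w_clean, w_robust = teacher_weights(clean_logits, robust_logits, label,
                                        temperature)
    p_student = softmax(student_logits)
    loss_clean = kl_divergence(softmax(clean_logits), p_student)
    loss_robust = kl_divergence(softmax(robust_logits), p_student)
    return w_clean * loss_clean + w_robust * loss_robust
```

In this sketch the two weights always sum to 1, and a smaller `temperature` makes the weighting switch more sharply toward the more confident teacher.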

Related papers

From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
AdaSwitch: Adaptive Switching Generation for Knowledge Distillation [58.647880811071495]
Small language models (SLMs) are crucial for applications with strict latency and computational constraints. We propose AdaSwitch, a novel approach that combines on-policy and off-policy generation at the token level. AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
arXiv Detail & Related papers (2025-10-09T06:38:37Z)
CIARD: Cyclic Iterative Adversarial Robustness Distillation [19.685981220232712]
Adversarial robustness distillation (ARD) aims to transfer performance and robustness from a teacher model to a student model. Existing ARD approaches enhance the student model's robustness, but an inevitable by-product is degraded performance on clean examples. We propose a novel Cyclic Iterative ARD (CIARD) method with two key innovations.
arXiv Detail & Related papers (2025-09-16T03:51:43Z)
AMMKD: Adaptive Multimodal Multi-teacher Distillation for Lightweight Vision-Language Models [35.71783914954563]
We propose a novel framework that integrates multi-modal feature fusion, multi-teacher distillation, and adaptive optimization to deliver lightweight yet effective retrieval models. Experiments on three benchmark datasets demonstrate that AMMKD achieves superior performance while significantly reducing model complexity, validating its effectiveness and flexibility.
arXiv Detail & Related papers (2025-08-23T04:52:20Z)
Optimizing Robustness and Accuracy in Mixture of Experts: A Dual-Model Approach [14.639659415276533]
Mixture of Experts (MoE) models have shown remarkable success in leveraging specialized expert networks for complex machine learning tasks. Their susceptibility to adversarial attacks presents a critical challenge for deployment in robust applications. This paper addresses the question of how to incorporate robustness into MoEs while maintaining high natural accuracy.
arXiv Detail & Related papers (2025-02-05T20:45:52Z)
Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark [69.02666229531322]
We introduce a pioneering study that investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD). We find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation. We propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR.
arXiv Detail & Related papers (2024-10-02T16:47:55Z)
FullLoRA: Efficiently Boosting the Robustness of Pretrained Vision Transformers [72.83770102062141]
The Vision Transformer (ViT) model has gradually become mainstream in various computer vision tasks. Existing large models tend to prioritize performance during training, potentially neglecting robustness. We develop a novel LNLoRA module, incorporating a learnable layer normalization before the conventional LoRA module. We propose the FullLoRA framework by integrating the learnable LNLoRA modules into all key components of ViT-based models.
arXiv Detail & Related papers (2024-01-03T14:08:39Z)
VideoAdviser: Video Knowledge Distillation for Multimodal Transfer Learning [6.379202839994046]
Multimodal transfer learning aims to transform pretrained representations of diverse modalities into a common domain space for effective multimodal fusion. We propose VideoAdviser, a video knowledge distillation method to transfer multimodal knowledge of video-enhanced prompts from a multimodal fundamental model to a specific modal fundamental model. We evaluate our method in two challenging multimodal tasks: video-level sentiment analysis and audio-visual retrieval.
arXiv Detail & Related papers (2023-09-27T08:44:04Z)
Adversarial Contrastive Distillation with Adaptive Denoising [15.119013995045192]
We propose Contrastive Relationship DeNoise Distillation (CRDND) to boost the robustness of small models. We show CRDND can transfer robust knowledge efficiently and achieves state-of-the-art performances.
arXiv Detail & Related papers (2023-02-17T09:00:18Z)
Mutual Adversarial Training: Learning together is better than going alone [82.78852509965547]
We study how interactions among models affect robustness via knowledge distillation. We propose mutual adversarial training (MAT) in which multiple models are trained together. MAT can effectively improve model robustness and outperform state-of-the-art methods under white-box attacks.
arXiv Detail & Related papers (2021-12-09T15:59:42Z)
Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning. We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline. We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.