From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge
- URL: http://arxiv.org/abs/2511.07049v1
- Date: Mon, 10 Nov 2025 12:42:32 GMT
- Title: From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge
- Authors: Hui Lu, Yi Yu, Song Xia, Yiming Yang, Deepu Rajan, Boon Poh Ng, Alex Kot, Xudong Jiang
- Abstract summary: This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs. We propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy.
- Score: 57.379583179331426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale Video Foundation Models (VFMs) have significantly advanced various video-related tasks, either through task-specific models or Multi-modal Large Language Models (MLLMs). However, the open accessibility of VFMs also introduces critical security risks, as adversaries can exploit full knowledge of the VFMs to launch potent attacks. This paper investigates a novel and practical adversarial threat scenario: attacking downstream models or MLLMs fine-tuned from open-source VFMs, without requiring access to the victim task, training data, model queries, or architecture. In contrast to conventional transfer-based attacks that rely on task-aligned surrogate models, we demonstrate that adversarial vulnerabilities can be exploited directly from the VFMs. To this end, we propose the Transferable Video Attack (TVA), a temporal-aware adversarial attack method that leverages the temporal representation dynamics of VFMs to craft effective perturbations. TVA integrates a bidirectional contrastive learning mechanism to maximize the discrepancy between the clean and adversarial features, and introduces a temporal consistency loss that exploits motion cues to enhance the sequential impact of perturbations. TVA avoids the need to train expensive surrogate models or to access domain-specific data, thereby offering a more practical and efficient attack strategy. Extensive experiments across 24 video-related tasks demonstrate the efficacy of TVA against downstream models and MLLMs, revealing a previously underexplored security vulnerability in the deployment of video models.
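The abstract names two loss terms but not their exact form, so the following is only a minimal PyTorch sketch of a TVA-style attack loop under stated assumptions: `encoder` is a generic differentiable video foundation model mapping a clip to a feature vector, and both the "bidirectional contrastive" term (here, pushing adversarial features away from clean ones for the forward and the temporally reversed clip) and the temporal-consistency term (here, aligning the perturbation's frame-to-frame change with the clip's motion cues) are plausible stand-ins rather than the authors' objective.

```python
import torch
import torch.nn.functional as F

def tva_attack(encoder, video, eps=8/255, alpha=1/255, steps=10, lam=0.5):
    """video: (B, T, C, H, W) clip in [0, 1]; encoder maps a clip to (B, D) features.
    All hyperparameters are illustrative, not taken from the paper."""
    delta = torch.empty_like(video).uniform_(-eps, eps).requires_grad_(True)
    with torch.no_grad():
        f_clean = F.normalize(encoder(video), dim=-1)
        f_clean_rev = F.normalize(encoder(video.flip(1)), dim=-1)  # reversed clip

    for _ in range(steps):
        adv = (video + delta).clamp(0, 1)
        f_adv = F.normalize(encoder(adv), dim=-1)
        f_adv_rev = F.normalize(encoder(adv.flip(1)), dim=-1)

        # Stand-in "bidirectional contrastive" term: minimizing it pushes
        # adversarial features away from clean ones in both temporal directions.
        l_feat = (F.cosine_similarity(f_adv, f_clean, dim=-1).mean()
                  + F.cosine_similarity(f_adv_rev, f_clean_rev, dim=-1).mean())

        # Stand-in temporal-consistency term: encourage the perturbation's
        # frame-to-frame change to follow the clip's own motion cues.
        motion = (video[:, 1:] - video[:, :-1]).abs()
        delta_change = (delta[:, 1:] - delta[:, :-1]).abs()
        l_temp = -(motion * delta_change).mean()

        loss = l_feat + lam * l_temp  # gradient descent on this drives features apart
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad = None

    return (video + delta).clamp(0, 1).detach()
```

The key property the sketch illustrates is the threat model: the whole loop touches only the open-source VFM, never the victim task, its data, or its architecture.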
Related papers
- Model Inversion Attacks on Vision-Language Models: Do They Leak What They Learn? [22.1843868052012]
Model inversion (MI) attacks pose significant privacy risks by reconstructing private training data from trained neural networks. We conduct the first study to understand the vulnerability of vision-language models (VLMs) to leaking private visual training data. We propose a suite of novel token-based and sequence-based model inversion strategies.
arXiv Detail & Related papers (2025-08-06T05:30:05Z)
- CAVALRY-V: A Large-Scale Generator Framework for Adversarial Attacks on Video MLLMs [13.238196682784562]
We present CAVALRY-V (Cross-modal Language-Vision Adversarial Yielding for Videos), a novel framework that targets the critical interface between visual perception and language generation in multimodal large language models. Our framework achieves flexibility through implicit temporal coherence modeling rather than explicit regularization, enabling significant performance improvements even on image understanding.
arXiv Detail & Related papers (2025-07-01T14:48:27Z)
- Attacking Attention of Foundation Models Disrupts Downstream Tasks [18.92561703051693]
Foundation models are large models trained on broad data that deliver high accuracy in many downstream tasks. These models are vulnerable to adversarial attacks. This paper studies the vulnerabilities of vision foundation models, focusing specifically on CLIP and ViTs. We introduce a novel attack targeting the structure of transformer-based architectures in a task-agnostic fashion.
arXiv Detail & Related papers (2025-06-03T19:42:48Z)
- Vid-SME: Membership Inference Attacks against Large Video Understanding Models [56.31088116526825]
We introduce Vid-SME, the first membership inference method tailored for video data used in video understanding large language models (VULLMs). By leveraging the SME difference between natural and temporally-reversed video frames, Vid-SME derives robust membership scores to determine whether a given video is part of the model's training set. Experiments on various self-trained and open-sourced VULLMs demonstrate the strong effectiveness of Vid-SME.
arXiv Detail & Related papers (2025-05-29T13:17:25Z)
- Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks [34.40254709148148]
Pre-trained vision-language models (VLMs) have showcased remarkable performance in image and natural language understanding.
Their potential safety and robustness issues raise concerns that adversaries may evade the system and cause these models to generate toxic content through malicious attacks.
We present Chain of Attack (CoA), which iteratively enhances the generation of adversarial examples based on multi-modal semantic updates.
arXiv Detail & Related papers (2024-11-24T05:28:07Z)
- Transferable Adversarial Attacks on SAM and Its Downstream Models [87.23908485521439]
This paper explores the feasibility of adversarially attacking various downstream models fine-tuned from the Segment Anything Model (SAM). To enhance the effectiveness of the adversarial attack towards models fine-tuned on unknown datasets, we propose a universal meta-initialization (UMI) algorithm.
arXiv Detail & Related papers (2024-10-26T15:04:04Z)
- Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models [8.943713711458633]
We propose a new attack paradigm called Feedback-based Modal Mutual Search (FMMS).
FMMS aims to push away the matched image-text pairs while randomly drawing mismatched pairs closer in feature space.
This is the first work to exploit target model feedback to explore multi-modality adversarial boundaries.
arXiv Detail & Related papers (2024-08-27T02:31:39Z)
- A Survey of Attacks on Large Vision-Language Models: Resources, Advances, and Future Trends [78.3201480023907]
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of multimodal understanding and reasoning tasks.
The vulnerability of LVLMs is relatively underexplored, posing potential security risks in daily usage.
In this paper, we provide a comprehensive review of the various forms of existing LVLM attacks.
arXiv Detail & Related papers (2024-07-10T06:57:58Z)
- Watch the Watcher! Backdoor Attacks on Security-Enhancing Diffusion Models [65.30406788716104]
This work investigates the vulnerabilities of security-enhancing diffusion models.
We demonstrate that these models are highly susceptible to DIFF2, a simple yet effective backdoor attack.
Case studies show that DIFF2 can significantly reduce both post-purification and certified accuracy across benchmark datasets and models.
arXiv Detail & Related papers (2024-06-14T02:39:43Z)
- On Evaluating Adversarial Robustness of Large Vision-Language Models [64.66104342002882]
We evaluate the robustness of large vision-language models (VLMs) in the most realistic and high-risk setting.
In particular, we first craft targeted adversarial examples against pretrained models such as CLIP and BLIP.
Black-box queries on these VLMs can further improve the effectiveness of targeted evasion.
arXiv Detail & Related papers (2023-05-26T13:49:44Z)
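As a concrete illustration of the transfer setting in the entry above (targeted adversarial examples crafted against pretrained encoders such as CLIP), here is a minimal targeted PGD sketch in PyTorch. The checkpoint name, hyperparameters, and loss are assumptions for illustration, not that paper's actual attack.

```python
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

# Assumed stand-in checkpoint; any CLIP variant with image/text towers works.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def targeted_pgd(pixel_values, target_caption, eps=8/255, alpha=1/255, steps=40):
    """pixel_values: CLIP-preprocessed image tensor (1, 3, 224, 224).
    For simplicity, eps is applied in preprocessed (normalized) pixel space."""
    with torch.no_grad():
        tok = proc(text=[target_caption], return_tensors="pt", padding=True)
        t_feat = F.normalize(model.get_text_features(**tok), dim=-1)
    delta = torch.zeros_like(pixel_values, requires_grad=True)
    for _ in range(steps):
        i_feat = F.normalize(
            model.get_image_features(pixel_values=pixel_values + delta), dim=-1)
        loss = -(i_feat * t_feat).sum()  # maximize similarity to the target caption
        loss.backward()
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
        delta.grad = None
    return (pixel_values + delta).detach()
```

As the entry notes, once such an example transfers, black-box queries against the deployed VLM can be used to further refine the perturbation.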
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.