On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
- URL: http://arxiv.org/abs/2507.22398v1
- Date: Wed, 30 Jul 2025 05:41:29 GMT
- Title: On the Reliability of Vision-Language Models Under Adversarial Frequency-Domain Perturbations
- Authors: Jordan Vice, Naveed Akhtar, Yansong Gao, Richard Hartley, Ajmal Mian
- Abstract summary: Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning. We show how subtle frequency-domain perturbations undermine authenticity/DeepFake detection and automated image captioning tasks.
- Score: 53.611451075703314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) are increasingly used as perceptual modules for visual content reasoning, including captioning and DeepFake detection. In this work, we expose a critical vulnerability of VLMs to subtle, structured perturbations in the frequency domain. Specifically, we highlight how these feature transformations undermine authenticity/DeepFake detection and automated image captioning tasks. We design targeted image transformations operating in the frequency domain that systematically adjust VLM outputs when models are shown frequency-perturbed real and synthetic images. We demonstrate that the perturbation injection method generalizes across five state-of-the-art VLMs, including Qwen2/2.5 and BLIP models of different parameter counts. Experiments across ten real and generated image datasets reveal that VLM judgments are sensitive to frequency-based cues and may not wholly align with semantic content. Crucially, we show that visually imperceptible spatial-frequency transformations expose the fragility of VLMs deployed for automated image captioning and authenticity detection. Our findings under realistic, black-box constraints challenge the reliability of VLMs and underscore the need for robust multimodal perception systems.
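A minimal sketch of the kind of frequency-domain image perturbation discussed above (not the authors' exact method): the image is moved into the Fourier domain, an annular mid-frequency band is gently amplified, and the result is transformed back. The band limits and perturbation strength below are illustrative assumptions.

```python
import numpy as np

def frequency_perturb(image: np.ndarray, low: float = 0.25, high: float = 0.5,
                      strength: float = 0.02) -> np.ndarray:
    """Apply a subtle, structured mid-frequency perturbation to a grayscale image (H, W)."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))               # centre the spectrum
    h, w = image.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)  # normalised frequency radius
    band = (radius >= low) & (radius < high)                     # annular band to perturb
    spectrum[band] *= (1.0 + strength)                           # small, visually imperceptible change
    perturbed = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
    return np.clip(perturbed, 0.0, 255.0)
```

A colour image would be handled per channel, and the same idea extends to attenuating rather than amplifying a chosen band.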
Related papers
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z) - Unlocking the Capabilities of Large Vision-Language Models for Generalizable and Explainable Deepfake Detection [18.125287697902813]
Current Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in understanding multimodal data. We present a novel framework that unlocks LVLMs' potential capabilities for deepfake detection.
arXiv Detail & Related papers (2025-03-19T03:20:03Z) - Semantic-Clipping: Efficient Vision-Language Modeling with Semantic-Guided Visual Selection [53.558449071113245]
Vision-Language Models (VLMs) leverage aligned visual encoders to transform images into visual tokens, allowing them to be processed similarly to text by the backbone large language model (LLM). Recent advancements in vision-language modeling introduce image cropping techniques that feed all encoded sub-images into the model. We propose a lightweight, universal framework that seamlessly integrates with existing VLMs to enhance their ability to process fine-grained details.
arXiv Detail & Related papers (2025-03-14T18:33:31Z) - Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models [0.6715525121432597]
This research presents a novel vision-language model (VLM) framework to enhance feature extraction, scalability, and efficiency. We evaluate the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
arXiv Detail & Related papers (2025-03-08T01:22:10Z) - FE-UNet: Frequency Domain Enhanced U-Net for Low-Frequency Information-Rich Image Segmentation [48.034848981295525]
We address the differences in frequency band sensitivity between CNNs and the human visual system. We propose a wavelet adaptive spectrum fusion (WASF) method inspired by biological vision mechanisms to balance cross-frequency image features. We develop the FE-UNet model, which employs a SAM2 backbone network and incorporates fine-tuned Hiera-Large modules to ensure segmentation accuracy.
arXiv Detail & Related papers (2025-02-06T07:24:34Z) - VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models [20.92507667350599]
We introduce a verbalized learning framework named VERA that enables vision-language models to perform video anomaly detection. VERA decomposes the complex reasoning required for VAD into reflections on simpler, more focused guiding questions. During inference, VERA embeds the learned questions into model prompts to guide VLMs in generating segment-level anomaly scores.
arXiv Detail & Related papers (2024-12-02T04:10:14Z) - VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models [19.291697178628546]
Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks.
In this work, we propose an eye examination process to investigate how a VLM perceives images.
arXiv Detail & Related papers (2024-09-23T07:15:29Z) - BRAVE: Broadening the visual encoding of vision-language models [48.41146184575914]
Vision-language models (VLMs) are composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks.
Despite remarkable progress, VLMs are subject to several shortcomings due to the limited capabilities of vision encoders.
We introduce BRAVE, which consolidates features from multiple frozen encoders into a more versatile representation that can be directly fed as the input to a frozen LM.
arXiv Detail & Related papers (2024-04-10T17:59:45Z) - MouSi: Poly-Visual-Expert Vision-Language Models [132.58949014605477]
This paper proposes the use of ensemble experts technique to synergize the capabilities of individual visual encoders.
This technique introduces a fusion network to unify the processing of outputs from different visual experts.
In our implementation, this technique significantly reduces the positional occupancy in models like SAM, from a substantial 4096 to a more efficient and manageable 64 or even down to 1.
arXiv Detail & Related papers (2024-01-30T18:09:11Z) - Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity [45.86789047206224]
This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition.
Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity.
Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data.
arXiv Detail & Related papers (2023-06-28T09:29:06Z) - Masked Frequency Modeling for Self-Supervised Visual Pre-Training [102.89756957704138]
We present Masked Frequency Modeling (MFM), a unified frequency-domain-based approach for self-supervised pre-training of visual models.
MFM first masks out a portion of frequency components of the input image and then predicts the missing frequencies on the frequency spectrum.
For the first time, MFM demonstrates that, for both ViT and CNN, a simple non-Siamese framework can learn meaningful representations even using none of the following: (i) extra data, (ii) extra model, (iii) mask token.
arXiv Detail & Related papers (2022-06-15T17:58:30Z)
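As a rough illustration of the frequency-masking step summarised for MFM above (an assumption-based sketch, not the authors' code): a disc of low-frequency components is zeroed out so that a model can be trained to predict the missing part of the spectrum. The cutoff value is an arbitrary illustrative choice.

```python
import numpy as np

def mask_low_frequencies(image: np.ndarray, cutoff: float = 0.15):
    """Zero out low-frequency components of a grayscale image (H, W); return the masked image and the mask."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)  # normalised frequency radius
    mask = radius < cutoff                                       # low-frequency disc to hide
    spectrum[mask] = 0.0                                         # remove it from the spectrum
    masked_image = np.fft.ifft2(np.fft.ifftshift(spectrum)).real
    return masked_image, mask
```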