VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service
- URL: http://arxiv.org/abs/2506.15755v1
- Date: Wed, 18 Jun 2025 08:57:17 GMT
- Title: VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service
- Authors: Xiasi Wang, Tianliang Yao, Simin Chen, Runqi Wang, Lei YE, Kuofeng Gao, Yi Huang, Yuan Yao,
- Abstract summary: VLMInferSlow is a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. We show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%.
- Score: 11.715844075786958
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have demonstrated great potential in real-world applications. While existing research primarily focuses on improving their accuracy, the efficiency remains underexplored. Given the real-time demands of many applications and the high inference overhead of VLMs, efficiency robustness is a critical issue. However, previous studies evaluate efficiency robustness under unrealistic assumptions, requiring access to the model architecture and parameters -- an impractical scenario in ML-as-a-service settings, where VLMs are deployed via inference APIs. To address this gap, we propose VLMInferSlow, a novel approach for evaluating VLM efficiency robustness in a realistic black-box setting. VLMInferSlow incorporates fine-grained efficiency modeling tailored to VLM inference and leverages zero-order optimization to search for adversarial examples. Experimental results show that VLMInferSlow generates adversarial images with imperceptible perturbations, increasing the computational cost by up to 128.47%. We hope this research raises the community's awareness about the efficiency robustness of VLMs.
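The abstract describes searching for cost-inflating adversarial images via zeroth-order optimization against an inference API. Below is a minimal sketch of what such a black-box attack loop could look like; the `query_api` wrapper, the token-count objective, and all hyperparameters are illustrative assumptions, not the paper's actual VLMInferSlow procedure (which uses fine-grained efficiency modeling tailored to VLM inference).

```python
# Hedged sketch: black-box efficiency attack via zeroth-order gradient estimation.
# `query_api`, the objective, and hyperparameters are assumptions for illustration.
import numpy as np

def efficiency_objective(image, query_api):
    """Proxy for inference cost: number of tokens the black-box VLM generates."""
    return len(query_api(image))  # assume query_api returns the generated token sequence

def zeroth_order_attack(image, query_api, steps=50, sigma=0.01, samples=20,
                        lr=0.005, epsilon=8 / 255):
    """Search for an imperceptible perturbation that increases generation length."""
    adv = image.copy()  # image assumed to be a float array in [0, 1]
    for _ in range(steps):
        grad_est = np.zeros_like(adv)
        for _ in range(samples):
            noise = np.random.randn(*adv.shape)
            # Two-point finite-difference estimate of the objective's gradient.
            f_plus = efficiency_objective(np.clip(adv + sigma * noise, 0, 1), query_api)
            f_minus = efficiency_objective(np.clip(adv - sigma * noise, 0, 1), query_api)
            grad_est += (f_plus - f_minus) / (2 * sigma) * noise
        grad_est /= samples
        # Ascend the efficiency objective, then project back into the L-inf ball
        # around the original image to keep the perturbation imperceptible.
        adv = adv + lr * np.sign(grad_est)
        adv = np.clip(adv, image - epsilon, image + epsilon)
        adv = np.clip(adv, 0, 1)
    return adv
```

Because only queries to the inference API are needed, this kind of loop requires no access to model architecture or parameters, which is the black-box constraint the paper targets.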
Related papers
- Event-Priori-Based Vision-Language Model for Efficient Visual Understanding [13.540340702321911]
Event-Priori-Based Vision-Language Model (EP-VLM) improves VLM inference efficiency by using motion priors derived from dynamic event vision.
arXiv Detail & Related papers (2025-06-09T10:45:35Z) - RARL: Improving Medical VLM Reasoning and Generalization with Reinforcement Learning and LoRA under Data and Hardware Constraints [0.0]
Reasoning-Aware Reinforcement Learning framework enhances the reasoning capabilities of medical vision-language models. Our approach fine-tunes a lightweight base model, Qwen2-VL-2B-Instruct, using Low-Rank Adaptation and custom reward functions. Experimental results show RARL significantly improves VLM performance in medical image analysis and clinical reasoning.
arXiv Detail & Related papers (2025-06-07T00:26:23Z) - EffiVLM-BENCH: A Comprehensive Benchmark for Evaluating Training-Free Acceleration in Large Vision-Language Models [19.344130974979503]
Large Vision-Language Models (LVLMs) have achieved remarkable success, yet their significant computational demands hinder practical deployment. We introduce EffiVLM-Bench, a unified framework for assessing not only absolute performance but also generalization and loyalty. Our experiments and in-depth analyses offer insights into optimal strategies for accelerating LVLMs.
arXiv Detail & Related papers (2025-05-31T09:10:43Z) - VLM Can Be a Good Assistant: Enhancing Embodied Visual Tracking with Self-Improving Vision-Language Models [34.60772103760521]
We introduce a novel framework that enhances Embodied Visual Tracking (EVT) with Vision-Language Models (VLMs). This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery.
arXiv Detail & Related papers (2025-05-27T04:53:50Z) - Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning. We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies. We also propose an improved reinforcement learning algorithm, AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv Detail & Related papers (2025-05-26T12:05:16Z) - Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z) - DWIM: Towards Tool-aware Visual Reasoning via Discrepancy-aware Workflow Generation & Instruct-Masking Tuning [57.285435980459205]
Compositional visual reasoning approaches have shown promise as more effective strategies than end-to-end VR methods. We propose DWIM: Discrepancy-aware Workflow Generation, which assesses tool usage and extracts more viable workflows for training, and Instruct-Masking fine-tuning, which guides the model to clone only effective actions, enabling the generation of more practical solutions.
arXiv Detail & Related papers (2025-03-25T01:57:59Z) - Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension [95.63899307791665]
In this paper, we present the Vision Value Model (VisVM), which can guide VLM inference-time search to generate responses with better visual comprehension.
arXiv Detail & Related papers (2024-12-04T20:35:07Z) - Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset [92.99416966226724]
We introduce Facial Identity Unlearning Benchmark (FIUBench), a novel VLM unlearning benchmark designed to robustly evaluate the effectiveness of unlearning algorithms. We apply a two-stage evaluation pipeline that is designed to precisely control the sources of information and their exposure levels. Through the evaluation of four baseline VLM unlearning algorithms within FIUBench, we find that all methods remain limited in their unlearning performance.
arXiv Detail & Related papers (2024-11-05T23:26:10Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)