Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
- URL: http://arxiv.org/abs/2404.16557v1
- Date: Thu, 25 Apr 2024 12:11:38 GMT
- Title: Energy-Latency Manipulation of Multi-modal Large Language Models via Verbose Samples
- Authors: Kuofeng Gao, Jindong Gu, Yang Bai, Shu-Tao Xia, Philip Torr, Wei Liu, Zhifeng Li
- Abstract summary: In this paper, we aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation.
We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences.
Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.
- Score: 63.9198662100875
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the exceptional performance of multi-modal large language models (MLLMs), their deployment requires substantial computational resources. If malicious users induce high energy consumption and latency (energy-latency cost), they can exhaust computational resources and harm service availability. In this paper, we investigate this vulnerability for MLLMs, particularly image-based and video-based ones, and aim to induce high energy-latency cost during inference by crafting an imperceptible perturbation. We find that high energy-latency cost can be manipulated by maximizing the length of generated sequences, which motivates us to propose verbose samples, including verbose images and videos. Concretely, two modality-non-specific losses are proposed: a loss to delay the end-of-sequence (EOS) token and an uncertainty loss to increase the uncertainty over each generated token. In addition, increasing the diversity of the output encourages longer responses by raising complexity, which inspires the following modality-specific losses. For verbose images, a token diversity loss is proposed to promote diverse hidden states. For verbose videos, a frame feature diversity loss is proposed to increase the feature diversity among frames. To balance these losses, we propose a temporal weight adjustment algorithm. Experiments demonstrate that our verbose samples can largely extend the length of generated sequences.
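The losses described in the abstract can be sketched roughly as follows. This is an illustrative reconstruction from the abstract alone, not the authors' implementation: the function names, the NumPy setting, and the exact loss forms (mean EOS probability, negative mean entropy, mean pairwise cosine similarity) are assumptions.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def eos_delay_loss(logits, eos_id):
    """Mean probability assigned to the EOS token across positions.
    Minimizing this discourages the model from ending the sequence early."""
    probs = softmax(logits)
    return probs[:, eos_id].mean()

def uncertainty_loss(logits):
    """Negative mean entropy of the per-token distributions.
    Minimizing this (i.e. maximizing entropy) makes each next-token
    prediction less certain, encouraging longer generation."""
    probs = softmax(logits)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return -entropy.mean()

def frame_diversity_loss(frame_feats):
    """Mean pairwise cosine similarity among video frame features.
    Minimizing this pushes frames apart in feature space."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = f @ f.T
    mask = ~np.eye(len(f), dtype=bool)
    return sim[mask].mean()
```

In the attack setting, gradients of a weighted sum of these losses would be backpropagated to the input perturbation rather than the model weights; the temporal weight adjustment the abstract mentions would rebalance the weights over optimization steps.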
Related papers
- Multi-scale Bottleneck Transformer for Weakly Supervised Multimodal Violence Detection [9.145305176998447]
Weakly supervised multimodal violence detection aims to learn a violence detection model by leveraging multiple modalities.
We propose a new weakly supervised MVD method that explicitly addresses the challenges of information redundancy, modality imbalance, and modality asynchrony.
Experiments on the largest-scale XD-Violence dataset demonstrate that the proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-05-08T15:27:08Z) - Efficient and Economic Large Language Model Inference with Attention Offloading [11.698376311689456]
Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but introduce significant challenges in real-world serving.
This mismatch arises from the autoregressive nature of LLMs, where the generation phase comprises operators with varying resource demands.
To enhance the efficiency and cost-effectiveness of LLM serving, we introduce the concept of attention offloading.
arXiv Detail & Related papers (2024-05-03T02:15:15Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z) - Inducing High Energy-Latency of Large Vision-Language Models with Verbose Images [63.91986621008751]
Large vision-language models (VLMs) have achieved exceptional performance across various multi-modal tasks.
In this paper, we aim to induce high energy-latency cost during inference of VLMs.
We propose verbose images, with the goal of crafting an imperceptible perturbation to induce VLMs to generate long sentences.
arXiv Detail & Related papers (2024-01-20T08:46:06Z) - Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z) - Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
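The blockwise computation at the heart of Ring Attention can be illustrated on a single host. The sketch below (the function name and NumPy setting are my own, not the paper's code) computes exact single-head attention one key/value block at a time using an online softmax, never materializing the full score matrix; this is the per-device primitive that Ring Attention rotates around a ring of devices.

```python
import numpy as np

def blockwise_attention(q, k, v, block_size):
    """Exact attention over key/value blocks with a running (online)
    softmax: only a (seq_q x block) score tile exists at any time."""
    d = q.shape[1]
    out = np.zeros((q.shape[0], v.shape[1]))   # unnormalized accumulator
    running_max = np.full(q.shape[0], -np.inf) # per-query running max score
    running_den = np.zeros(q.shape[0])         # per-query softmax denominator
    for start in range(0, k.shape[0], block_size):
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        scores = q @ kb.T / np.sqrt(d)         # (seq_q, block)
        new_max = np.maximum(running_max, scores.max(axis=1))
        scale = np.exp(running_max - new_max)  # rescale previous partials
        exp_scores = np.exp(scores - new_max[:, None])
        out = out * scale[:, None] + exp_scores @ vb
        running_den = running_den * scale + exp_scores.sum(axis=1)
        running_max = new_max
    return out / running_den[:, None]
```

In the distributed version, each device holds one block of keys/values permanently and the blocks are passed around the ring while queries stay put, so memory per device stays constant as total sequence length grows.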
arXiv Detail & Related papers (2023-10-03T08:44:50Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - RT-LM: Uncertainty-Aware Resource Management for Real-Time Inference of Language Models [12.947537874888717]
Varied inference latency, a consequence of the uncertainty intrinsic to language, can lead to computational inefficiency.
We present RT-LM, an uncertainty-aware resource management ecosystem for real-time inference of LMs.
We show that RT-LM can significantly reduce the average response time and improve throughput while incurring a rather small runtime overhead.
arXiv Detail & Related papers (2023-09-12T22:22:10Z) - Multi-Granularity Network with Modal Attention for Dense Affective Understanding [11.076925361793556]
In the recent EEV challenge, a dense affective understanding task is proposed and requires frame-level affective prediction.
We propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for better description of the target frame.
The proposed method achieves the correlation score of 0.02292 in the EEV challenge.
arXiv Detail & Related papers (2021-06-18T07:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.