V"Mean"ba: Visual State Space Models only need 1 hidden dimension
- URL: http://arxiv.org/abs/2412.16602v1
- Date: Sat, 21 Dec 2024 12:27:07 GMT
- Title: V"Mean"ba: Visual State Space Models only need 1 hidden dimension
- Authors: Tien-Yu Chi, Hung-Yueh Chiang, Chi-Chih Chang, Ning-Chi Huang, Kai-Chiang Wu,
- Abstract summary: State Space Models (SSMs) have emerged as a solution by introducing a linear recurrence mechanism.
We introduce VMeanba, a training-free compression method that eliminates the channel dimension in SSMs using mean operations.
We show that VMeanba achieves up to a 1.12x speedup with less than 3% accuracy loss.
- Score: 0.7864304771129751
- Abstract: Vision transformers dominate image processing tasks due to their superior performance. However, the quadratic complexity of self-attention limits the scalability of these systems and their deployment on resource-constrained devices. State Space Models (SSMs) have emerged as a solution by introducing a linear recurrence mechanism, which reduces the complexity of sequence modeling from quadratic to linear. Recently, SSMs have been extended to high-resolution vision tasks. Nonetheless, the linear recurrence mechanism struggles to fully utilize matrix multiplication units on modern hardware, resulting in a computational bottleneck. We address this issue by introducing VMeanba, a training-free compression method that eliminates the channel dimension in SSMs using mean operations. Our key observation is that the output activations of SSM blocks exhibit low variances across channels. VMeanba leverages this property to optimize computation by averaging activation maps across the channel dimension, reducing computational overhead without compromising accuracy. Evaluations on image classification and semantic segmentation tasks demonstrate that VMeanba achieves up to a 1.12x speedup with less than a 3% accuracy loss. When combined with 40% unstructured pruning, the accuracy drop remains under 3%.
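The abstract describes the mechanism concretely enough to sketch. Below is a minimal illustration of the idea as stated, where the function name, tensor shapes, and the `ssm_scan` stand-in are assumptions rather than the authors' code: average activations over the channel dimension, run the recurrence once, and broadcast the result back.

```python
import torch

def vmeanba_scan(x, ssm_scan):
    # Collapse the channel dimension before the expensive recurrence:
    # one scan over the channel mean instead of one scan per channel.
    # x: (batch, seq_len, channels); ssm_scan: (b, l, 1) -> (b, l, 1)
    b, l, c = x.shape
    x_mean = x.mean(dim=-1, keepdim=True)   # (b, l, 1), channel dim removed
    y_mean = ssm_scan(x_mean)               # a single recurrence pass
    return y_mean.expand(b, l, c)           # broadcast back to all channels

# The approximation only makes sense if channel variance is small, which
# is the paper's key observation about SSM block output activations:
act = torch.randn(2, 128, 64) * 0.01 + 1.0  # stand-in low-variance activations
print(act.var(dim=-1).mean())               # near zero => averaging is safe
```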
Related papers
- MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices [24.1144641404561]
We propose a scheme for exact attention inference acceleration on memory-constrained edge accelerators.
We show up to a 2.75x speedup and a 54% reduction in energy consumption compared to the state-of-the-art attention fusion method (FLAT) in the edge computing scenario.
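The summary gives no implementation detail, but exact attention under a tight memory budget is typically built on tiled execution with an online softmax. A generic sketch of that building block follows; the tile size and single-head layout are illustrative assumptions, not the paper's actual scheduler.

```python
import numpy as np

def tiled_attention(q, k, v, tile=64):
    # Exact attention computed tile by tile, so the full (n, m) score
    # matrix is never materialized; a running max and denominator keep
    # the softmax mathematically exact. q: (n, d), k: (m, d), v: (m, dv).
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))
    m_run = np.full(n, -np.inf)           # running row-wise max
    l_run = np.zeros(n)                   # running softmax denominator
    for s in range(0, k.shape[0], tile):  # stream over key/value tiles
        ks, vs = k[s:s + tile], v[s:s + tile]
        scores = (q @ ks.T) * scale
        m_new = np.maximum(m_run, scores.max(axis=1))
        alpha = np.exp(m_run - m_new)     # rescale previously seen tiles
        p = np.exp(scores - m_new[:, None])
        l_run = l_run * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ vs
        m_run = m_new
    return out / l_run[:, None]
```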
arXiv Detail & Related papers (2024-11-20T19:44:26Z)
- ALOcc: Adaptive Lifting-based 3D Semantic Occupancy and Cost Volume-based Flow Prediction [89.89610257714006]
Existing methods prioritize higher accuracy to cater to the demands of these tasks.
We introduce a series of targeted improvements for 3D semantic occupancy prediction and flow estimation.
Our purely temporal architecture framework, named ALOcc, achieves an optimal tradeoff between speed and accuracy.
arXiv Detail & Related papers (2024-11-12T11:32:56Z)
- Cross-Scan Mamba with Masked Training for Robust Spectral Imaging [51.557804095896174]
We propose Cross-Scanning Mamba (CS-Mamba), which employs a Spatial-Spectral SSM for global-local balanced context encoding.
Experimental results show that CS-Mamba achieves state-of-the-art performance and that the masked training method better reconstructs smooth features, improving visual quality.
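The summary does not spell out the scanning scheme, so the sketch below shows only the generic cross-scan flattening that visual SSMs of this family use: unroll the 2D feature map along several orders and feed each sequence to a 1D state space model. The four-direction choice is an assumption borrowed from earlier visual Mamba designs, not necessarily CS-Mamba's spatial-spectral variant.

```python
import torch

def cross_scan(x):
    # x: (b, c, h, w) feature map -> (b, 4, c, h*w) token sequences,
    # one per scan direction; each is then processed by a 1D SSM and
    # the direction-wise outputs are merged.
    row = x.flatten(2)                   # row-major order
    col = x.transpose(2, 3).flatten(2)   # column-major order
    return torch.stack([row, col, row.flip(-1), col.flip(-1)], dim=1)
```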
arXiv Detail & Related papers (2024-08-01T15:14:10Z)
- Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions [27.111140222002653]
This paper investigates the role of CLIP image embeddings within the Stable Video Diffusion (SVD) framework.
We introduce VCUT, a training-free approach optimized for efficiency within the SVD architecture.
The implementation of VCUT leads to a reduction of up to 322T Multiply-Accumulate Operations (MACs) per video and a decrease in model parameters by up to 50M, achieving a 20% reduction in latency compared to the baseline.
arXiv Detail & Related papers (2024-07-27T08:21:14Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
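Distributing features evenly across a limited number of clusters is exactly what a Sinkhorn-Knopp normalization enforces, so a sketch of that standard step is shown below; the temperature, iteration count, and shapes are illustrative, not SIGMA's exact objective.

```python
import torch

def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    # scores: (n_features, n_clusters) similarity logits. Alternating
    # row/column normalization pushes the soft assignment toward uniform
    # cluster usage, i.e. features get spread evenly across clusters.
    q = torch.exp(scores / eps)
    q = q / q.sum()
    n, k = q.shape
    for _ in range(n_iters):
        q = q / q.sum(dim=0, keepdim=True) / k  # equalize cluster mass
        q = q / q.sum(dim=1, keepdim=True) / n  # renormalize per feature
    return q * n                                # each row sums to 1
```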
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
We propose CHELA, which replaces state space models with short-long convolutions and implements linear attention in a divide-and-conquer manner.
Our experiments on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method.
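As a rough illustration of what replacing an SSM with short-long convolutions could look like, the sketch below pairs a small-kernel depthwise convolution for local detail with a large-kernel one for long-range context; the kernel sizes and the additive combination are assumptions, not CHELA's actual block.

```python
import torch
import torch.nn as nn

class ShortLongConv(nn.Module):
    # A minimal short-long convolution mixer over sequences: the short
    # depthwise conv captures local patterns, the long depthwise conv
    # provides a large receptive field at hardware-friendly cost.
    def __init__(self, channels, short_k=3, long_k=65):
        super().__init__()
        self.short = nn.Conv1d(channels, channels, short_k,
                               padding=short_k // 2, groups=channels)
        self.long = nn.Conv1d(channels, channels, long_k,
                              padding=long_k // 2, groups=channels)

    def forward(self, x):  # x: (batch, channels, seq_len)
        return self.short(x) + self.long(x)
```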
arXiv Detail & Related papers (2024-06-12T12:12:38Z)
- Look-Around Before You Leap: High-Frequency Injected Transformer for Image Restoration [46.96362010335177]
In this paper, we propose HIT, a simple yet effective High-frequency Injected Transformer for image restoration.
Specifically, we design a window-wise injection module (WIM), which incorporates abundant high-frequency details into the feature map, to provide reliable references for restoring high-quality images.
In addition, we introduce a spatial enhancement unit (SEU) to preserve essential spatial relationships that may be lost due to the computations carried out across channel dimensions in the BIM.
arXiv Detail & Related papers (2024-03-30T08:05:00Z)
- Low-Resolution Self-Attention for Semantic Segmentation [93.30597515880079]
We introduce the Low-Resolution Self-Attention (LRSA) mechanism to capture global context at a significantly reduced computational cost.
Our approach involves computing self-attention in a fixed low-resolution space regardless of the input image's resolution.
We demonstrate the effectiveness of our LRSA approach by building the LRFormer, a vision transformer with an encoder-decoder structure.
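The mechanism is stated plainly enough to sketch: pool features to a fixed grid, run self-attention there, and map the result back, so attention cost stays constant regardless of input resolution. The pool size, head count, and residual upsampling path below are assumptions, not LRFormer's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowResSelfAttention(nn.Module):
    # Self-attention in a fixed low-resolution space: the (p*p)^2
    # attention cost does not grow with the input's h and w.
    def __init__(self, dim, pool_size=16, heads=4):
        super().__init__()
        self.pool_size = pool_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (b, c, h, w)
        b, c, h, w = x.shape
        low = F.adaptive_avg_pool2d(x, self.pool_size)  # fixed p x p grid
        tokens = low.flatten(2).transpose(1, 2)         # (b, p*p, c)
        attended, _ = self.attn(tokens, tokens, tokens)
        low = attended.transpose(1, 2).reshape(b, c, self.pool_size,
                                               self.pool_size)
        return x + F.interpolate(low, size=(h, w), mode='bilinear',
                                 align_corners=False)   # back to full res
```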
arXiv Detail & Related papers (2023-10-08T06:10:09Z)
- Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INNs) can significantly increase upscaling accuracy by jointly optimizing the downscaling and upscaling cycle.
In this work, a simple and effective invertible arbitrary rescaling network (IARN) is proposed to achieve arbitrary image rescaling by training only one model.
It is shown to achieve state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising perceptual quality in LR outputs.
arXiv Detail & Related papers (2022-09-26T22:22:30Z)
- Indirect-Instant Attention Optimization for Crowd Counting in Dense Scenes [3.8950254639440094]
We propose an Indirect-Instant Attention Optimization (IIAO) module based on SoftMax-Attention.
The special transformation yields relatively coarse features, and the predictive fallibility of regions varies with the crowd density distribution.
We tailor the Regional Correlation Loss (RCLoss) to retrieve continuous error-prone regions and smooth spatial information.
arXiv Detail & Related papers (2022-06-12T03:29:50Z)
- Algorithm to Compilation Co-design: An Integrated View of Neural Network Sparsity [0.8566457170664925]
We apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model.
We study relationships between modeling decisions and their direct impact on sparsity-enhanced execution.
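For the unstructured case, the standard recipe is magnitude pruning, the same kind of 40% unstructured pruning the VMeanba abstract above combines with its method. A minimal sketch follows; the sparsity level and per-tensor scope are illustrative.

```python
import torch

def magnitude_prune(weight, sparsity=0.4):
    # Zero out the smallest-magnitude fraction of a weight tensor and
    # return the pruned copy together with its binary keep-mask.
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone(), torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask, mask

w = torch.randn(768, 768)                # e.g. one BERT attention weight matrix
w_pruned, mask = magnitude_prune(w)
print(1.0 - mask.float().mean().item())  # ~0.4 achieved sparsity
```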
arXiv Detail & Related papers (2021-06-16T15:13:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.