Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression
- URL: http://arxiv.org/abs/2511.18504v1
- Date: Sun, 23 Nov 2025 15:43:00 GMT
- Title: Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression
- Authors: Md Tasnin Tanvir, Soumitra Das, Sk Md Abidar Rahaman, Ali Shiri Sichani
- Abstract summary: Two adaptive compression techniques are proposed to integrate algorithmic innovations with hardware-aware optimizations. On event-based vision tasks, STTF reduces average token count by 84%, and ANC cuts FLOPs by up to 90% in low-motion scenes.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The demand for edge AI in vision-language tasks requires models that achieve real-time performance on resource-constrained devices with limited power and memory. This paper proposes two adaptive compression techniques -- Sparse Temporal Token Fusion (STTF) and Adaptive Neural Compression (ANC) -- that integrate algorithmic innovations with hardware-aware optimizations. Unlike previous approaches relying on static pruning or uniform scaling, STTF dynamically reuses visual tokens through event-driven change detection, while ANC conditionally activates encoder branches via a learned router, enabling fine-grained adaptation to scene complexity. Our 3B-parameter TinyGPT-STTF achieves CIDEr 131.2, BLEU-4 0.38, METEOR 0.31, and ROUGE-L 0.56 on the COCO 2017 test set, surpassing LLaVA-1.5 7B by 17.6 CIDEr points while using 2.3x fewer parameters and 62x fewer on-device FLOPs. TinyGPT-ANC reaches CIDEr 128.5. On event-based vision tasks, STTF reduces average token count by 84% (from 196 to 31 tokens) while preserving 95.6% accuracy on the DVS128 Gesture dataset, and ANC cuts FLOPs by up to 90% in low-motion scenes. Compared to strong baselines, our models improve accuracy by up to 4.4% and reduce latency by up to 13x. These results enable efficient deployment of capable vision-language models on real-world edge devices.
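The abstract describes STTF as reusing visual tokens via event-driven change detection: patches whose pixel content has not changed between frames keep their cached tokens, and only changed patches are re-encoded. A minimal sketch of that idea, assuming a per-patch mean-absolute-difference trigger and a stand-in `encode_patch` function (the paper's actual change detector, patch size, and threshold are not specified here):

```python
import numpy as np

def sttf_update(prev_frame, frame, cached_tokens, encode_patch,
                patch=14, thresh=0.05):
    """Sparse Temporal Token Fusion sketch: re-encode only patches whose
    content changed since the previous frame; reuse cached tokens elsewhere.

    `encode_patch` stands in for the vision encoder's patch embedding;
    the threshold and patch size are illustrative assumptions.
    """
    H, W = frame.shape[:2]
    cols = W // patch
    tokens = cached_tokens.copy()
    refreshed = 0
    for i, y in enumerate(range(0, H, patch)):
        for j, x in enumerate(range(0, W, patch)):
            new = frame[y:y + patch, x:x + patch]
            old = prev_frame[y:y + patch, x:x + patch]
            # Event-driven change detection: mean absolute pixel difference.
            if np.abs(new - old).mean() > thresh:
                tokens[i * cols + j] = encode_patch(new)
                refreshed += 1
    return tokens, refreshed
```

On a 196x196 frame with 14-pixel patches this yields the 196-token grid the abstract mentions; when only a few patches change, `refreshed` stays small, which is the source of the reported 84% average token reduction.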
Related papers
- EdgeFlex-Transformer: Transformer Inference for Edge Devices [2.1130318406254074]
We propose a lightweight yet effective multi-stage optimization pipeline designed to compress and accelerate Vision Transformers (ViTs). Our methodology combines activation profiling, memory-aware pruning, selective mixed-precision execution, and activation-aware quantization (AWQ) to reduce the model's memory footprint without requiring costly retraining or task-specific fine-tuning. Experiments on CIFAR-10 demonstrate that the fully optimized model achieves a 76% reduction in peak memory usage and over 6x lower latency, while retaining or even improving accuracy compared to the original FP32 baseline.
arXiv Detail & Related papers (2025-12-17T21:45:12Z) - FastBoost: Progressive Attention with Dynamic Scaling for Efficient Deep Learning [0.0]
We present FastBoost, a parameter-efficient neural architecture that achieves state-of-the-art performance on CIFAR benchmarks. Our design establishes new efficiency frontiers: 95.57% accuracy on CIFAR-10 with 0.85M parameters, and 93.80% with 0.37M parameters. By integrating DSPA with enhanced MBConv blocks, FastBoost achieves a 2.1x parameter reduction over MobileNetV3 while improving accuracy by +3.2 percentage points on CIFAR-10.
arXiv Detail & Related papers (2025-11-02T17:51:36Z) - REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy introducing a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z) - A Novel Compression Framework for YOLOv8: Achieving Real-Time Aerial Object Detection on Edge Devices via Structured Pruning and Channel-Wise Distillation [0.0]
We propose a novel three-stage compression pipeline for the YOLOv8 object detection model, combining sparsity-aware training, structured channel pruning, and Channel-Wise Knowledge Distillation (CWD). Experiments on the VisDrone dataset demonstrate the effectiveness of our approach across multiple YOLOv8 variants.
arXiv Detail & Related papers (2025-09-16T10:11:59Z) - Speedy MASt3R [68.47052557089631]
MASt3R redefines image matching as a 3D task by leveraging DUSt3R and introducing a fast reciprocal matching scheme. Fast MASt3R achieves a 54% reduction in inference time (198 ms to 91 ms per image pair) without sacrificing accuracy. This advancement enables real-time 3D understanding, benefiting applications like mixed reality navigation and large-scale 3D scene reconstruction.
arXiv Detail & Related papers (2025-03-13T03:56:22Z) - SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer [49.1761733723771]
This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. We introduce three key innovations: Efficient Training Scaling, Model Depth Pruning, and Inference-time Scaling. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.81 on GenEval, which can be further improved to 0.96 through inference scaling with VILA-Judge.
arXiv Detail & Related papers (2025-01-30T15:31:48Z) - SparseDiT: Token Sparsification for Efficient Diffusion Transformer [33.91304273754431]
Diffusion Transformers (DiT) are renowned for their impressive generative performance. However, DiT is constrained by considerable computational costs due to the quadratic complexity of self-attention and the extensive sampling steps required. We introduce SparseDiT, a novel framework that implements token sparsification across spatial and temporal dimensions.
arXiv Detail & Related papers (2024-12-08T18:59:16Z) - ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38x higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - SeTformer is What You Need for Vision and Language [26.036537788653373]
Self-optimal Transport (SeT) is a novel transformer for achieving better performance and computational efficiency.
SeTformer achieves impressive top-1 accuracies of 84.7% and 86.2% on ImageNet-1K.
SeTformer also achieves state-of-the-art results in language modeling on the GLUE benchmark.
arXiv Detail & Related papers (2024-01-07T16:52:49Z) - Patch-Level Contrasting without Patch Correspondence for Accurate and Dense Contrastive Representation Learning [79.43940012723539]
ADCLR is a self-supervised learning framework for learning accurate and dense vision representation.
Our approach achieves new state-of-the-art performance for contrastive methods.
arXiv Detail & Related papers (2023-06-23T07:38:09Z) - AdaViT: Adaptive Tokens for Efficient Vision Transformer [91.88404546243113]
We introduce AdaViT, a method that adaptively adjusts the inference cost of vision transformer (ViT) for images of different complexity.
AdaViT achieves this by automatically reducing the number of tokens in vision transformers that are processed in the network as inference proceeds.
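The progressive reduction described above, keeping fewer tokens as inference proceeds, can be sketched as follows. This is a hedged illustration, not AdaViT's actual mechanism: `scores_fn` is a hypothetical per-token importance score (in the real method, token relevance is learned), and the keep ratios are arbitrary:

```python
import numpy as np

def progressive_token_pruning(tokens, scores_fn, keep_ratios):
    """Sketch of AdaViT-style token reduction: after each (stand-in)
    layer, retain only the highest-scoring tokens, so later layers
    process progressively fewer tokens.

    `scores_fn` is an assumed per-token importance function; the real
    method learns which tokens to halt rather than using a fixed score.
    """
    for ratio in keep_ratios:
        k = max(1, int(len(tokens) * ratio))
        scores = scores_fn(tokens)
        keep = np.argsort(scores)[-k:]   # indices of the top-k tokens
        tokens = tokens[np.sort(keep)]   # preserve original token order
    return tokens
```

With keep ratios of 0.5 per stage, a 16-token input shrinks to 8 and then 4 tokens, which is how such methods trade accuracy for inference cost on easy inputs.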
arXiv Detail & Related papers (2021-12-14T18:56:07Z) - Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.