Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput
- URL: http://arxiv.org/abs/2505.09498v1
- Date: Wed, 14 May 2025 15:45:17 GMT
- Title: Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput
- Authors: Bo Zhang, Shuo Li, Runhe Tian, Yang Yang, Jixin Tang, Jinhao Zhou, Lin Ma
- Abstract summary: Flash-VL 2B is a novel approach to optimizing Vision-Language Models for real-time applications. We show that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy.
- Score: 12.996955972977986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce Flash-VL 2B, a novel approach to optimizing Vision-Language Models (VLMs) for real-time applications, targeting ultra-low latency and high throughput without sacrificing accuracy. Leveraging advanced architectural enhancements and efficient computational strategies, Flash-VL 2B is designed to maximize throughput by reducing processing time while maintaining competitive performance across multiple vision-language benchmarks. Our approach includes tailored architectural choices, token compression mechanisms, data curation, training schemes, and a novel image processing technique called implicit semantic stitching that effectively balances computational load and model performance. Through extensive evaluations on 11 standard VLM benchmarks, we demonstrate that Flash-VL 2B achieves state-of-the-art results in both speed and accuracy, making it a promising solution for deployment in resource-constrained environments and large-scale real-time applications.
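The abstract names token compression but does not detail the mechanism. As a rough illustration, here is a minimal sketch of one widely used visual-token compression scheme, pixel-shuffle style spatial merging, which cuts the token count 4x by folding patch neighborhoods into wider channels before the tokens reach the language model. Shapes are illustrative assumptions, not Flash-VL's actual design:

```python
import torch

def compress_visual_tokens(x: torch.Tensor, ratio: int = 2) -> torch.Tensor:
    """Merge each ratio x ratio patch neighborhood into one wider token.

    x: (batch, h*w, dim) visual tokens on an h x w grid.
    Returns (batch, h*w / ratio**2, dim * ratio**2).
    """
    b, n, d = x.shape
    h = w = int(n ** 0.5)                       # assume a square token grid
    x = x.view(b, h, w, d)
    x = x.view(b, h // ratio, ratio, w // ratio, ratio, d)
    x = x.permute(0, 1, 3, 2, 4, 5)             # group each ratio x ratio neighborhood
    return x.reshape(b, (h // ratio) * (w // ratio), d * ratio * ratio)

tokens = torch.randn(1, 576, 1024)              # e.g. a 24x24 grid from a ViT
print(compress_visual_tokens(tokens).shape)     # torch.Size([1, 144, 4096])
```

Shorter visual sequences shrink both prefill latency and KV-cache size, which is where most of the throughput gain in small VLMs typically comes from.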
Related papers
- MagicVL-2B: Empowering Vision-Language Models on Mobile Devices with Lightweight Visual Encoders via Curriculum Learning [21.12739286363107]
Vision-Language Models (VLMs) have achieved remarkable breakthroughs in recent years, enabling a diverse array of applications in everyday life. We introduce MagicVL-2B, a novel VLM meticulously optimized for flagship smartphones. We show that MagicVL-2B matches the accuracy of current state-of-the-art models while reducing on-device power consumption by 41.1%.
arXiv Detail & Related papers (2025-08-03T01:49:08Z)
- EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models [21.42353501209045]
Vision-Language-Action (VLA) models demonstrate transformative potential for embodied intelligence but are severely hampered by high computational and memory demands. We introduce EfficientVLA, a structured and training-free inference acceleration framework. Applied to the standard VLA model CogACT, our method yields a 1.93X inference speedup and reduces FLOPs to 28.9%, with only a 0.6% success-rate drop on the SIMPLER benchmark.
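The summary gives the speedup but not the mechanism. A minimal sketch of the general training-free pruning idea: rank visual tokens by an importance score and keep only the top-k, with no fine-tuning. All names and shapes are illustrative, not EfficientVLA's actual components:

```python
import torch

def prune_visual_tokens(tokens: torch.Tensor, scores: torch.Tensor, keep: int) -> torch.Tensor:
    """tokens: (n_visual, dim); scores: (n_visual,) per-token importance
    (e.g. mean attention received). Keeps the top-`keep` tokens."""
    idx = scores.topk(keep).indices.sort().values   # preserve original token order
    return tokens[idx]

visual = torch.randn(256, 768)
scores = torch.rand(256)
print(prune_visual_tokens(visual, scores, keep=64).shape)  # torch.Size([64, 768])
```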
arXiv Detail & Related papers (2025-06-11T18:34:57Z)
- Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
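A minimal sketch of what an action-reuse gate could look like: skip the expensive VLA forward pass when consecutive observations barely change. The cosine-similarity criterion and threshold are assumptions for illustration, not FlashVLA's actual trigger condition:

```python
import torch

class ActionReuseGate:
    def __init__(self, model, threshold: float = 0.98):
        self.model = model                # any callable: features -> action
        self.threshold = threshold        # hypothetical similarity cutoff
        self.prev_feat = None
        self.prev_action = None

    def step(self, feat: torch.Tensor) -> torch.Tensor:
        if self.prev_feat is not None:
            sim = torch.cosine_similarity(feat.flatten(), self.prev_feat.flatten(), dim=0)
            if sim > self.threshold:      # scene is static enough: reuse cached action
                return self.prev_action
        self.prev_feat = feat
        self.prev_action = self.model(feat)   # full forward pass only when needed
        return self.prev_action

gate = ActionReuseGate(model=lambda f: f.mean(dim=-1))  # toy stand-in for a VLA
for frame in torch.randn(5, 3, 224, 224):
    action = gate.step(frame)
```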
arXiv Detail & Related papers (2025-05-27T13:47:18Z)
- Underlying Semantic Diffusion for Effective and Efficient In-Context Learning [113.4003355229632]
Underlying Semantic Diffusion (US-Diffusion) is an enhanced diffusion model that improves underlying-semantics learning, computational efficiency, and in-context learning. We present a Feedback-Aided Learning (FAL) framework, which leverages feedback signals to guide the model in capturing semantic details. We also propose a plug-and-play Efficient Sampling Strategy (ESS) for dense sampling at time steps with high noise levels.
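ESS is described here only as dense sampling at high-noise time steps. A minimal sketch of one way to realize that, biasing the training-timestep distribution toward the noisy end; the power-law weighting and exponent are illustrative assumptions, not the paper's actual schedule:

```python
import torch

def sample_timesteps(batch: int, T: int = 1000, bias: float = 2.0) -> torch.Tensor:
    """Probability of step t grows as (t/T)**bias, so high-noise steps
    (large t) are drawn more densely than a uniform sampler would."""
    weights = (torch.arange(1, T + 1, dtype=torch.float32) / T) ** bias
    return torch.multinomial(weights, batch, replacement=True)

print(sample_timesteps(8))   # mostly large t values
```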
arXiv Detail & Related papers (2025-03-06T03:06:22Z)
- A Dual Process VLA: Efficient Robotic Manipulation Leveraging VLM [0.26334346517416873]
Vision-Language-Action (VLA) models enable robots to perform complex tasks by integrating visual context with linguistic commands, but running a large VLM at every control step is too slow for responsive manipulation.
To overcome this, we propose Dual Process VLA (DP-VLA), a hierarchical framework inspired by dual-process theory.
Experimental results on the RoboCasa dataset demonstrate that DP-VLA achieves faster inference and higher task success rates.
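A minimal sketch of the dual-process split, assuming a simple fixed-rate schedule: a large, slow model ("System 2") replans infrequently while a small, fast policy ("System 1") acts at every control step. `slow_vlm`, `fast_policy`, and the 10-step cadence are hypothetical, not DP-VLA's actual interfaces:

```python
class DualProcessController:
    def __init__(self, slow_vlm, fast_policy, slow_every: int = 10):
        self.slow_vlm = slow_vlm          # expensive: (obs, instruction) -> plan
        self.fast_policy = fast_policy    # cheap: (obs, plan) -> action
        self.slow_every = slow_every
        self.plan = None
        self.t = 0

    def act(self, obs, instruction):
        if self.t % self.slow_every == 0 or self.plan is None:
            self.plan = self.slow_vlm(obs, instruction)   # infrequent reasoning
        self.t += 1
        return self.fast_policy(obs, self.plan)           # every-step control
```

The slow path's cost is amortized over many control steps, which is how a hierarchy like this can deliver both faster inference and the task understanding of a large VLM.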
arXiv Detail & Related papers (2024-10-21T00:36:02Z)
- SOLO: A Single Transformer for Scalable Vision-Language Modeling [74.05173379908703]
We present SOLO, a single transformer for visiOn-Language mOdeling. A unified single Transformer architecture like SOLO effectively addresses the scalability concerns of LVLMs built around separate pre-trained visual encoders. In this paper, we introduce the first open-source training recipe for developing SOLO, an open-source 7B LVLM.
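A minimal sketch of the single-transformer idea: raw image patches are linearly projected and concatenated with text embeddings so one Transformer serves both modalities, with no separate vision encoder. Dimensions and layer counts are illustrative, not SOLO's configuration:

```python
import torch
import torch.nn as nn

class UnifiedVLM(nn.Module):
    def __init__(self, vocab=32000, dim=512, patch=16):
        super().__init__()
        self.patch_proj = nn.Linear(3 * patch * patch, dim)  # raw pixels -> tokens
        self.text_embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.patch = patch

    def forward(self, image, text_ids):
        b, c, h, w = image.shape
        p = self.patch
        patches = image.unfold(2, p, p).unfold(3, p, p)        # tile into p x p patches
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        seq = torch.cat([self.patch_proj(patches), self.text_embed(text_ids)], dim=1)
        return self.backbone(seq)                              # one backbone, both modalities

out = UnifiedVLM()(torch.randn(1, 3, 64, 64), torch.randint(0, 32000, (1, 8)))
print(out.shape)   # torch.Size([1, 24, 512]) -> 16 image patches + 8 text tokens
```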
arXiv Detail & Related papers (2024-07-08T22:40:15Z)
- DPEC: Dual-Path Error Compensation Method for Enhanced Low-Light Image Clarity [2.8161423494191222]
We propose the Dual-Path Error Compensation (DPEC) method to improve image quality under low-light conditions. DPEC incorporates precise pixel-level error estimation to capture subtle differences and an independent denoising mechanism to prevent noise amplification. Comprehensive quantitative and qualitative experimental results demonstrate that our algorithm significantly outperforms state-of-the-art methods in low-light image enhancement.
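A minimal structural sketch of the two paths as the summary describes them; `enhance`, `error_net`, and `denoise` are hypothetical placeholders for DPEC's learned networks, not its actual modules:

```python
import torch

def dpec_forward(x, enhance, error_net, denoise):
    """x: low-light image tensor. The error path adds a pixel-wise correction
    to the base enhancement; denoising is kept independent so the correction
    does not amplify noise."""
    base = enhance(x)                  # coarse brightness/contrast restoration
    corrected = base + error_net(x)    # pixel-level error compensation
    return denoise(corrected)          # independent noise suppression

img = torch.rand(1, 3, 64, 64)
out = dpec_forward(img,
                   enhance=lambda t: t.sqrt(),              # toy gamma-style curve
                   error_net=lambda t: torch.zeros_like(t), # toy zero correction
                   denoise=lambda t: t)                     # toy identity
```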
arXiv Detail & Related papers (2024-06-28T08:21:49Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
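A minimal sketch of the idea in the title: after an early layer (layer 2), rank visual tokens by the average attention they receive and drop roughly half, leaving text tokens untouched. This condenses the method into a single function; the sequence layout and ratio are illustrative:

```python
import torch

def fastv_prune(hidden, attn, visual_slice, keep_ratio=0.5):
    """hidden: (seq, dim); attn: (heads, seq, seq) from the filtering layer;
    visual_slice: slice covering the visual tokens within the sequence."""
    received = attn.mean(dim=0).mean(dim=0)   # avg attention each token receives
    vis_scores = received[visual_slice]
    keep = int(vis_scores.numel() * keep_ratio)
    top = vis_scores.topk(keep).indices.sort().values + visual_slice.start
    keep_idx = torch.cat([
        torch.arange(0, visual_slice.start),             # prompt tokens before the image
        top,                                             # surviving visual tokens
        torch.arange(visual_slice.stop, hidden.size(0)), # text tokens after the image
    ])
    return hidden[keep_idx]

h = torch.randn(300, 768)                 # 10 prompt + 256 visual + 34 text tokens
a = torch.rand(12, 300, 300)              # attention weights from layer 2
print(fastv_prune(h, a, slice(10, 266)).shape)   # torch.Size([172, 768])
```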
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures [96.00848293994463]
This paper introduces Vision-RWKV, a model adapted from the RWKV model used in the NLP field. Our model is designed to efficiently handle sparse inputs and demonstrates robust global processing capabilities. Our evaluations show that VRWKV surpasses ViT's performance in image classification with significantly faster speeds and lower memory usage.
arXiv Detail & Related papers (2024-03-04T18:46:20Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
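The dual formulation is the core claim. A minimal single-head sketch verifying that retention's parallel form (used for training) and recurrent form (used for fast inference) produce identical outputs; the decay value and sizes are illustrative:

```python
import torch

def retention_parallel(q, k, v, gamma):
    t = q.size(0)
    n = torch.arange(t)
    decay = gamma ** (n[:, None] - n[None, :])   # gamma^(t-s) decay matrix
    mask = torch.tril(torch.ones(t, t))          # causal: only s <= t contributes
    return (q @ k.T * decay * mask) @ v

def retention_recurrent(q, k, v, gamma):
    state = torch.zeros(k.size(1), v.size(1))    # running decayed sum of k^T v
    outs = []
    for t in range(q.size(0)):
        state = gamma * state + k[t].unsqueeze(1) @ v[t].unsqueeze(0)
        outs.append(q[t].unsqueeze(0) @ state)
    return torch.cat(outs)

q, k, v = (torch.randn(6, 4) for _ in range(3))
print(torch.allclose(retention_parallel(q, k, v, 0.9),
                     retention_recurrent(q, k, v, 0.9), atol=1e-5))  # True
```

The parallel form trains like attention; the recurrent form carries a fixed-size state, so inference cost per token stays constant regardless of sequence length.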
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention [100.81495948184649]
We present Perceiver-VL, a vision-and-language framework that efficiently handles high-dimensional multimodal inputs such as long videos and text.
Our framework scales with linear complexity, in contrast to the quadratic complexity of self-attention used in many state-of-the-art transformer-based models.
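A minimal sketch of why iterative latent attention is linear in input length: a fixed set of latent vectors cross-attends to the inputs, costing O(latents x inputs) rather than the O(inputs^2) of full self-attention. Sizes and iteration count are illustrative, not Perceiver-VL's configuration:

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    def __init__(self, n_latents=64, dim=256, heads=4, iters=2):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))  # fixed-size bottleneck
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.iters = iters

    def forward(self, inputs):                    # inputs: (batch, n, dim), n can be huge
        z = self.latents.expand(inputs.size(0), -1, -1)
        for _ in range(self.iters):               # iterative refinement of the latents
            z, _ = self.attn(query=z, key=inputs, value=inputs)
        return z                                  # fixed-size summary: (batch, 64, dim)

video_tokens = torch.randn(2, 10000, 256)         # e.g. tokens from a long video
print(LatentCrossAttention()(video_tokens).shape) # torch.Size([2, 64, 256])
```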
arXiv Detail & Related papers (2022-11-21T18:22:39Z)