LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
- URL: http://arxiv.org/abs/2506.07416v1
- Date: Mon, 09 Jun 2025 04:30:25 GMT
- Title: LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
- Authors: Jin Huang, Yuchao Jin, Le An, Josh Park
- Abstract summary: This paper introduces an efficient Vision-Language Model (VLM) pipeline optimized for deployment on embedded devices. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views.
- Score: 3.5132364723753797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for an autonomous driving application, our pipeline achieves a $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate our pipeline as a viable solution for enabling real-time VLM deployment in resource-constrained environments.
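As a rough illustration of how the three stages named in the abstract (patch selection, token selection, speculative decoding) could be composed, here is a minimal sketch; all module names and interfaces below are hypothetical placeholders, not the paper's actual API.

```python
# A minimal sketch of the three-stage pipeline described in the abstract.
# All module names and interfaces are hypothetical illustrations, not the
# paper's actual implementation.
import torch


def litevlm_generate(camera_views, prompt, vision_encoder, patch_selector,
                     token_selector, draft_llm, target_llm, max_new_tokens=64):
    # Stage 1: patch selection drops camera views/patches irrelevant to the query.
    keep = patch_selector(camera_views, prompt)            # indices of relevant views
    visual_feats = vision_encoder(camera_views[keep])

    # Stage 2: token selection shortens the visual sequence fed to the LLM.
    visual_tokens = token_selector(visual_feats, prompt)

    # Stage 3: speculative decoding: a small draft model proposes tokens and
    # the large target model verifies them in a single forward pass.
    tokens = target_llm.prefill(prompt, visual_tokens)
    while len(tokens) < max_new_tokens:
        draft = draft_llm.propose(tokens, num_draft=4)
        accepted = target_llm.verify(tokens, draft)         # accepted prefix of the draft
        tokens = torch.cat([tokens, accepted])
    return target_llm.detokenize(tokens)
```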
Related papers
- PEVLM: Parallel Encoding for Vision-Language Models [4.777805570120456]
We introduce PEVLM, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of Vision-Language Models. PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention. Experiments demonstrate that PEVLM consistently outperforms existing parallel encoding approaches, achieving up to a 7.47x speedup in attention computation and reducing end-to-end latency by 40%.
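The block layout described above can be sketched as follows; the encoder interface, block size, and position-id handling are assumptions for illustration, not PEVLM's implementation.

```python
# Hedged sketch of parallel block encoding with a shared sink block: each context
# block is encoded together with the sink block, but keeps the position ids it
# would have had under full attention. Interfaces here are illustrative only.
import torch


def parallel_prefill(video_tokens, sink_tokens, encoder, block_size=1024):
    sink_len = sink_tokens.shape[0]
    blocks = torch.split(video_tokens, block_size)          # context blocks
    outputs = []
    for i, block in enumerate(blocks):
        # Each block is prefixed with the shared sink block ...
        inp = torch.cat([sink_tokens, block])
        # ... while retaining the sequential position ids of the full sequence.
        start = sink_len + i * block_size
        pos = torch.cat([torch.arange(sink_len),
                         torch.arange(start, start + block.shape[0])])
        outputs.append(encoder(inp, position_ids=pos)[sink_len:])
    # Blocks are independent, so in practice they can be encoded in parallel.
    return torch.cat(outputs)
```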
arXiv Detail & Related papers (2025-06-24T14:14:52Z) - PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding [4.734824660843965]
PipeSpec is a framework that generalizes speculative decoding to $k$ models arranged in a hierarchical pipeline. We show that PipeSpec achieves up to a 2.54$\times$ speedup while outperforming state-of-the-art methods.
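A simplified sketch of a $k$-model speculative chain in this spirit is shown below; the propose/verify interfaces are assumptions, and the real framework runs the stages asynchronously rather than in this sequential loop.

```python
# Illustrative sketch of hierarchical speculative decoding: each smaller model
# drafts tokens that the next (larger) model verifies. Interfaces are assumed.
def hierarchical_speculate(models, prompt_tokens, max_new_tokens=64, k=4):
    tokens = list(prompt_tokens)
    while len(tokens) - len(prompt_tokens) < max_new_tokens:
        # The smallest model proposes a k-token draft.
        draft = models[0].propose(tokens, k)
        # Each larger model in the pipeline verifies the surviving draft in turn.
        for verifier in models[1:]:
            draft = verifier.verify(tokens, draft)   # returns the accepted prefix
            if not draft:
                break
        if draft:
            tokens.extend(draft)
        else:
            # Nothing accepted: fall back to one token from the largest model.
            tokens.append(models[-1].sample_next(tokens))
    return tokens
```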
arXiv Detail & Related papers (2025-05-02T20:29:31Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevalent backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
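The general idea of non-uniform logarithmic quantization can be sketched as below; the exact parameterization and calibration procedure of AdaLog differ, and tuning the base per layer is only a hint at what makes the quantizer "adaptive".

```python
# Hedged sketch of log-domain post-training quantization with an adjustable base.
# This is a generic illustration, not AdaLog's actual algorithm.
import math
import torch


def log_quantize(x, base=2.0, n_bits=4, scale=None, eps=1e-8):
    """Quantize positive activations (e.g. post-softmax) onto a log-`base` grid."""
    if scale is None:
        scale = x.max()
    levels = 2 ** n_bits - 1
    # Map to the log domain: values near the scale get finer relative resolution.
    q = torch.round(-torch.log(x.clamp(min=eps) / scale) / math.log(base))
    q = q.clamp(0, levels)
    # Dequantize back to the original domain.
    return scale * base ** (-q)

# Choosing `base` per layer (rather than fixing base 2) lets the grid adapt to
# each layer's activation distribution.
```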
arXiv Detail & Related papers (2024-07-17T18:38:48Z) - Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVs [13.037162115493393]
On-sensor hardware acceleration is a promising approach to enable low-latency Visual Inertial Odometry (VIO).
This paper assesses the speed-up in a VIO sensor system exploiting a compact optical flow (OF) sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC).
By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction of compute load of the VIO pipeline over the original VINS-Mono implementation.
arXiv Detail & Related papers (2024-06-19T08:51:19Z) - Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks [54.31708859631821]
We propose a family of operations, called routing functions, to enhance vision-language (VL) alignment in low-rank bottlenecks.
In various VL PEFT settings, the routing functions significantly improve performance of the original PEFT methods.
arXiv Detail & Related papers (2024-03-14T13:27:42Z) - An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that the attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
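A hedged sketch of attention-based visual-token pruning in this spirit follows; the layer at which pruning happens, the keep ratio, and the tensor layout are illustrative assumptions rather than FastV's exact design.

```python
# Illustrative sketch: after an early layer, visual tokens that receive little
# attention are dropped for the remaining layers. Shapes and ratios are assumed.
import torch


def prune_visual_tokens(hidden, attn_weights, visual_slice, keep_ratio=0.5):
    """hidden: (seq, dim); attn_weights: (heads, seq, seq) from the current layer."""
    vis_start, vis_end = visual_slice
    # Average attention each visual token receives across heads and query positions.
    received = attn_weights.mean(dim=0).mean(dim=0)[vis_start:vis_end]
    k = max(1, int(keep_ratio * (vis_end - vis_start)))
    keep = received.topk(k).indices.sort().values + vis_start
    keep_all = torch.cat([torch.arange(vis_start),
                          keep,
                          torch.arange(vis_end, hidden.shape[0])])
    return hidden[keep_all]
```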
arXiv Detail & Related papers (2024-03-11T14:35:32Z) - ALTO: An Efficient Network Orchestrator for Compound AI Systems [20.880866765513066]
ALTO is a network orchestrator for efficiently serving compound AI systems such as pipelines of language models.
As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible.
We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances.
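A toy sketch of the token-level streaming opportunity described above is given below; the queue-based wiring and stage interfaces are assumptions for illustration, not ALTO's actual design.

```python
# Toy sketch of streaming intermediate outputs between two pipeline stages:
# the downstream stage starts consuming tokens before the upstream model finishes.
import queue
import threading


def stream_between_stages(upstream_tokens, downstream_consume):
    """upstream_tokens: iterator of generated tokens;
    downstream_consume: callable applied to each token as it arrives."""
    buf = queue.Queue()

    def producer():
        for tok in upstream_tokens:      # upstream LM emits tokens one by one
            buf.put(tok)                 # forward each token immediately
        buf.put(None)                    # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (tok := buf.get()) is not None:
        downstream_consume(tok)          # downstream stage processes incrementally
```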
arXiv Detail & Related papers (2024-03-07T08:30:26Z) - Efficient NLP Inference at the Edge via Elastic Pipelining [0.42970700836450487]
WRX reconciles the latency/memory tension via two novel techniques.
We build WRX and evaluate it against a range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU.
arXiv Detail & Related papers (2022-07-11T17:15:57Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete settings (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC services.
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - Energy-Efficient Model Compression and Splitting for Collaborative
Inference Over Time-Varying Channels [52.60092598312894]
We propose a technique to reduce the total energy bill at the edge device by utilizing model compression and time-varying model split between the edge and remote nodes.
Our proposed solution results in minimal energy consumption and $CO_2$ emissions compared to the considered baselines.
arXiv Detail & Related papers (2021-06-02T07:36:27Z) - NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function
Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as serious contenders to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)