LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
- URL: http://arxiv.org/abs/2506.07416v2
- Date: Fri, 31 Oct 2025 20:18:06 GMT
- Title: LiteVLM: A Low-Latency Vision-Language Model Inference Pipeline for Resource-Constrained Environments
- Authors: Jin Huang, Yuchao Jin, Le An, Josh Park,
- Abstract summary: This paper introduces an efficient Vision-Language Model (VLM) pipeline optimized for deployment on embedded devices. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views.
- Score: 4.2226391610434275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces an efficient Vision-Language Model (VLM) pipeline specifically optimized for deployment on embedded devices, such as those used in robotics and autonomous driving. The pipeline significantly reduces the computational overhead by jointly leveraging patch selection to filter irrelevant camera views, a token selection module to reduce input sequence length for the LLM, and speculative decoding to accelerate token generation. Evaluated on the NVIDIA DRIVE Thor platform for an autonomous driving application, our pipeline achieves $2.5\times$ end-to-end latency reduction without compromising task accuracy. The speed-up further increases to $3.2\times$ when applying FP8 post-training quantization. These results demonstrate our pipeline as a viable solution for enabling real-time VLM deployment in resource-constrained environments.
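The two speed-up figures in the abstract can be related with a little arithmetic. The sketch below is illustrative only: it assumes both factors are measured against the same unoptimized baseline and that "speed-up" means baseline latency divided by optimized latency. The function and constant names are hypothetical and do not come from the paper.

```python
# Speed-up factors quoted in the abstract (assumed to share one baseline).
SPEEDUP_PIPELINE = 2.5  # patch selection + token selection + speculative decoding
SPEEDUP_WITH_FP8 = 3.2  # the same pipeline plus FP8 post-training quantization


def remaining_latency_fraction(speedup: float) -> float:
    """Fraction of the baseline latency that remains after a given speed-up."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def incremental_speedup(before: float, after: float) -> float:
    """Extra speed-up factor gained when moving from `before` to `after`."""
    if before <= 0 or after <= 0:
        raise ValueError("speed-up factors must be positive")
    return after / before


if __name__ == "__main__":
    # Additional factor attributable to FP8 under the shared-baseline assumption.
    fp8_gain = incremental_speedup(SPEEDUP_PIPELINE, SPEEDUP_WITH_FP8)
    print(f"Incremental factor from FP8: {fp8_gain:.2f}x")
    for label, s in [("pipeline", SPEEDUP_PIPELINE), ("pipeline + FP8", SPEEDUP_WITH_FP8)]:
        print(f"{label}: {remaining_latency_fraction(s):.2%} of baseline latency remains")
```

Under these assumptions, FP8 quantization adds roughly a further 1.28x on top of the 2.5x pipeline, which takes the remaining latency from 40% to about 31% of the baseline.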
Related papers
- Data Driven Optimization of GPU efficiency for Distributed LLM Adapter Serving [2.6336040306318274]
Large Language Model (LLM) adapters enable low-cost model specialization.
LLM adapters introduce complex caching and scheduling challenges in distributed serving systems where hundreds of adapters must be hosted concurrently.
This paper presents a data-driven pipeline that computes an adapter placement that serves the workload with the minimum number of GPUs.
arXiv Detail & Related papers (2026-02-27T14:22:51Z)
- A Serverless Edge-Native Data Processing Architecture for Autonomous Driving Training [0.0]
This paper introduces the framework, an edge-native platform that enables on-vehicle data filtering and processing through user-defined functions.
We evaluate the framework on an NVIDIA Jetson Orin Nano and compare it against native ROS 2 deployments.
Results show competitive performance, reduced latency and jitter, and confirm that Lambda-based abstractions can support real-time data processing in embedded autonomous driving systems.
arXiv Detail & Related papers (2026-01-30T12:41:11Z)
- An LLVM-Based Optimization Pipeline for SPDZ [0.0]
We implement a proof-of-concept LLVM-based optimization pipeline for the SPDZ protocol.
Our front end accepts a subset of C with lightweight privacy annotations and lowers it to LLVM IR.
Our back end performs data-flow and control-flow analysis on the optimized IR to drive a non-blocking runtime scheduler.
arXiv Detail & Related papers (2025-12-11T20:53:35Z)
- InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models [49.08289742711585]
We propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet.
We show that InfiniteVL achieves over $3.6\times$ inference speedup while maintaining constant latency and memory footprint.
In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache.
arXiv Detail & Related papers (2025-12-09T17:18:32Z)
- SpecVLM: Fast Speculative Decoding in Vision-Language Models [14.243294546325714]
Speculative decoding is a powerful way to accelerate autoregressive large language models (LLMs).
We study speculative decoding for vision-language models (VLMs).
We introduce SpecVLM, a practical system that delivers 1.5--2.3x end-to-end speedups over full autoregressive inference.
arXiv Detail & Related papers (2025-09-15T11:53:56Z)
- PEVLM: Parallel Encoding for Vision-Language Models [4.777805570120456]
We introduce PEVLM, a fine-tuning-free parallel encoding method designed to enhance the prefilling efficiency of Vision-Language Models.
PEVLM partitions the input video into context blocks with a shared sink block, while preserving sequential position embeddings to align the attention weight distribution with that of Full-Attention.
Experiments demonstrate that PEVLM consistently outperforms existing parallel encoding approaches, achieving up to 7.47x speedup in attention computation and reducing end-to-end latency by 40%.
arXiv Detail & Related papers (2025-06-24T14:14:52Z)
- PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding [4.734824660843965]
PipeSpec is a framework that generalizes speculative decoding to $k$ models arranged in a hierarchical pipeline.
We show that PipeSpec achieves up to $2.54\times$ speedup while outperforming state-of-the-art methods.
arXiv Detail & Related papers (2025-05-02T20:29:31Z)
- Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z)
- AdaLog: Post-Training Quantization for Vision Transformers with Adaptive Logarithm Quantizer [54.713778961605115]
Vision Transformer (ViT) has become one of the most prevailing fundamental backbone networks in the computer vision community.
We propose a novel non-uniform quantizer, dubbed the Adaptive Logarithm (AdaLog) quantizer.
arXiv Detail & Related papers (2024-07-17T18:38:48Z)
- Low Latency Visual Inertial Odometry with On-Sensor Accelerated Optical Flow for Resource-Constrained UAVs [13.037162115493393]
On-sensor hardware acceleration is a promising approach to enable low latency Visual Inertial Odometry (VIO).
This paper assesses the speed-up in a VIO sensor system exploiting a compact OF sensor consisting of a global shutter camera and an Application Specific Integrated Circuit (ASIC).
By replacing the feature tracking logic of the VINS-Mono pipeline with data from this OF camera, we demonstrate a 49.4% reduction in latency and a 53.7% reduction of compute load of the VIO pipeline over the original VINS-Mono implementation.
arXiv Detail & Related papers (2024-06-19T08:51:19Z)
- Introducing Routing Functions to Vision-Language Parameter-Efficient Fine-Tuning with Low-Rank Bottlenecks [54.31708859631821]
We propose a family of operations, called routing functions, to enhance vision-language (VL) alignment in low-rank bottlenecks.
In various VL PEFT settings, the routing functions significantly improve performance of the original PEFT methods.
arXiv Detail & Related papers (2024-03-14T13:27:42Z)
- An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models [65.37846460916042]
We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs.
We introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency.
arXiv Detail & Related papers (2024-03-11T14:35:32Z)
- ALTO: An Efficient Network Orchestrator for Compound AI Systems [20.880866765513066]
ALTO is a network orchestrator for efficiently serving compound AI systems such as pipelines of language models.
As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible.
We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances.
arXiv Detail & Related papers (2024-03-07T08:30:26Z)
- Efficient NLP Inference at the Edge via Elastic Pipelining [0.42970700836450487]
WRX reconciles the latency/memory tension via two novel techniques.
We build WRX and evaluate it against a range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU.
arXiv Detail & Related papers (2022-07-11T17:15:57Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical computation restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL)-Soft Actor Critic for discrete (SAC-d), which generates the exit point, exit point, and compressing bits by soft policy iterations.
Based on the latency and accuracy aware reward design, such a computation can well adapt to the complex environment like dynamic wireless channel and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Energy-Efficient Model Compression and Splitting for Collaborative Inference Over Time-Varying Channels [52.60092598312894]
We propose a technique to reduce the total energy bill at the edge device by utilizing model compression and time-varying model split between the edge and remote nodes.
Our proposed solution results in minimal energy consumption and $CO_2$ emission compared to the considered baselines.
arXiv Detail & Related papers (2021-06-02T07:36:27Z)
- NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.