SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration
- URL: http://arxiv.org/abs/2506.12723v2
- Date: Thu, 19 Jun 2025 07:54:06 GMT
- Title: SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration
- Authors: Ye Li, Yuan Meng, Zewen Sun, Kangye Ji, Chen Tang, Jiajun Fan, Xinzhu Ma, Shutao Xia, Zhi Wang, Wenwu Zhu,
- Abstract summary: Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. Their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens.
- Score: 69.54069477520534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities. However, their high computational cost and low execution frequency hinder their suitability for real-time tasks such as robotic manipulation and autonomous navigation. Existing VLA acceleration methods primarily focus on structural optimization, overlooking the fact that these models operate in sequential decision-making environments. As a result, temporal redundancy in sequential action generation and spatial redundancy in visual input remain unaddressed. To this end, we propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens. Specifically, we design an action-aware model scheduling mechanism that reduces temporal redundancy by dynamically switching between the VLA model and a lightweight generator. Inspired by the human motion pattern of focusing on key decision points while relying on intuition for other actions, we categorize VLA actions into deliberative and intuitive, assigning the former to the VLA model and the latter to the lightweight generator, enabling frequency-adaptive execution through collaborative model scheduling. To address spatial redundancy, we further develop a spatio-semantic dual-aware token pruning method. Tokens are classified into spatial and semantic types and pruned based on their dual-aware importance to accelerate VLA inference. These two mechanisms work jointly to guide the VLA in focusing on critical actions and salient visual information, achieving effective acceleration while maintaining high accuracy. Experimental results demonstrate that our method achieves up to 1.5$\times$ acceleration with less than 3% drop in accuracy, outperforming existing approaches in multiple tasks.
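As a rough illustration of how the two mechanisms described in the abstract could fit together, the sketch below pairs a frequency-adaptive control loop with a dual-score token pruning step. All names and thresholds here (is_deliberative, the equal spatial/semantic weighting, the 0.5 keep ratio, the injected model callables) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def prune_tokens(tokens, spatial_scores, semantic_scores, keep_ratio=0.5):
    """Spatio-semantic dual-aware pruning (sketch): rank every visual token by a
    combined spatial/semantic importance score and keep only the top fraction."""
    scores = 0.5 * spatial_scores + 0.5 * semantic_scores   # assumed equal weighting
    keep = max(1, int(len(tokens) * keep_ratio))
    top = np.argsort(scores)[::-1][:keep]                   # most important tokens
    return tokens[np.sort(top)]                             # preserve original token order


def run_episode(observations, vla_model, lightweight_generator, is_deliberative):
    """Action-aware model scheduling (sketch): call the full VLA model only at
    deliberative steps; generate intuitive steps with a cheap generator."""
    prev_action, trajectory = None, []
    for obs in observations:
        if prev_action is None or is_deliberative(obs, prev_action):
            visual = prune_tokens(obs["tokens"], obs["spatial"], obs["semantic"])
            prev_action = vla_model(visual, obs["instruction"])       # slow, accurate path
        else:
            prev_action = lightweight_generator(prev_action, obs)     # fast, intuitive path
        trajectory.append(prev_action)
    return trajectory
```

In this form the speedup comes from two independent knobs: how often is_deliberative fires (temporal redundancy) and how aggressively keep_ratio prunes (spatial redundancy).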
Related papers
- Spec-VLA: Speculative Decoding for Vision-Language-Action Models with Relaxed Acceptance [27.47568622830058]
This work introduces Spec-VLA, an SD framework designed to accelerate Vision-Language-Action (VLA) models. To boost the generation speed, we propose an effective mechanism to relax acceptance utilizing the relative distances represented by the action tokens of the VLA model.
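A hedged sketch of the relaxed-acceptance idea in generic speculative-decoding terms: a draft action token is accepted if it is close enough to the verifier's token in the discretized action space. The tolerance value and the accept-prefix policy below are illustrative assumptions, not Spec-VLA's actual criterion.

```python
def relaxed_accept(draft_tokens, verify_tokens, tolerance=2):
    """Accept the longest prefix of draft action tokens whose IDs are within
    `tolerance` bins of the verifier's tokens. Since neighbouring action-token
    IDs encode nearby physical actions, small mismatches can be tolerated,
    letting more draft tokens through per verification pass."""
    accepted = []
    for draft, verified in zip(draft_tokens, verify_tokens):
        if abs(draft - verified) <= tolerance:   # "close enough" in action space
            accepted.append(draft)
        else:
            accepted.append(verified)            # fall back to the verifier's token
            break                                # stop at the first rejection, as in standard SD
    return accepted


# The first three drafts are close enough; the fourth is replaced by the verifier's token.
print(relaxed_accept([101, 55, 230, 17], [100, 57, 231, 90]))   # -> [101, 55, 230, 90]
```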
arXiv Detail & Related papers (2025-07-30T07:04:09Z)
- EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) Eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) Leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z)
- CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation [67.1520483301709]
CronusVLA is a unified framework that extends single-frame VLA models to the multi-frame paradigm through an efficient post-training stage. CronusVLA achieves state-of-the-art performance on SimplerEnv with 70.9% success rate, and 12.7% improvement over OpenVLA on LIBERO.
arXiv Detail & Related papers (2025-06-24T17:30:27Z)
- Lightweight Task-Oriented Semantic Communication Empowered by Large-Scale AI Models [66.57755931421285]
Large-scale artificial intelligence (LAI) models pose significant challenges for real-time communication scenarios. This paper proposes utilizing knowledge distillation (KD) techniques to extract and condense knowledge from LAI models. We propose a fast distillation method featuring a pre-stored compression mechanism that eliminates the need for repetitive inference.
arXiv Detail & Related papers (2025-06-16T08:42:16Z)
- Think Twice, Act Once: Token-Aware Compression and Action Reuse for Efficient Inference in Vision-Language-Action Models [30.7855782696894]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for general-purpose robot control through natural language instructions. We propose FlashVLA, the first training-free and plug-and-play acceleration framework that enables action reuse in VLA models.
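A minimal illustration of training-free action reuse: skip the expensive VLA forward pass whenever the visual input has barely changed since the last executed action. The cosine-similarity trigger and threshold are assumptions for illustration, not FlashVLA's actual mechanism.

```python
import numpy as np


def act_with_reuse(frame_features, vla_forward, sim_threshold=0.99):
    """Reuse the previous action when the current frame is nearly identical to
    the last frame that triggered a full model call."""
    prev_feat, prev_action, actions = None, None, []
    for feat in frame_features:
        if prev_feat is not None:
            cos = float(np.dot(feat, prev_feat) /
                        (np.linalg.norm(feat) * np.linalg.norm(prev_feat) + 1e-8))
            if cos > sim_threshold:
                actions.append(prev_action)      # scene barely changed: reuse last action
                continue
        prev_action = vla_forward(feat)          # otherwise run the full VLA model
        prev_feat = feat
        actions.append(prev_action)
    return actions
```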
arXiv Detail & Related papers (2025-05-27T13:47:18Z)
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
We introduce HybridVLA, a unified framework that seamlessly integrates autoregressive and diffusion policies within a single large language model. With this recipe, we find that these two forms of action prediction not only reinforce each other but also exhibit varying performance across different tasks. In experiments, HybridVLA outperforms previous state-of-the-art VLA methods across various simulation and real-world tasks.
arXiv Detail & Related papers (2025-03-13T17:59:52Z)
- Accelerating Vision-Language-Action Model Integrated with Action Chunking via Parallel Decoding [24.1236728596359]
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. We propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations.
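A generic illustration of the parallel fixed-point idea, in Jacobi-iteration style over a whole action chunk; the decode_step callable, initialization, and convergence check are hypothetical stand-ins rather than PD-VLA's actual implementation.

```python
import numpy as np


def parallel_fixed_point_decode(decode_step, chunk_len, vocab_size, max_iters=10, seed=0):
    """Refine an entire action chunk in each forward pass until the sequence
    stops changing (a fixed point), instead of emitting one token per pass."""
    rng = np.random.default_rng(seed)
    tokens = rng.integers(0, vocab_size, size=chunk_len)   # arbitrary initial guess
    for _ in range(max_iters):
        new_tokens = decode_step(tokens)    # one parallel pass over the whole chunk
        if np.array_equal(new_tokens, tokens):
            break                           # fixed point reached
        tokens = new_tokens
    return tokens
```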
arXiv Detail & Related papers (2025-03-04T06:12:08Z)
- TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation [32.406783380729024]
Vision-Language-Action (VLA) models have shown remarkable potential in visuomotor control and instruction comprehension through end-to-end learning processes. Current VLA models face significant challenges: they are slow during inference and require extensive pre-training on large amounts of robotic data. We introduce a new family of compact vision-language-action models, called TinyVLA, which offers two key advantages over existing VLA models.
arXiv Detail & Related papers (2024-09-19T07:10:18Z)
- A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z)
- A Self-Correcting Vision-Language-Action Model for Fast and Slow System Manipulation [30.207690822989292]
The Self-corrected (SC-)VLA framework integrates a fast system for directly predicting actions and a slow system for reflecting on failed actions. For the fast system, we incorporate parameter-efficient fine-tuning to equip the model with pose prediction capabilities. For the slow system, we propose a Chain-of-Thought training strategy for failure correction, designed to mimic human reflection after a manipulation failure.
arXiv Detail & Related papers (2024-05-27T17:58:48Z)