ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
- URL: http://arxiv.org/abs/2512.20276v1
- Date: Tue, 23 Dec 2025 11:29:03 GMT
- Title: ActionFlow: A Pipelined Action Acceleration for Vision Language Models on Edge
- Authors: Yuntao Dai, Hang Gu, Teng Wang, Qianyu Cheng, Yifei Zheng, Zhiyong Qiu, Lei Gong, Wenqi Lou, Xuehai Zhou
- Abstract summary: Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control. Current VLA models operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. We introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms.
- Score: 11.016302257907936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language-Action (VLA) models have emerged as a unified paradigm for robotic perception and control, enabling emergent generalization and long-horizon task execution. However, their deployment in dynamic, real-world environments is severely hindered by high inference latency. While smooth robotic interaction requires control frequencies of 20 to 30 Hz, current VLA models typically operate at only 3-5 Hz on edge devices due to the memory-bound nature of autoregressive decoding. Existing optimizations often require extensive retraining or compromise model accuracy. To bridge this gap, we introduce ActionFlow, a system-level inference framework tailored for resource-constrained edge platforms. At the core of ActionFlow is a Cross-Request Pipelining strategy, a novel scheduler that redefines VLA inference as a macro-pipeline of micro-requests. The strategy intelligently batches memory-bound Decode phases with compute-bound Prefill phases across continuous time steps to maximize hardware utilization. Furthermore, to support this scheduling, we propose a Cross-Request State Packed Forward operator and a Unified KV Ring Buffer, which fuse fragmented memory operations into efficient dense computations. Experimental results demonstrate that ActionFlow achieves a 2.55x improvement in FPS on the OpenVLA-7B model without retraining, enabling real-time dynamic manipulation on edge hardware. Our work is available at https://anonymous.4open.science/r/ActionFlow-1D47.
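To make the scheduling idea concrete, here is a minimal sketch of cross-request pipelining over a unified KV ring buffer. It illustrates the pattern the abstract describes and is not the authors' implementation; the slot management, batching policy, and kernel interfaces are assumptions.

```python
# Minimal sketch of cross-request pipelining (illustration only). Instead of
# running each control step's request serially (prefill -> decode -> ...),
# the scheduler co-batches the compute-bound prefill of request t+1 with the
# memory-bound decode steps of request t, so the accelerator is never purely
# bandwidth-limited.

from collections import deque

class KVRingBuffer:
    """Unified KV cache: one preallocated ring, slots reused across requests."""
    def __init__(self, num_slots):
        self.free = deque(range(num_slots))
        self.owner = {}                          # slot -> request id

    def acquire(self, req_id):
        slot = self.free.popleft()               # O(1), no per-request malloc
        self.owner[slot] = req_id
        return slot

    def release(self, slot):
        del self.owner[slot]
        self.free.append(slot)

def schedule(pending_prefill, active_decode):
    """One macro-pipeline tick: fuse at most one prefill with all decodes."""
    batch = list(active_decode)                  # memory-bound micro-requests
    if pending_prefill:
        batch.append(pending_prefill.popleft())  # compute-bound micro-request
    return batch                                 # run as one packed forward pass
```

Running the fused batch as a single forward pass, with decode tokens and prefill tokens packed along the sequence axis, is roughly the role the abstract assigns to the Cross-Request State Packed Forward operator.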
Related papers
- LiteVLA-Edge: Quantized On-Device Multimodal Control for Embedded Robotics [0.6119773373677944]
We present LiteVLA-Edge, a deployment-oriented VLA pipeline for fully on-device inference on Jetson Orin-class hardware. Our approach combines supervised image-to-action fine-tuning in FP32 with post-training 4-bit GGUF quantization and GPU-accelerated inference. Under our configuration, LiteVLA-Edge achieves a mean end-to-end runtime of 150.5 ms (approximately 6.6 Hz) while operating entirely offline.
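The reported figures are easy to sanity-check: 1000 / 150.5 ms ≈ 6.6 Hz. The timing harness below is a generic sketch of that arithmetic; `policy_step` is a hypothetical stand-in for the quantized on-device forward pass.

```python
# Generic latency-to-control-rate harness (sketch, not LiteVLA-Edge code).
import time

def mean_control_rate(policy_step, n_iters=20):
    t0 = time.perf_counter()
    for _ in range(n_iters):
        policy_step()
    elapsed = time.perf_counter() - t0
    mean_ms = 1000.0 * elapsed / n_iters
    return mean_ms, 1000.0 / mean_ms             # (latency in ms, rate in Hz)

# Simulate a 150.5 ms step: prints roughly (150.5, 6.6).
print(mean_control_rate(lambda: time.sleep(0.1505)))
```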
arXiv Detail & Related papers (2026-03-03T03:20:52Z) - HybridFlow: A Two-Step Generative Policy for Robotic Manipulation [2.2200541495683996]
MeanFlow, as a one-step variant of flow matching, has shown strong potential in image generation. HybridFlow balances inference speed and generation quality by leveraging the speed advantage of MeanFlow's one-step generation. We envision HybridFlow as a practical low-latency method to enhance real-world interaction capabilities of robotic manipulation policies.
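The speed argument can be made concrete with a small sketch. A MeanFlow-style network predicts the average velocity over an interval, so generation collapses to a single jump, whereas standard flow matching integrates the instantaneous velocity over many Euler steps. Both functions below take hypothetical trained networks; this is not HybridFlow's code.

```python
# Sketch of one-step MeanFlow sampling vs. multi-step flow matching.
# A MeanFlow network predicts the *average* velocity u(x, r, t) over [r, t],
# so the whole flow integrates in a single jump:  x_t = x_r + (t - r) u(x_r, r, t)

def one_step_sample(mean_velocity, noise):
    # one forward pass: jump from t=0 (noise) straight to t=1 (sample)
    return noise + mean_velocity(noise, r=0.0, t=1.0)

def multi_step_sample(velocity, noise, steps=10):
    # ordinary flow matching: many Euler steps of the instantaneous field
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, t=i * dt)
    return x
```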
arXiv Detail & Related papers (2026-02-14T10:50:23Z) - AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge [49.66156306240961]
High latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines.
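The decoupling pattern the abstract describes can be sketched as two loops sharing a latest-plan buffer; this is a generic illustration with hypothetical `vlm_infer` and `act` callables, not AsyncVLA's code.

```python
# Sketch of asynchronous decoupling: a slow thread refreshes the semantic
# plan; the fast control loop never blocks on model inference and always
# acts on the latest available plan.

import threading, time

latest_plan = {"goal": None}
lock = threading.Lock()

def slow_reasoner(vlm_infer):
    while True:
        plan = vlm_infer()                 # hundreds of ms on edge hardware
        with lock:
            latest_plan["goal"] = plan

def fast_controller(act, hz=30):
    period = 1.0 / hz
    while True:
        with lock:
            plan = latest_plan["goal"]
        act(plan)                          # reactive step on a possibly stale plan
        time.sleep(period)

# Usage: threading.Thread(target=slow_reasoner, args=(my_vlm,), daemon=True).start()
```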
arXiv Detail & Related papers (2026-02-13T21:31:19Z) - TIDAL: Temporally Interleaved Diffusion and Action Loop for High-Frequency VLA Control [15.534182843429043]
Large-scale Vision-Language-Action (VLA) models offer semantic generalization but suffer from high inference latency. We propose TIDAL, a hierarchical framework that decouples semantic reasoning from high-frequency actuation. TIDAL operates as a backbone-agnostic module for diffusion-based VLAs, using a dual-frequency architecture.
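A dual-frequency loop of the kind the abstract suggests might look like the sketch below: the slow diffusion head produces a chunk of future actions, and a fast loop plays the chunk out while the next one is prepared. All interfaces are hypothetical; TIDAL's actual interleaving schedule may differ.

```python
# Sketch of a dual-frequency control pattern (illustration, not TIDAL's code).
def dual_frequency_loop(denoise_chunk, execute, replan_every=8):
    # denoise_chunk() -> list of future actions (slow: full denoising pass)
    # execute(a)      -> apply one action      (fast: one control tick)
    chunk, idx = denoise_chunk(), 0
    while True:
        execute(chunk[idx])
        idx += 1
        if idx == replan_every:            # refresh before the chunk runs out
            chunk, idx = denoise_chunk(), 0
```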
arXiv Detail & Related papers (2026-01-21T12:43:11Z) - Asynchronous Fast-Slow Vision-Language-Action Policies for Whole-Body Robotic Manipulation [10.09057399213028]
Vision-Language-Action (VLA) systems integrate a Vision-Language Model (VLM) for semantic reasoning with an action expert generating continuous action signals. We introduce a truly asynchronous Fast-Slow VLA framework (DuoCore-FS) that organizes the system into a fast pathway for action generation and a slow pathway for rich VLM reasoning.
arXiv Detail & Related papers (2025-12-23T09:28:20Z) - ARMFlow: AutoRegressive MeanFlow for Online 3D Human Reaction Generation [48.716675019745885]
3D human reaction generation faces three main challenges: high motion fidelity, real-time inference, and autoregressive adaptability for online scenarios. We propose ARMFlow, a MeanFlow-based autoregressive framework that models temporal dependencies between motions and velocity. Our single-step online generation surpasses existing methods on InterHuman and InterX by over 40% in FID, while matching offline state-of-the-art performance despite using only partial sequence conditions.
arXiv Detail & Related papers (2025-12-18T06:28:42Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and being gradient-free, it offers significant computational benefits.
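The verifier idea admits a very short sketch: sample several candidate action chunks, score each with the pseudo-count estimator, and execute the best-supported one. The interfaces are hypothetical, not TACO's code.

```python
# Sketch of verifier-guided test-time scaling in the anti-exploration spirit:
# prefer the action chunk with the most support in the offline data.
# Gradient-free: only forward passes are needed.

def select_action_chunk(sample_chunk, pseudo_count, n_candidates=8):
    candidates = [sample_chunk() for _ in range(n_candidates)]
    # higher pseudo-count ~= more "in-distribution" under the training data
    return max(candidates, key=pseudo_count)
```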
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by 6.8×, and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z) - SuperFlow++: Enhanced Spatiotemporal Consistency for Cross-Modal Data Pretraining [62.433137130087445]
SuperFlow++ is a novel framework that integrates pretraining and downstream tasks using consecutive camera pairs. We show that SuperFlow++ outperforms state-of-the-art methods across diverse tasks and driving conditions. With strong generalizability and computational efficiency, SuperFlow++ establishes a new benchmark for data-efficient LiDAR-based perception in autonomous driving.
arXiv Detail & Related papers (2025-03-25T17:59:57Z) - FAST: Efficient Action Tokenization for Vision-Language-Action Models [98.15494168962563]
We propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories.
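The core intuition behind DCT-based action compression fits in a few lines: smooth trajectories concentrate energy in low-frequency coefficients, so truncating the spectrum loses little. The SciPy sketch below is illustrative only and omits the real tokenizer's quantization and entropy-coding stages.

```python
# Sketch of DCT-based trajectory compression in the spirit of FAST.
import numpy as np
from scipy.fft import dct, idct

def compress(actions, keep=8):
    # actions: (T,) trajectory for one action dimension
    coeffs = dct(actions, norm="ortho")
    return coeffs[:keep]                       # drop high-frequency content

def decompress(coeffs, length):
    full = np.zeros(length)
    full[: len(coeffs)] = coeffs
    return idct(full, norm="ortho")

traj = np.sin(np.linspace(0, np.pi, 50))       # a smooth 50-step trajectory
rec = decompress(compress(traj), 50)
print(np.abs(traj - rec).max())                # small reconstruction error
```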
arXiv Detail & Related papers (2025-01-16T18:57:04Z) - ScaleFlow++: Robust and Accurate Estimation of 3D Motion from Video [26.01796507893086]
This paper proposes a 3D motion perception method called ScaleFlow++ that generalizes easily.
With just a pair of RGB images, ScaleFlow++ can robustly estimate optical flow and motion-in-depth (MID).
On KITTI, ScaleFlow++ achieved the best monocular scene flow estimation performance, reducing SF-all from 6.21 to 5.79.
arXiv Detail & Related papers (2024-09-16T11:59:27Z) - ActionFlow: Equivariant, Accurate, and Efficient Policies with Spatially Symmetric Flow Matching [20.20511152176522]
ActionFlow is a policy class that integrates spatial symmetry inductive biases.
On the representation level, ActionFlow introduces an SE(3) Invariant Transformer architecture.
For action generation, ActionFlow leverages Flow Matching, a state-of-the-art deep generative model.
arXiv Detail & Related papers (2024-09-06T19:30:36Z) - GMFlow: Learning Optical Flow via Global Matching [124.57850500778277]
We propose GMFlow, a framework for learning optical flow estimation.
It consists of three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for global feature matching, and a self-attention layer for flow propagation.
Our new framework outperforms 32-iteration RAFT on the challenging Sintel benchmark.
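The global-matching recipe in this entry translates almost directly into code: a dense correlation, a softmax over all target locations, and an expected match position. The sketch below is illustrative, not the released implementation.

```python
# Sketch of global matching for optical flow: compare every source feature
# with every target feature, softmax over all target locations, and read the
# flow as the offset between the expected match and the source grid.

import torch

def global_matching_flow(f_src, f_tgt):
    # f_src, f_tgt: (H, W, C) feature maps from the Transformer encoder
    H, W, C = f_src.shape
    corr = torch.einsum("ijc,klc->ijkl", f_src, f_tgt) / C**0.5
    prob = corr.reshape(H, W, H * W).softmax(dim=-1)   # match distribution
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    grid = torch.stack([xs, ys], -1).reshape(H * W, 2)  # all target coords
    matched = prob @ grid                               # expected match, (H, W, 2)
    return matched - grid.reshape(H, W, 2)              # flow = match - source
```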
arXiv Detail & Related papers (2021-11-26T18:59:56Z)