Mean-Flow based One-Step Vision-Language-Action
- URL: http://arxiv.org/abs/2603.01469v1
- Date: Mon, 02 Mar 2026 05:30:30 GMT
- Title: Mean-Flow based One-Step Vision-Language-Action
- Authors: Yang Chen, Xiaoguang Ma, Bin Zhao
- Abstract summary: Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks. However, they are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. We propose a Mean-Flow based One-Step VLA approach that resolves the noise-induced issues in the action generation process.
- Score: 15.497933767026568
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these notable achievements, their practical applications are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this critical bottleneck, we propose a Mean-Flow based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly enhances generation efficiency and enables one-step action generation. Real-world robotic experiments show that the generation speed of the proposed Mean-Flow based One-Step VLA is 8.7 times and 83.9 times faster than that of SmolVLA and Diffusion Policy, respectively. These results underscore its great potential as a high-efficiency backbone for VLA-based robotic manipulation.
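The abstract does not spell out the sampling rule, but in the Mean-Flow formulation from the generative-modeling literature the network predicts an average velocity u(a_t, r, t) over an interval rather than an instantaneous one, which yields a sample in a single step via a_0 = a_1 - u(a_1, 0, 1). A minimal sketch under that assumption; `MeanFlowNet` and its conditioning interface are hypothetical stand-ins, not the paper's architecture:

```python
# Hedged sketch of one-step action generation with a mean-flow model.
import torch
import torch.nn as nn

class MeanFlowNet(nn.Module):
    """Predicts the average velocity u(a_t, r, t) over the interval [r, t]."""
    def __init__(self, action_dim: int, cond_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(action_dim + cond_dim + 2, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_t, cond, r, t):
        # r and t are scalar tensors, broadcast to the batch as extra inputs
        rt = torch.stack([r, t]).expand(a_t.shape[0], 2)
        return self.net(torch.cat([a_t, cond, rt], dim=-1))

@torch.no_grad()
def one_step_action(model, cond, action_dim):
    """One-step sampling: a_0 = a_1 - (1 - 0) * u(a_1, r=0, t=1)."""
    a_1 = torch.randn(cond.shape[0], action_dim)            # noise endpoint
    u = model(a_1, cond, torch.tensor(0.0), torch.tensor(1.0))
    return a_1 - u                                          # action chunk estimate
```

Only the inference path is shown; training would regress u against the mean-flow identity on noised action chunks.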
Related papers
- Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation [95.89924101984566]
We introduce OptimusVLA, a dual-memory VLA framework with a Global Prior Memory (GPM) and a Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories (see the sketch after this entry). LCM injects a learned consistency constraint that enforces temporal coherence and trajectory smoothness.
arXiv Detail & Related papers (2026-02-22T15:39:34Z)
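A hedged reading of GPM: keep a memory of (trajectory embedding, action chunk) pairs and initialize generation from the nearest stored chunk instead of Gaussian noise. The class name and the cosine-similarity retrieval rule below are illustrative assumptions, not the paper's API:

```python
# Sketch of a "Global Prior Memory": retrieval-based initialization that
# replaces the Gaussian-noise starting point with a stored action prior.
import numpy as np

class GlobalPriorMemory:
    def __init__(self, task_embeddings: np.ndarray, action_chunks: np.ndarray):
        # task_embeddings: (N, E), one embedding per stored trajectory
        # action_chunks:   (N, T, D), the action chunk paired with each embedding
        self.keys = task_embeddings / np.linalg.norm(
            task_embeddings, axis=1, keepdims=True)
        self.values = action_chunks

    def retrieve(self, query: np.ndarray) -> np.ndarray:
        """Return the action chunk of the most similar stored trajectory."""
        q = query / np.linalg.norm(query)
        idx = int(np.argmax(self.keys @ q))    # cosine similarity
        return self.values[idx]                # used instead of N(0, I)
```

The retrieved chunk then seeds the flow, so refinement starts near a plausible solution rather than from pure noise.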
- Action-to-Action Flow Matching [25.301629044539325]
Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. We propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action (sketched below). A2A enables high-quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations.
arXiv Detail & Related papers (2026-02-07T02:39:49Z)
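Read the same way, A2A's shift from random sampling to initialization informed by the previous action suggests integrating the flow starting from the last executed chunk; a one-Euler-step sketch with a hypothetical velocity predictor `flow_net`:

```python
# Sketch of the action-to-action idea: start from the previous chunk rather
# than noise, so a single Euler step can reach the new chunk. `flow_net` is
# a hypothetical predictor of the velocity field, not the paper's API.
import numpy as np

def a2a_step(flow_net, prev_chunk: np.ndarray, obs: np.ndarray) -> np.ndarray:
    """One Euler step from the previous action chunk toward the new one."""
    t0 = 0.0                              # previous chunk acts as the source
    v = flow_net(prev_chunk, obs, t0)     # predicted velocity at the source
    return prev_chunk + (1.0 - t0) * v    # integrate across the full interval
```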
- VLA-RAIL: A Real-Time Asynchronous Inference Linker for VLA Models and Robots [5.308743386891208]
Vision-Language-Action (VLA) models have achieved remarkable breakthroughs in robotics, yet the strategy used to fuse a queue of successive action chunks has a profound impact on overall performance. Existing methods suffer from jitter, stalling, or even pauses in robotic action execution. This paper introduces VLA-RAIL, a novel framework that runs model inference and robot motion control asynchronously (the decoupling is sketched below).
arXiv Detail & Related papers (2025-12-31T06:59:42Z)
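The decoupling the abstract describes, slow inference and fast control running at different rates, can be sketched with a single-slot queue. The chunk-fusion strategy that is VLA-RAIL's actual contribution is not specified in the abstract, so `policy` and `robot` below are stand-ins:

```python
# Sketch of asynchronous inference and control: a background thread
# refreshes action chunks while a fixed-rate loop executes them.
import queue
import threading
import time

chunks: queue.Queue = queue.Queue(maxsize=1)  # holds only the freshest chunk

def inference_worker(policy, get_obs):
    while True:
        chunk = policy(get_obs())                  # slow path: model inference
        try:
            chunks.put_nowait(chunk)
        except queue.Full:                         # drop the stale chunk
            chunks.get_nowait()
            chunks.put_nowait(chunk)

def control_loop(robot, hz: float = 50.0):
    current, step = None, 0
    while True:
        try:
            current, step = chunks.get_nowait(), 0  # switch to the newest chunk
        except queue.Empty:
            pass                                    # keep executing the old one
        if current is not None and step < len(current):
            robot.send(current[step])               # fast path: fixed-rate control
            step += 1
        time.sleep(1.0 / hz)

# usage: threading.Thread(target=inference_worker, args=(policy, get_obs),
#                         daemon=True).start(); control_loop(robot)
```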
- FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization [61.10456021136654]
We introduce FASTer, a unified framework for efficient and general robot learning. FASTerVQ encodes action chunks as single-channel images, capturing global temporal dependencies while maintaining a high compression ratio (tokenization sketched below). FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance.
arXiv Detail & Related papers (2025-12-04T16:21:38Z)
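A plausible sketch of FASTerVQ's central move: treat an action chunk of T timesteps and D dimensions as a 1-channel T x D image, encode it convolutionally, and map each latent to its nearest codebook entry. The encoder layout and codebook sizes are illustrative assumptions:

```python
# Sketch of vector-quantized tokenization of action chunks viewed as images.
import torch
import torch.nn as nn

class ActionVQTokenizer(nn.Module):
    def __init__(self, n_codes: int = 512, code_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(   # (B, 1, T, D) -> (B, code_dim, T/4, D/4)
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, code_dim, 3, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(n_codes, code_dim)

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        # chunk: (B, T, D) -> single-channel "image" (B, 1, T, D)
        z = self.encoder(chunk.unsqueeze(1))
        z = z.flatten(2).transpose(1, 2)            # (B, L, code_dim)
        # nearest codebook entry per latent = the discrete action token
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        return torch.cdist(z, codes).argmin(dim=-1)  # (B, L) token ids
```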
- Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks (selection loop sketched below). Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it yields significant computational savings.
arXiv Detail & Related papers (2025-12-02T14:42:54Z)
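Gradient-free verification can be sketched as best-of-K selection under a pseudo-count score, preferring in-distribution chunks per the anti-exploration principle; `policy` and `pseudo_count` are hypothetical callables, not TACO's actual interfaces:

```python
# Sketch of test-time scaling with a pseudo-count verifier: sample K
# candidate action chunks and execute the most in-distribution one.
import numpy as np

def best_of_k(policy, pseudo_count, obs, k: int = 8) -> np.ndarray:
    candidates = [policy(obs) for _ in range(k)]       # K sampled chunks
    scores = [pseudo_count(obs, c) for c in candidates]
    return candidates[int(np.argmax(scores))]          # highest pseudo-count wins
```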
- dVLA: Diffusion Vision-Language-Action Model with Multimodal Chain-of-Thought [66.78110237549087]
Vision-Language-Action (VLA) models are emerging as a next-generation paradigm for robotics. We introduce dVLA, a diffusion-based VLA that unifies visual perception, language reasoning, and robotic control in a single system.
arXiv Detail & Related papers (2025-09-30T02:36:11Z)
- NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows [75.70583906344815]
Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions. We present NinA, a fast and expressive alternative to diffusion-based decoders for Vision-Language-Action (VLA) models (a one-pass decoder is sketched after this entry).
arXiv Detail & Related papers (2025-08-23T00:02:15Z)
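The abstract does not give NinA's architecture, but a normalizing-flow decoder in this spirit maps noise to an action in one forward pass through invertible coupling layers, in contrast to iterative denoising. A minimal sketch with illustrative layer sizes:

```python
# Sketch of a conditional normalizing-flow action decoder built from
# affine coupling layers; one forward pass yields a sample.
import torch
import torch.nn as nn

class Coupling(nn.Module):
    def __init__(self, dim: int, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z, cond):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        scale, shift = self.net(torch.cat([z1, cond], -1)).chunk(2, -1)
        z2 = z2 * torch.exp(torch.tanh(scale)) + shift  # invertible affine map
        return torch.cat([z1, z2], -1)

class NFActionDecoder(nn.Module):
    def __init__(self, action_dim: int, cond_dim: int, n_layers: int = 4):
        super().__init__()
        self.action_dim = action_dim
        self.layers = nn.ModuleList(
            Coupling(action_dim, cond_dim) for _ in range(n_layers))

    @torch.no_grad()
    def sample(self, cond):
        z = torch.randn(cond.shape[0], self.action_dim)
        for layer in self.layers:
            z = layer(z, cond).flip(1)  # reverse so the other half is transformed next
        return z
```

Only sampling is shown; training would maximize the exact log-likelihood available through the inverse transforms.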
- SP-VLA: A Joint Model Scheduling and Token Pruning Approach for VLA Model Acceleration [70.72227437717467]
Vision-Language-Action (VLA) models have attracted increasing attention for their strong control capabilities, but their high computational cost and low execution frequency make them ill-suited to real-time tasks such as robotic manipulation and autonomous navigation. We propose SP-VLA, a unified framework that accelerates VLA models by jointly scheduling models and pruning tokens (pruning sketched below).
arXiv Detail & Related papers (2025-06-15T05:04:17Z)
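The token-pruning half of the recipe can be sketched as keeping the top-k visual tokens under some saliency score; the scoring input below (a per-token saliency vector, e.g. attention mass) is an illustrative assumption, not SP-VLA's exact criterion:

```python
# Sketch of saliency-based token pruning before the expensive layers.
import torch

def prune_tokens(tokens: torch.Tensor, saliency: torch.Tensor,
                 keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (B, N, D); saliency: (B, N). Keeps the top-k tokens per sample."""
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = saliency.topk(k, dim=1).indices          # (B, k) token indices to keep
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
```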
- FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency [34.81668269819768]
We propose FreqPolicy to exploit temporal information in robotic manipulation. FreqPolicy first imposes frequency-consistency constraints on flow-based visuomotor policies (one possible form is sketched after this entry). We demonstrate its efficiency and effectiveness in real-world robotic scenarios at an inference frequency of 93.5 Hz.
arXiv Detail & Related papers (2025-06-10T14:12:53Z)
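One plausible reading of "frequency consistency constraints" is a penalty matching the low-frequency spectrum of a fast prediction against a reference trajectory; the cutoff and loss form below are assumptions, not the paper's exact objective:

```python
# Sketch of a frequency-domain consistency penalty on action trajectories.
import torch

def freq_consistency_loss(pred: torch.Tensor, ref: torch.Tensor,
                          n_low: int = 4) -> torch.Tensor:
    """pred, ref: (B, T, D) trajectories; compares low-frequency bins only."""
    fp = torch.fft.rfft(pred, dim=1)[:, :n_low]   # low-frequency coefficients
    fr = torch.fft.rfft(ref, dim=1)[:, :n_low]
    return (fp - fr).abs().pow(2).mean()          # spectral L2 consistency
```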