VITA: Vision-to-Action Flow Matching Policy
- URL: http://arxiv.org/abs/2507.13231v1
- Date: Thu, 17 Jul 2025 15:41:57 GMT
- Title: VITA: Vision-to-Action Flow Matching Policy
- Authors: Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani
- Abstract summary: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules. VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks.
- Score: 25.146609391293087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present VITA, a Vision-To-Action flow matching policy that evolves latent visual representations into latent actions for visuomotor control. Traditional flow matching and diffusion policies sample from standard source distributions (e.g., Gaussian noise) and require additional conditioning mechanisms like cross-attention to condition action generation on visual information, creating time and space overheads. VITA proposes a novel paradigm that treats latent images as the flow source, learning an inherent mapping from vision to action while eliminating separate conditioning modules and preserving generative modeling capabilities. Learning flows between fundamentally different modalities like vision and action is challenging due to sparse action data lacking semantic structures and dimensional mismatches between high-dimensional visual representations and raw actions. We address this by creating a structured action latent space via an autoencoder as the flow matching target, up-sampling raw actions to match visual representation shapes. Crucially, we supervise flow matching with both encoder targets and final action outputs through flow latent decoding, which backpropagates action reconstruction loss through sequential flow matching ODE solving steps for effective end-to-end learning. Implemented as simple MLP layers, VITA is evaluated on challenging bi-manual manipulation tasks on the ALOHA platform, including 5 simulation and 2 real-world tasks. Despite its simplicity, MLP-only VITA outperforms or matches state-of-the-art generative policies while reducing inference latency by 50-130% compared to conventional flow matching policies requiring different conditioning mechanisms or complex architectures. To our knowledge, VITA is the first MLP-only flow matching policy capable of solving complex bi-manual manipulation tasks like those in ALOHA benchmarks.
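The abstract describes a concrete pipeline: encode the observation into a visual latent, use that latent as the flow source, integrate an MLP velocity field toward an action latent, and supervise with both the encoder target and the actions recovered by decoding the integrated latent. The block below is a minimal PyTorch sketch of that idea under several assumptions: the latent dimension, action-chunk length, toy action autoencoder, and plain Euler solver are all illustrative placeholders, not the paper's implementation, and visual latents are stood in by random tensors rather than a real frozen vision encoder.

```python
# Minimal vision-to-action flow matching sketch in the spirit of VITA.
# All dimensions, module names, and the Euler solver are assumptions for illustration.
import torch
import torch.nn as nn

LATENT_DIM = 512   # assumed shared size of visual and action latents
ACTION_DIM = 14    # e.g. bi-manual joint commands (assumption)
CHUNK_LEN = 16     # action-chunk length (assumption)
ODE_STEPS = 5      # Euler steps for the flow ODE (assumption)


class ActionAutoencoder(nn.Module):
    """Lifts a raw action chunk into a structured latent matching the vision latent shape."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(CHUNK_LEN * ACTION_DIM, LATENT_DIM),
                                 nn.GELU(), nn.Linear(LATENT_DIM, LATENT_DIM))
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM, LATENT_DIM),
                                 nn.GELU(), nn.Linear(LATENT_DIM, CHUNK_LEN * ACTION_DIM))

    def encode(self, actions):                 # actions: (B, CHUNK_LEN, ACTION_DIM)
        return self.enc(actions.flatten(1))

    def decode(self, z):                       # z: (B, LATENT_DIM)
        return self.dec(z).view(-1, CHUNK_LEN, ACTION_DIM)


class VelocityMLP(nn.Module):
    """MLP velocity field v(x_t, t); no cross-attention conditioning module is used."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM + 1, 1024), nn.GELU(),
                                 nn.Linear(1024, 1024), nn.GELU(),
                                 nn.Linear(1024, LATENT_DIM))

    def forward(self, x, t):                   # x: (B, LATENT_DIM), t: (B, 1)
        return self.net(torch.cat([x, t], dim=-1))


def integrate(velocity, z_v, steps=ODE_STEPS):
    """Euler integration of the flow ODE from the visual latent toward an action latent."""
    x, dt = z_v, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt, device=x.device)
        x = x + dt * velocity(x, t)
    return x


def training_step(velocity, ae, z_v, actions):
    """Flow matching toward encoder targets plus decoded action reconstruction."""
    z_a = ae.encode(actions)                   # flow target in the action latent space
    t = torch.rand(z_v.shape[0], 1, device=z_v.device)
    x_t = (1 - t) * z_v + t * z_a              # interpolate between vision and action latents
    v_target = z_a - z_v                       # constant-velocity (rectified-flow-style) target
    loss_fm = ((velocity(x_t, t) - v_target) ** 2).mean()

    z_hat = integrate(velocity, z_v)           # differentiable roll-out of the ODE
    loss_act = ((ae.decode(z_hat) - actions) ** 2).mean()  # backprops through solver steps
    return loss_fm + loss_act


if __name__ == "__main__":
    ae, velocity = ActionAutoencoder(), VelocityMLP()
    z_v = torch.randn(8, LATENT_DIM)           # stand-in for frozen visual-encoder latents
    actions = torch.randn(8, CHUNK_LEN, ACTION_DIM)
    loss = training_step(velocity, ae, z_v, actions)
    loss.backward()
    print(float(loss))
```

At test time the same pieces would produce an action chunk via `ae.decode(integrate(velocity, z_v))`: the flow transports the visual latent to an action latent, which the decoder maps back to raw actions.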
Related papers
- SCALAR: Scale-wise Controllable Visual Autoregressive Learning [15.775596699630633]
We present SCALAR, a controllable generation method based on Visual Autoregressive (VAR) modeling. We leverage a pretrained image encoder to extract semantic control signal encodings, which are projected into scale-specific representations and injected into the corresponding layers of the VAR backbone. Building on SCALAR, we develop SCALAR-Uni, a unified extension that aligns multiple control modalities into a shared latent space, supporting flexible multi-conditional guidance in a single model.
arXiv Detail & Related papers (2025-07-26T13:23:08Z)
- Dynamic Context-oriented Decomposition for Task-aware Low-rank Adaptation with Less Forgetting and Faster Convergence [131.41894248194995]
We propose context-oriented decomposition adaptation (CorDA), a novel method that initializes adapters in a task-aware manner. Thanks to the task awareness, our method enables two optional adaptation modes: knowledge-preserved mode (KPM) and instruction-previewed mode (IPM).
arXiv Detail & Related papers (2025-06-16T07:55:14Z)
- FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities [76.46448367752944]
Multimodal large language models (MLLMs) unify visual understanding and image generation within a single framework. Most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development. We introduce FUDOKI, a unified multimodal model purely based on discrete flow matching.
arXiv Detail & Related papers (2025-05-26T15:46:53Z)
- HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model [54.64088247291416]
A fundamental objective of manipulation policy design is to enable robots to comprehend human instructions, reason about scene cues, and execute generalized actions in dynamic environments. Recent autoregressive vision-language-action (VLA) methods inherit common-sense reasoning capabilities from vision-language models (VLMs) for next action-token prediction. We introduce HybridVLA, a unified framework that absorbs the continuous nature of diffusion-based actions and the contextual reasoning of autoregression.
arXiv Detail & Related papers (2025-03-13T17:59:52Z)
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation. It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL). It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z)
- MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning [8.61492882526007]
In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges for sample efficiency.
We introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking.
Our evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency.
arXiv Detail & Related papers (2024-09-02T18:57:53Z)
- MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used to mitigate the need of Vision Transformer networks for very large fully-annotated datasets.
We propose a single-stage and standalone method, MOCA, which unifies both desired properties.
We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
- Adapting Self-Supervised Vision Transformers by Probing Attention-Conditioned Masking Consistency [7.940705941237998]
We propose PACMAC, a simple two-stage adaptation algorithm for self-supervised ViTs.
Our simple approach leads to consistent performance gains over competing methods.
arXiv Detail & Related papers (2022-06-16T14:46:10Z)
- COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable overall performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art results on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z)
- Goal-Conditioned End-to-End Visuomotor Control for Versatile Skill Primitives [89.34229413345541]
We propose a conditioning scheme which avoids pitfalls by learning the controller and its conditioning in an end-to-end manner.
Our model predicts complex action sequences based directly on a dynamic image representation of the robot motion.
We report significant improvements in task success over representative MPC and IL baselines.
arXiv Detail & Related papers (2020-03-19T15:04:37Z)