VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation
- URL: http://arxiv.org/abs/2508.01622v2
- Date: Thu, 02 Oct 2025 13:01:54 GMT
- Title: VFP: Variational Flow-Matching Policy for Multi-Modal Robot Manipulation
- Authors: Xuanran Zhai, Qianyou Zhao, Qiaojun Yu, Ce Hao
- Abstract summary: Variational Flow-Matching Policy (VFP) is a flow-matching policy that captures both task-level and trajectory-level multi-modality. VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation.
- Score: 3.986404588605909
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flow-matching-based policies have recently emerged as a promising approach for learning-based robot manipulation, offering significant acceleration in action sampling compared to diffusion-based policies. However, conventional flow-matching methods struggle with multi-modality, often collapsing to averaged or ambiguous behaviors in complex manipulation tasks. To address this, we propose the Variational Flow-Matching Policy (VFP), which introduces a variational latent prior for mode-aware action generation and effectively captures both task-level and trajectory-level multi-modality. VFP further incorporates Kantorovich Optimal Transport (K-OT) for distribution-level alignment and utilizes a Mixture-of-Experts (MoE) decoder for mode specialization and efficient inference. We comprehensively evaluate VFP on 41 simulated tasks and 3 real-robot tasks, demonstrating its effectiveness and sampling efficiency in both simulated and real-world settings. Results show that VFP achieves a 49% relative improvement in task success rate over standard flow-based baselines in simulation, and further outperforms them on real-robot tasks, while still maintaining fast inference and a compact model size. More details are available on our project page: https://sites.google.com/view/varfp/
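To make the abstract's pipeline concrete, below is a minimal PyTorch sketch of how a VFP-style sampler could fit together: a reparameterized variational prior draws a mode-selection latent from the observation, a Mixture-of-Experts velocity field is gated by that latent, and a few Euler steps integrate the flow from Gaussian noise to an action. Every name (LatentPrior, MoEVelocity, sample_action), all layer sizes, the Gaussian prior, and the 10-step Euler integrator are illustrative assumptions, not the authors' released implementation; the K-OT alignment term and the training loss are omitted.

import torch
import torch.nn as nn

class LatentPrior(nn.Module):
    # Variational prior over a mode-selection latent z, conditioned on the
    # observation; a single reparameterized draw commits to one behavior mode.
    def __init__(self, obs_dim, latent_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_std = nn.Linear(128, latent_dim)

    def forward(self, obs):
        h = self.net(obs)
        mu = self.mu(h)
        return mu + self.log_std(h).exp() * torch.randn_like(mu)

class MoEVelocity(nn.Module):
    # Mixture-of-Experts velocity field v(a_t, t, obs, z); the latent gates
    # the experts so each expert can specialize on one mode.
    def __init__(self, act_dim, obs_dim, latent_dim, n_experts=4):
        super().__init__()
        in_dim = act_dim + 1 + obs_dim + latent_dim  # +1 for the time t
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, act_dim))
             for _ in range(n_experts)])
        self.gate = nn.Linear(latent_dim, n_experts)

    def forward(self, a_t, t, obs, z):
        x = torch.cat([a_t, t, obs, z], dim=-1)
        w = torch.softmax(self.gate(z), dim=-1)                    # (B, E)
        outs = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, A, E)
        return (outs * w.unsqueeze(1)).sum(dim=-1)

@torch.no_grad()
def sample_action(prior, velocity, obs, act_dim, steps=10):
    # Euler-integrate the flow from Gaussian noise a_0 to an action a_1,
    # conditioned throughout on a single latent draw.
    z = prior(obs)
    a = torch.randn(obs.shape[0], act_dim)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0], 1), i * dt)
        a = a + dt * velocity(a, t, obs, z)
    return a

A single latent draw per rollout is what would prevent the mode-averaging the abstract describes: the velocity field never has to compromise between modes because z has already picked one.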
Related papers
- PRISM: Performer RS-IMLE for Single-pass Multisensory Imitation Learning [51.24484551729328]
We introduce PRISM, a single-pass policy based on a batch-global rejection-sampling variant of IMLE. PRISM couples a temporal multisensory encoder with a linear-attention generator using a Performer architecture. We demonstrate the efficacy of PRISM on a diverse real-world hardware suite, including loco-manipulation using a Unitree Go2 with a 7-DoF D1 arm and tabletop manipulation with a UR5 manipulator.
arXiv Detail & Related papers (2026-02-02T17:57:37Z) - Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach [78.4812458793128]
We propose TACO, a test-time-scaling framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it yields significant computational benefits.
arXiv Detail & Related papers (2025-12-02T14:42:54Z) - DM1: MeanFlow with Dispersive Regularization for 1-Step Robotic Manipulation [23.382067451764396]
Flow-based generative models have emerged as a promising solution to learning distributions of actions. Existing flow-based policies suffer from representation collapse, the inability to distinguish similar visual representations, leading to failures in precise manipulation tasks. We propose DM1, a novel flow matching framework that integrates dispersive regularization into MeanFlow to prevent collapse while maintaining one-step efficiency.
arXiv Detail & Related papers (2025-10-09T07:12:20Z) - Compose Your Policies! Improving Diffusion-based or Flow-based Robot Policies via Test-time Distribution-level Composition [52.232968183793986]
General Policy Composition (GPC) is a training-free method that enhances performance by combining the distributional scores of multiple pre-trained policies. GPC consistently improves performance and adaptability across a diverse set of tasks.
arXiv Detail & Related papers (2025-10-01T16:05:53Z) - Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance [63.33213516925946]
We introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
arXiv Detail & Related papers (2025-09-02T07:51:59Z) - Decision Flow Policy Optimization [53.825268058199825]
We show that generative models can effectively model complex multi-modal action distributions and achieve superior robotic control in continuous action spaces. Previous methods usually adopt generative models as behavior models to fit state-conditioned action distributions from datasets. We propose Decision Flow, a unified framework that integrates multi-modal action distribution modeling and policy optimization.
arXiv Detail & Related papers (2025-05-26T03:42:20Z) - GFlowVLM: Enhancing Multi-step Reasoning in Vision-Language Models with Generative Flow Networks [4.851402232145819]
We introduce GFlowVLM, a framework that fine-tunes Vision-Language Models (VLMs) using Generative Flow Networks (GFlowNets). GFlowVLM models the environment as a non-Markovian decision process, allowing it to capture long-term dependencies essential for real-world applications. Empirical results demonstrate the effectiveness of GFlowVLM on complex tasks such as card games (NumberLine, BlackJack) and embodied planning tasks (ALFWorld).
arXiv Detail & Related papers (2025-03-09T08:38:10Z) - Training-Free Graph Filtering via Multimodal Feature Refinement for Extremely Fast Multimodal Recommendation [8.462186629861046]
We propose MultiModal-Graph Filtering (MM-GF), a training-free method based on the notion of graph filtering (GF), for efficient and accurate multimodal recommendations. Experiments on real-world benchmark datasets demonstrate that MM-GF not only improves recommendation accuracy by up to 13.35% compared to the best competitor but also dramatically reduces computational costs, achieving a runtime of less than 10 seconds.
arXiv Detail & Related papers (2025-03-06T13:00:53Z) - IMLE Policy: Fast and Sample Efficient Visuomotor Policy Learning via Implicit Maximum Likelihood Estimation [3.7584322469996896]
IMLE Policy is a novel behaviour cloning approach based on Implicit Maximum Likelihood Estimation (IMLE); a minimal sketch of the core IMLE objective appears after this list. It excels in low-data regimes, effectively learning from minimal demonstrations and requiring 38% less data on average to match the performance of baseline methods in learning complex multi-modal behaviours. We validate our approach across diverse manipulation tasks in simulated and real-world environments, showcasing its ability to capture complex behaviours under data constraints.
arXiv Detail & Related papers (2025-02-17T23:22:49Z) - Flow: Modularized Agentic Workflow Automation [53.073598156915615]
Multi-agent frameworks powered by large language models (LLMs) have demonstrated great success in automated planning and task execution. However, the effective adjustment of agentic workflows during execution has not been well studied. In this paper, we define an activity-on-vertex (AOV) graph, which allows continuous workflow refinement by agents. Our proposed multi-agent framework achieves efficient concurrent execution of subtasks, effective goal achievement, and enhanced error tolerance.
arXiv Detail & Related papers (2025-01-14T04:35:37Z) - VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction [21.061630022134203]
In-Context Operator Networks (ICONs) learn operators across diverse partial differential equations using few-shot, in-context learning. Existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations.
arXiv Detail & Related papers (2024-11-25T03:25:17Z) - On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z) - PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation [68.17081518640934]
We propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R).
PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module.
Our PIVOT-R outperforms state-of-the-art open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks.
arXiv Detail & Related papers (2024-10-14T11:30:18Z) - MASSFormer: Mobility-Aware Spectrum Sensing using Transformer-Driven Tiered Structure [3.6194127685460553]
We develop MASSFormer, a cooperative sensing method built on a mobility-aware, transformer-driven tiered structure.
Our method considers a dynamic scenario involving mobile primary users (PUs) and secondary users (SUs).
The proposed method is tested under imperfect reporting channel scenarios to show robustness.
arXiv Detail & Related papers (2024-09-26T05:25:25Z) - Affordance-based Robot Manipulation with Flow Matching [7.51335919610328]
We present a framework for assistive robot manipulation. We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model. We learn robot action trajectories guided by affordances with a supervised flow-matching method.
arXiv Detail & Related papers (2024-09-02T09:11:28Z) - Riemannian Flow Matching Policy for Robot Motion Learning [5.724027955589408]
We introduce Riemannian Flow Matching Policy (RFMP), a novel model for learning and synthesizing robot visuomotor policies.
We show that RFMP provides smoother action trajectories with significantly lower inference times.
arXiv Detail & Related papers (2024-03-15T20:48:41Z) - Task-Distributionally Robust Data-Free Meta-Learning [99.56612787882334]
Data-Free Meta-Learning (DFML) aims to efficiently learn new tasks by leveraging multiple pre-trained models without requiring their original training data.
For the first time, we reveal two major challenges hindering their practical deployments: Task-Distribution Shift (TDS) and Task-Distribution Corruption (TDC).
arXiv Detail & Related papers (2023-11-23T15:46:54Z) - Fast Trainable Projection for Robust Fine-Tuning [36.51660287722338]
Robust fine-tuning aims to achieve competitive in-distribution (ID) performance while maintaining out-of-distribution (OOD) robustness.
Projection-based fine-tuning has been successfully used in robust fine-tuning.
Fast Trainable Projection is a new projection-based fine-tuning algorithm.
arXiv Detail & Related papers (2023-10-29T22:52:43Z) - Dynamic Multimodal Fusion [8.530680502975095]
Dynamic multimodal fusion (DynMM) is a new approach that adaptively fuses multimodal data and generates data-dependent forward paths during inference.
Results on various multimodal tasks demonstrate the efficiency and wide applicability of our approach.
arXiv Detail & Related papers (2022-03-31T21:35:13Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - Fast Variational AutoEncoder with Inverted Multi-Index for Collaborative Filtering [59.349057602266]
Variational AutoEncoder (VAE) has been extended as a representative nonlinear method for collaborative filtering.
We propose to decompose the inner-product-based softmax probability based on the inverted multi-index.
FastVAE can outperform the state-of-the-art baselines in terms of both sampling quality and efficiency.
arXiv Detail & Related papers (2021-09-13T08:31:59Z)
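The IMLE objective referenced in the "IMLE Policy" and "PRISM" entries above is compact enough to sketch directly: for every demonstrated action, generate several candidates from random latents and penalize only the distance to the nearest one, so no mode of the data can be averaged away. This is a minimal, unconditional sketch; observation conditioning, PRISM's batch-global rejection-sampling variant, and all names and sizes here are illustrative assumptions.

import torch
import torch.nn as nn

def imle_loss(generator, actions, n_samples=8, latent_dim=16):
    # IMLE: match every data point to its nearest generated sample, so the
    # generator must cover all modes of the demonstrations.
    B, act_dim = actions.shape
    z = torch.randn(B, n_samples, latent_dim)
    cand = generator(z.view(B * n_samples, latent_dim)).view(B, n_samples, act_dim)
    dists = ((cand - actions.unsqueeze(1)) ** 2).sum(dim=-1)  # (B, n_samples)
    # Gradient flows only through the nearest candidate per data point.
    return dists.min(dim=1).values.mean()

# Hypothetical usage with a toy generator mapping latents to 7-DoF actions:
gen = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 7))
loss = imle_loss(gen, actions=torch.randn(32, 7))
loss.backward()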