Action-to-Action Flow Matching
- URL: http://arxiv.org/abs/2602.07322v1
- Date: Sat, 07 Feb 2026 02:39:49 GMT
- Title: Action-to-Action Flow Matching
- Authors: Jindou Jia, Gen Li, Xiangyu Chen, Tuo An, Yuxuan Hu, Jingliang Li, Xinying Guo, Jianfei Yang,
- Abstract summary: Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process.<n>We propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action.<n>A2A enables high-quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations.
- Score: 25.301629044539325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based policies have recently achieved remarkable success in robotics by formulating action prediction as a conditional denoising process. However, the standard practice of sampling from random Gaussian noise often requires multiple iterative steps to produce clean actions, leading to high inference latency that incurs a major bottleneck for real-time control. In this paper, we challenge the necessity of uninformed noise sampling and propose Action-to-Action flow matching (A2A), a novel policy paradigm that shifts from random sampling to initialization informed by the previous action. Unlike existing methods that treat proprioceptive action feedback as static conditions, A2A leverages historical proprioceptive sequences, embedding them into a high-dimensional latent space as the starting point for action generation. This design bypasses costly iterative denoising while effectively capturing the robot's physical dynamics and temporal continuity. Extensive experiments demonstrate that A2A exhibits high training efficiency, fast inference speed, and improved generalization. Notably, A2A enables high-quality action generation in as few as a single inference step (0.56 ms latency), and exhibits superior robustness to visual perturbations and enhanced generalization to unseen configurations. Lastly, we also extend A2A to video generation, demonstrating its broader versatility in temporal modeling. Project site: https://lorenzo-0-0.github.io/A2A_Flow_Matching.
Related papers
- Mean-Flow based One-Step Vision-Language-Action [15.497933767026568]
FlowMatching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks.<n>They are constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations.<n>We propose a Mean-Flow based One-Step VLA approach, which resolves the noise-induced issues in the action generation process.
arXiv Detail & Related papers (2026-03-02T05:30:30Z) - Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation [95.89924101984566]
We introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM)<n>GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories.<n>LCM injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory.
arXiv Detail & Related papers (2026-02-22T15:39:34Z) - STEP: Warm-Started Visuomotor Policies with Spatiotemporal Consistency Prediction [16.465783114087223]
iterative denoising leads to substantial inference latency, limiting control frequency in real-time closed-loop systems.<n>We propose STEP, a lightweighttemporal consistency prediction mechanism to construct high-quality warm-start actions.<n> STEP with 2 steps can achieve an average 21.6% and 27.5% higher success rate than BRIDGER and DDIM on the RoboMimic benchmark and real-world tasks.
arXiv Detail & Related papers (2026-02-09T03:50:40Z) - SynCast: Synergizing Contradictions in Precipitation Nowcasting via Diffusion Sequential Preference Optimization [62.958457694151384]
We introduce preference optimization into precipitation nowcasting for the first time, motivated by the success of reinforcement learning from human feedback in large language models.<n>In the first stage, the framework focuses on reducing FAR, training the model to effectively suppress false alarms.
arXiv Detail & Related papers (2025-10-22T16:11:22Z) - OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation.<n>The resulting discrete tokenization shortens the training sequence by 6.8$times$, and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z) - NinA: Normalizing Flows in Action. Training VLA Models with Normalizing Flows [75.70583906344815]
Diffusion models have been widely adopted as action decoders due to their ability to model complex, multimodal action distributions.<n>We present NinA, a fast and expressive alternative to diffusion-based decoders for Vision-Language-Action (VLA) models.
arXiv Detail & Related papers (2025-08-23T00:02:15Z) - Real-Time Iteration Scheme for Diffusion Policy [23.124189676943757]
We introduce a novel approach inspired by the Real-Time Iteration (RTI) Scheme to accelerate inference.<n>We propose a scaling-based method to effectively handle discrete actions, such as grasping, in robotic manipulation.<n>The proposed scheme significantly reduces runtime computational costs without the need for distillation or policy redesign.
arXiv Detail & Related papers (2025-08-07T13:49:00Z) - Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection [43.49146665908238]
Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision.<n>We introduce a novel frequency-guided diffusion model with perturbation training.<n>We employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components.
arXiv Detail & Related papers (2024-12-04T05:43:53Z) - Bidirectional Decoding: Improving Action Chunking via Guided Test-Time Sampling [51.38330727868982]
We show how action chunking impacts the divergence between a learner and a demonstrator.<n>We propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop adaptation.<n>Our method boosts the performance of two state-of-the-art generative policies across seven simulation benchmarks and two real-world tasks.
arXiv Detail & Related papers (2024-08-30T15:39:34Z) - Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs)
GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z) - DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD.
Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.