FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens
- URL: http://arxiv.org/abs/2506.01583v1
- Date: Mon, 02 Jun 2025 12:13:51 GMT
- Title: FreqPolicy: Frequency Autoregressive Visuomotor Policy with Continuous Tokens
- Authors: Yiming Zhong, Yumeng Liu, Chuyang Xiao, Zemin Yang, Youzhuo Wang, Yufei Zhu, Ye Shi, Yujing Sun, Xinge Zhu, Yuexin Ma
- Abstract summary: We propose a novel paradigm for visuomotor policy learning that progressively models hierarchical frequency components. To further enhance precision, we introduce continuous latent representations that maintain smoothness and continuity in the action space. Our approach outperforms existing methods in both accuracy and efficiency.
- Score: 20.715024408481973
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning effective visuomotor policies for robotic manipulation is challenging, as it requires generating precise actions while maintaining computational efficiency. Existing methods remain unsatisfactory due to inherent limitations in the essential action representation and the basic network architectures. We observe that representing actions in the frequency domain captures the structured nature of motion more effectively: low-frequency components reflect global movement patterns, while high-frequency components encode fine local details. Additionally, robotic manipulation tasks of varying complexity demand different levels of modeling precision across these frequency bands. Motivated by this, we propose a novel paradigm for visuomotor policy learning that progressively models hierarchical frequency components. To further enhance precision, we introduce continuous latent representations that maintain smoothness and continuity in the action space. Extensive experiments across diverse 2D and 3D robotic manipulation benchmarks demonstrate that our approach outperforms existing methods in both accuracy and efficiency, showcasing the potential of a frequency-domain autoregressive framework with continuous tokens for generalized robotic manipulation.
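To make the hierarchy concrete: with a discrete cosine transform, keeping only the lowest coefficients of an action trajectory recovers its global shape, and adding higher bands restores fine detail. The sketch below is our illustration, not the authors' released code; the DCT, the four-band split, and the equal band edges are all assumptions.

```python
# Hedged sketch: progressive frequency-band reconstruction of an action
# trajectory (illustrative only; band count and edges are assumptions).
import numpy as np
from scipy.fft import dct, idct

def frequency_bands(actions: np.ndarray, n_bands: int = 4) -> list:
    """Reconstruct a (T, action_dim) trajectory from growing low-frequency prefixes."""
    T = actions.shape[0]
    coeffs = dct(actions, axis=0, norm="ortho")          # DCT-II along time
    edges = np.linspace(0, T, n_bands + 1).astype(int)   # equal coefficient splits
    recons = []
    for k in range(1, n_bands + 1):
        kept = np.zeros_like(coeffs)
        kept[: edges[k]] = coeffs[: edges[k]]            # keep lowest edges[k] freqs
        recons.append(idct(kept, axis=0, norm="ortho"))
    return recons                                        # coarse global motion -> full detail

t = np.linspace(0.0, 1.0, 64)
traj = np.stack([np.sin(2 * np.pi * t), 0.1 * np.sin(10 * np.pi * t)], axis=1)
for i, r in enumerate(frequency_bands(traj)):
    print(f"band {i}: max error {np.abs(r - traj).max():.4f}")  # shrinks per band
```

An autoregressive policy over this representation would emit the coefficients stage by stage, each stage conditioned on the coarser ones already produced; keeping those coefficients continuous rather than quantized is where the paper's continuous tokens come in.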
Related papers
- ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow [93.00917887667234]
This paper introduces a novel robot manipulation framework, i.e., ActionSink, to pave the way toward precise action estimations. As the name suggests, ActionSink reformulates the actions of robots as action-caused optical flows from videos, called "action flow". Our framework outperformed the prior SOTA on the LIBERO benchmark by 7.9% in success rate, and obtained nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.
arXiv Detail & Related papers (2025-08-05T08:46:17Z)
- Learning to Move in Rhythm: Task-Conditioned Motion Policies with Orbital Stability Guarantees [45.137864140049814]
We introduce Orbitally Stable Motion Primitives (OSMPs) - a framework that combines a learned diffeomorphic encoder with a supercritical Hopf bifurcation in latent space. We validate the proposed approach through extensive simulation and real-world experiments across a diverse range of robotic platforms.
arXiv Detail & Related papers (2025-07-12T17:10:03Z)
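For readers unfamiliar with the term: the supercritical Hopf bifurcation has a standard normal form whose trajectories, for mu > 0, converge to a limit cycle of radius sqrt(mu), which is the source of the orbital stability guarantee. A minimal sketch (the learned diffeomorphic encoder is omitted; mu, omega, and the Euler integrator are our choices):

```python
# Hedged sketch: latent dynamics of the supercritical Hopf normal form,
# dz/dt = (mu + i*omega) z - |z|^2 z. For mu > 0 every nonzero start
# converges to the circle |z| = sqrt(mu), i.e. an orbitally stable cycle.
import numpy as np

def hopf_step(z: complex, mu: float = 1.0, omega: float = 2.0, dt: float = 0.01) -> complex:
    return z + dt * ((mu + 1j * omega) * z - abs(z) ** 2 * z)  # Euler step

z = 0.1 + 0.0j                 # start near the unstable equilibrium at the origin
for _ in range(5000):
    z = hopf_step(z)
print(f"|z| after transient: {abs(z):.3f}  (limit cycle radius = 1.0)")
```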
- ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model [52.02220087880269]
We propose an extension of the ManiGaussian framework that improves bimanual manipulation by digesting multi-task scene dynamics through a hierarchical world model. Our method significantly outperforms the current state-of-the-art bimanual manipulation techniques by 20.2% in 10 simulated tasks, and achieves a 60% success rate on average in 9 challenging real-world tasks.
arXiv Detail & Related papers (2025-06-24T17:59:06Z)
- FreqPolicy: Efficient Flow-based Visuomotor Policy via Frequency Consistency [34.81668269819768]
We propose FreqPolicy to exploit temporal information in robotic manipulation. FreqPolicy first imposes frequency consistency constraints on flow-based visuomotor policies. We show efficiency and effectiveness in real-world robotic scenarios with an inference frequency of 93.5 Hz.
arXiv Detail & Related papers (2025-06-10T14:12:53Z)
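The summary does not spell the constraint out, so the following is only a plausible reading, not the paper's exact loss: penalize mismatch between the low-frequency content of the policy's predicted action chunk and of the expert chunk, so global motion is right even while fine detail is still being learned.

```python
# Hedged sketch of a frequency consistency loss (our guess at the idea,
# not the paper's formulation): compare only the first k frequency bins
# of predicted vs. expert action chunks.
import torch

def low_freq_consistency(pred: torch.Tensor, expert: torch.Tensor, k: int = 8) -> torch.Tensor:
    """pred, expert: (batch, T, action_dim) action chunks."""
    pf = torch.fft.rfft(pred, dim=1)[:, :k]     # low-frequency spectrum of prediction
    ef = torch.fft.rfft(expert, dim=1)[:, :k]   # low-frequency spectrum of expert
    return (pf - ef).abs().pow(2).mean()
```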
- Dynamic Manipulation of Deformable Objects in 3D: Simulation, Benchmark and Learning Strategy [88.8665000676562]
Prior methods often simplify the problem to low-speed or 2D settings, limiting their applicability to real-world 3D tasks. To mitigate data scarcity, we introduce a novel simulation framework and benchmark grounded in reduced-order dynamics. We propose Dynamics Informed Diffusion Policy (DIDP), a framework that integrates imitation pretraining with physics-informed test-time adaptation.
arXiv Detail & Related papers (2025-05-23T03:28:25Z)
- Towards Robust and Controllable Text-to-Motion via Masked Autoregressive Diffusion [33.9786226622757]
We propose a robust motion generation framework, MoMADiff, which combines masked modeling with diffusion processes to generate motion. Our model supports flexible user-provided specifications, enabling precise control over both the spatial and temporal aspects of motion synthesis. Our method consistently outperforms state-of-the-art models in motion quality, instruction fidelity, and adherence.
arXiv Detail & Related papers (2025-05-16T09:06:15Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences. Dita employs in-context conditioning, enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations. Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
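The denoising objective behind such policies is standard; below is a generic DDPM-style sketch in which Dita's causal transformer and raw visual-token conditioning are deliberately reduced to an MLP and a plain context vector (both our simplifications).

```python
# Hedged sketch: DDPM noise-prediction training on continuous action chunks.
# The backbone and conditioning are placeholders, not Dita's architecture.
import torch
import torch.nn as nn

T_STEPS = 100
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative noise schedule

class Denoiser(nn.Module):
    def __init__(self, act_dim: int = 7, horizon: int = 16, ctx_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim * horizon + ctx_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon),
        )

    def forward(self, noisy_flat, t, ctx):
        t_feat = t.float().unsqueeze(-1) / T_STEPS          # crude timestep embedding
        return self.net(torch.cat([noisy_flat, t_feat, ctx], dim=-1))

def ddpm_loss(model, actions, ctx):
    """actions: (B, horizon, act_dim) expert chunk; ctx: (B, ctx_dim) observation features."""
    x0 = actions.flatten(1)
    t = torch.randint(0, T_STEPS, (actions.shape[0],))
    noise = torch.randn_like(x0)
    ab = alphas_bar[t].unsqueeze(-1)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise         # forward diffusion
    return ((model(xt, t, ctx) - noise) ** 2).mean()        # predict injected noise
```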
- CAIMAN: Causal Action Influence Detection for Sample-efficient Loco-manipulation [17.94272840532448]
We present CAIMAN, a reinforcement learning framework that encourages robots to gain control over other entities in the environment. We empirically demonstrate CAIMAN's superior sample efficiency and adaptability to diverse scenarios in simulation.
arXiv Detail & Related papers (2025-02-02T16:16:53Z)
- Robotic World Model: A Neural Network Simulator for Robust Policy Optimization in Robotics [50.191655141020505]
This work advances model-based reinforcement learning by addressing the challenges of long-horizon prediction, error accumulation, and sim-to-real transfer. By providing a scalable and robust framework, the introduced methods pave the way for adaptive and efficient robotic systems in real-world applications.
arXiv Detail & Related papers (2025-01-17T10:39:09Z)
- FAST: Efficient Action Tokenization for Vision-Language-Action Models [98.15494168962563]
We propose FAST, a new compression-based tokenization scheme for robot actions based on the discrete cosine transform. Building on FAST, we release FAST+, a universal robot action tokenizer trained on 1M real robot action trajectories.
arXiv Detail & Related papers (2025-01-16T18:57:04Z)
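As a rough illustration of the mechanism (the uniform quantization scale below is our assumption, and the released tokenizer additionally compresses the integer stream, e.g. with byte-pair encoding):

```python
# Hedged sketch of DCT-based action tokenization in the spirit of FAST:
# transform, quantize, and emit integers as tokens. Illustrative only.
import numpy as np
from scipy.fft import dct, idct

def tokenize(actions: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """(T, action_dim) action chunk -> flat integer tokens."""
    coeffs = dct(actions, axis=0, norm="ortho")
    return np.round(coeffs * scale).astype(np.int32).ravel()

def detokenize(tokens: np.ndarray, T: int, dim: int, scale: float = 10.0) -> np.ndarray:
    coeffs = tokens.reshape(T, dim).astype(np.float64) / scale
    return idct(coeffs, axis=0, norm="ortho")

chunk = 0.05 * np.cumsum(np.random.randn(16, 7), axis=0)     # smooth 7-DoF trajectory
recon = detokenize(tokenize(chunk), *chunk.shape)
print("max round-trip error:", np.abs(recon - chunk).max())  # bounded by quantization
```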
- Diffusion Transformer Policy [48.50988753948537]
We propose a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, to model continuous end-effector actions. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large, diverse robot datasets.
arXiv Detail & Related papers (2024-10-21T12:43:54Z)
- Affordance-based Robot Manipulation with Flow Matching [6.863932324631107]
We present a framework for assistive robot manipulation. We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model. We learn robot action trajectories guided by affordances with a supervised flow matching method.
arXiv Detail & Related papers (2024-09-02T09:11:28Z)
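Supervised flow matching is compact enough to sketch. Below, a rectified-flow-style version in which the visual affordance model is reduced to a generic context vector and single actions stand in for full trajectories (both our simplifications):

```python
# Hedged sketch: conditional flow matching for action generation. The
# context vector stands in for the paper's affordance features.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, act_dim: int, ctx_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + ctx_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, x, t, ctx):
        return self.net(torch.cat([x, t, ctx], dim=-1))

def fm_loss(model, x1, ctx):
    """x1: (B, act_dim) expert actions; ctx: (B, ctx_dim) affordance features."""
    x0 = torch.randn_like(x1)            # noise endpoint
    t = torch.rand(x1.shape[0], 1)       # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # straight-line interpolant
    return ((model(xt, t, ctx) - (x1 - x0)) ** 2).mean()  # regress its velocity

model = VelocityNet(act_dim=7, ctx_dim=16)
fm_loss(model, torch.randn(32, 7), torch.randn(32, 16)).backward()
```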
- Unsupervised Learning of Effective Actions in Robotics [0.9374652839580183]
Current state-of-the-art action representations in robotics lack proper effect-driven learning of the robot's actions. We propose an unsupervised algorithm to discretize a continuous motion space and generate "action prototypes". We evaluate our method on a simulated stair-climbing reinforcement learning task.
arXiv Detail & Related papers (2024-04-03T13:28:52Z)
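As a toy version of that discretization step (k-means below is our stand-in; the paper's clustering is effect-driven, which plain k-means is not):

```python
# Hedged sketch: clustering logged continuous actions into discrete
# "action prototypes". k-means is a placeholder for the paper's method.
import numpy as np
from sklearn.cluster import KMeans

actions = np.random.uniform(-1.0, 1.0, size=(10_000, 4))  # logged continuous actions
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(actions)
prototypes = km.cluster_centers_                           # 16 discrete prototypes
ids = km.predict(actions[:5])                              # map motions to prototype ids
print(prototypes.shape, ids)
```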
- Next Steps: Learning a Disentangled Gait Representation for Versatile Quadruped Locomotion [69.87112582900363]
Current planners are unable to vary key gait parameters continuously while the robot is in motion.
In this work we address this limitation by learning a latent space capturing the key stance phases constituting a particular gait.
We demonstrate that specific properties of the drive signal map directly to gait parameters such as cadence, footstep height and full stance duration.
arXiv Detail & Related papers (2021-12-09T10:02:02Z)