ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow
- URL: http://arxiv.org/abs/2508.03218v1
- Date: Tue, 05 Aug 2025 08:46:17 GMT
- Title: ActionSink: Toward Precise Robot Manipulation with Dynamic Integration of Action Flow
- Authors: Shanshan Guo, Xiwen Liang, Junfan Lin, Yuzheng Zhuang, Liang Lin, Xiaodan Liang
- Abstract summary: This paper introduces a novel robot manipulation framework, i.e., ActionSink, to pave the way toward precise action estimation. As the name suggests, ActionSink reformulates robot actions as action-caused optical flows from videos, called "action flow". Our framework outperformed the prior SOTA on the LIBERO benchmark by 7.9% in success rate, and obtained nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.
- Score: 93.00917887667234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language-instructed robot manipulation has garnered significant interest due to the potential of learning from collected data. While challenges in high-level perception and planning are increasingly addressed by progress in large general-purpose pre-trained models, the low precision of low-level action estimation has emerged as the key factor limiting manipulation performance. To this end, this paper introduces a novel robot manipulation framework, i.e., ActionSink, to pave the way toward precise action estimation in learning-based robot manipulation. As the name suggests, ActionSink reformulates robot actions as action-caused optical flows from videos, called "action flow", in a self-supervised manner; these flows are then retrieved and integrated to enhance action estimation. Specifically, ActionSink incorporates two primary modules. The first is a coarse-to-fine action flow matcher, which continuously refines the accuracy of the action flow via an iterative retrieval-and-denoising process. The second is a dynamic action flow integrator, which employs a working memory pool that dynamically and efficiently manages the historical action flows to be integrated into the current action estimation. Within this module, a multi-layer fusion module integrates the direct estimate with action flows from both the current step and the working memory, achieving highly accurate action estimation through a series of estimation-integration steps. Our ActionSink framework outperformed the prior SOTA on the LIBERO benchmark by 7.9% in success rate, and obtained nearly an 8% accuracy gain on the challenging long-horizon visual task LIBERO-Long.
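To make the estimation-integration loop above more concrete, the following is a minimal, hypothetical PyTorch-style sketch of a working memory pool and a fusion step. All class names, shapes, and hyperparameters (WorkingMemoryPool, FusionHead, capacity, top_k) are illustrative assumptions, not the authors' implementation; the coarse-to-fine flow matcher that iteratively retrieves and denoises flows is omitted.

```python
# Hypothetical sketch of an ActionSink-style working memory pool and fusion step.
# All names, shapes, and hyperparameters are illustrative assumptions.
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class WorkingMemoryPool:
    """Fixed-size pool of historical action flows, keyed by observation embeddings."""

    def __init__(self, capacity: int = 32):
        self.keys = deque(maxlen=capacity)   # observation embeddings, shape (D,)
        self.flows = deque(maxlen=capacity)  # associated action flows, shape (A,)

    def insert(self, key: torch.Tensor, flow: torch.Tensor) -> None:
        self.keys.append(key.detach())
        self.flows.append(flow.detach())

    def retrieve(self, query: torch.Tensor, top_k: int = 4) -> torch.Tensor:
        """Return the stored flows whose keys are most similar to the query."""
        if not self.keys:
            return torch.empty(0)
        keys = torch.stack(list(self.keys))                          # (N, D)
        sims = F.cosine_similarity(keys, query.unsqueeze(0), dim=-1)  # (N,)
        idx = sims.topk(min(top_k, len(self.flows))).indices
        return torch.stack(list(self.flows))[idx]                    # (k, A)


class FusionHead(nn.Module):
    """Fuses a direct action estimate with retrieved historical action flows."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * action_dim, 64), nn.ReLU(), nn.Linear(64, action_dim)
        )

    def forward(self, direct: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # Fall back to the direct estimate when the memory pool is empty.
        context = retrieved.mean(dim=0) if retrieved.numel() else direct
        return self.mlp(torch.cat([direct, context], dim=-1))


# Toy usage with random tensors standing in for real features and flows.
pool, head = WorkingMemoryPool(), FusionHead()
obs_feat, direct_est = torch.randn(128), torch.randn(7)
for _ in range(10):                      # populate the pool with past steps
    pool.insert(torch.randn(128), torch.randn(7))
refined = head(direct_est, pool.retrieve(obs_feat))
pool.insert(obs_feat, refined)           # write the refined flow back for future steps
print(refined.shape)                     # torch.Size([7])
```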
Related papers
- EC-Flow: Enabling Versatile Robotic Manipulation from Action-Unlabeled Videos via Embodiment-Centric Flow [10.674192015199996]
Embodiment-Centric Flow (EC-Flow) is a framework that learns manipulation from action-unlabeled videos. Our key insight is that incorporating the embodiment's inherent kinematics significantly enhances generalization to versatile manipulation scenarios. Translating EC-Flow to executable robot actions only requires a standard robot URDF file to specify kinematic constraints.
arXiv Detail & Related papers (2025-07-08T17:57:03Z)
- FlowRAM: Grounding Flow Matching Policy with Region-Aware Mamba Framework for Robotic Manipulation [34.045199714747596]
FlowRAM is a novel framework that leverages generative models to achieve region-aware perception. FlowRAM is able to generate physically plausible actions for a variety of real-world tasks in less than 4 time steps.
arXiv Detail & Related papers (2025-06-19T10:30:02Z)
- Action Flow Matching for Continual Robot Learning [57.698553219660376]
Continual learning in robotics seeks systems that can constantly adapt to changing environments and tasks. We introduce a generative framework leveraging flow matching for online robot dynamics model alignment. We find that by transforming the actions themselves rather than exploring with a misaligned model, the robot collects informative data more efficiently.
arXiv Detail & Related papers (2025-04-25T16:26:15Z)
- FAST: Efficient Action Tokenization for Vision-Language-Action Models [98.15494168962563]
We propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform (see the sketch after this entry). Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories.
arXiv Detail & Related papers (2025-01-16T18:57:04Z)
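The FAST entry above describes compression-based action tokenization built on the discrete cosine transform. The following is a minimal, hypothetical sketch of that general recipe (DCT along time, quantize, flatten to integer tokens); the block length, scale factor, and function names are assumptions rather than the released FAST/FAST+ tokenizer, and any further lossless compression of the token stream is omitted.

```python
# Hypothetical DCT-based action tokenization sketch (not the actual FAST tokenizer).
import numpy as np
from scipy.fft import dct, idct


def tokenize(chunk: np.ndarray, scale: float = 10.0) -> np.ndarray:
    """Compress an action chunk of shape (T, action_dim) into integer tokens."""
    coeffs = dct(chunk, axis=0, norm="ortho")      # decorrelate along the time axis
    return np.round(coeffs * scale).astype(np.int32).ravel()


def detokenize(tokens: np.ndarray, horizon: int, action_dim: int,
               scale: float = 10.0) -> np.ndarray:
    """Invert the tokenization back to a (lossy) continuous action chunk."""
    coeffs = tokens.reshape(horizon, action_dim).astype(np.float64) / scale
    return idct(coeffs, axis=0, norm="ortho")


# Toy round trip: a smooth 16-step, 7-DoF action chunk survives coarse quantization.
chunk = np.sin(np.linspace(0, 1, 16))[:, None] * np.ones((16, 7))
tokens = tokenize(chunk)
recon = detokenize(tokens, horizon=16, action_dim=7)
print(tokens.shape, np.abs(recon - chunk).max())
```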
- Coarse-to-fine Q-Network with Action Sequence for Data-Efficient Robot Learning [62.3886343725955]
We introduce a novel value-based reinforcement learning algorithm that learns a critic network that outputs Q-values over a sequence of actions (see the sketch after this entry). Experiments show that CQN-AS outperforms several baselines on a variety of sparse-reward humanoid control and tabletop manipulation tasks.
arXiv Detail & Related papers (2024-11-19T01:23:52Z)
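As a rough illustration of a critic that scores a whole action sequence, here is a hypothetical PyTorch sketch with a single discretization level; the architecture, horizon, and bin count are assumptions, and the coarse-to-fine refinement implied by the title is omitted.

```python
# Hypothetical critic over action sequences (illustrative, not the CQN-AS architecture).
import torch
import torch.nn as nn


class SequenceQNetwork(nn.Module):
    """Maps an observation to Q-values for each step of a K-step action sequence,
    with each action dimension discretized into a fixed number of bins."""

    def __init__(self, obs_dim=64, horizon=4, action_dim=7, bins=11):
        super().__init__()
        self.horizon, self.action_dim, self.bins = horizon, action_dim, bins
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * action_dim * bins),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Q-values shaped (batch, horizon, action_dim, bins).
        q = self.net(obs)
        return q.view(-1, self.horizon, self.action_dim, self.bins)

    def act(self, obs: torch.Tensor) -> torch.Tensor:
        # Greedy action sequence: pick the best bin per step and dimension,
        # then map bin indices back to continuous actions in [-1, 1].
        bins_idx = self.forward(obs).argmax(dim=-1)
        return bins_idx.float() / (self.bins - 1) * 2.0 - 1.0


critic = SequenceQNetwork()
actions = critic.act(torch.randn(2, 64))   # (2, 4, 7) action sequence in [-1, 1]
print(actions.shape)
```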
- PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation [68.17081518640934]
We propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R).
PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module.
Our PIVOT-R outperforms state-of-the-art open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks.
arXiv Detail & Related papers (2024-10-14T11:30:18Z)
- Imitation Learning with Limited Actions via Diffusion Planners and Deep Koopman Controllers [23.292429025366417]
We propose a plan-then-control framework aimed at improving the action-data efficiency of inverse dynamics controllers. Specifically, we adopt a Deep Koopman Operator framework to model the dynamical system and utilize observation-only trajectories to learn a latent action representation. This latent representation can then be effectively mapped to real high-dimensional continuous actions using a linear action decoder (see the sketch after this entry).
arXiv Detail & Related papers (2024-10-10T03:33:57Z)
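To make the latent-dynamics-plus-linear-decoder idea concrete, here is a minimal, hypothetical PyTorch sketch; the encoder, matrix dimensions, and the absence of any training loop are assumptions of this illustration, not the paper's model.

```python
# Hypothetical Deep-Koopman-style latent dynamics with a linear action decoder
# (illustrative assumptions only).
import torch
import torch.nn as nn


class KoopmanLatentModel(nn.Module):
    def __init__(self, obs_dim=32, latent_dim=64, latent_action_dim=8, action_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        # Linear (Koopman-style) latent dynamics: z' = A z + B u
        self.A = nn.Linear(latent_dim, latent_dim, bias=False)
        self.B = nn.Linear(latent_action_dim, latent_dim, bias=False)
        # Linear decoder from latent actions to real continuous actions: a = W u
        self.action_decoder = nn.Linear(latent_action_dim, action_dim, bias=False)

    def predict_next(self, obs: torch.Tensor, latent_action: torch.Tensor) -> torch.Tensor:
        z = self.encoder(obs)
        return self.A(z) + self.B(latent_action)

    def decode_action(self, latent_action: torch.Tensor) -> torch.Tensor:
        return self.action_decoder(latent_action)


model = KoopmanLatentModel()
obs, u = torch.randn(4, 32), torch.randn(4, 8)
z_next = model.predict_next(obs, u)          # (4, 64) predicted next latent state
action = model.decode_action(u)              # (4, 7) executable continuous action
print(z_next.shape, action.shape)
```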
- Affordance-based Robot Manipulation with Flow Matching [6.863932324631107]
We present a framework for assistive robot manipulation. We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model. We learn robot action trajectories guided by affordances with a supervised flow matching method (see the sketch after this entry).
arXiv Detail & Related papers (2024-09-02T09:11:28Z)
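For readers unfamiliar with flow matching as a policy learning objective, here is a generic, hypothetical training-step sketch for a conditional action-trajectory flow model using a straight-line interpolant; the network, conditioning vector, and shapes are illustrative assumptions rather than this paper's implementation.

```python
# Generic conditional flow matching training step for action trajectories
# (a textbook-style sketch, not the paper's implementation).
import torch
import torch.nn as nn

traj_dim = 16 * 7          # e.g., 16 waypoints x 7 DoF, flattened
cond_dim = 64              # e.g., an affordance / observation embedding

# Velocity network v_theta(x_t, t, condition)
velocity_net = nn.Sequential(
    nn.Linear(traj_dim + 1 + cond_dim, 256), nn.ReLU(), nn.Linear(256, traj_dim)
)
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)


def training_step(expert_traj: torch.Tensor, cond: torch.Tensor) -> float:
    """One flow matching update: regress the velocity of a straight-line path
    from Gaussian noise x0 to the expert trajectory x1."""
    x1 = expert_traj
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                    # linear interpolant
    target_velocity = x1 - x0                      # constant along the straight path
    pred = velocity_net(torch.cat([x_t, t, cond], dim=-1))
    loss = ((pred - target_velocity) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Toy batch: random "expert" trajectories and conditions.
print(training_step(torch.randn(8, traj_dim), torch.randn(8, cond_dim)))
```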
- BiKC: Keypose-Conditioned Consistency Policy for Bimanual Robotic Manipulation [48.08416841005715]
We introduce a novel keypose-conditioned consistency policy tailored for bimanual manipulation.
It is a hierarchical imitation learning framework that consists of a high-level keypose predictor and a low-level trajectory generator.
Simulated and real-world experimental results demonstrate that the proposed approach surpasses baseline methods in terms of success rate and operational efficiency.
arXiv Detail & Related papers (2024-06-14T14:49:12Z)
- Goal-conditioned dual-action imitation learning for dexterous dual-arm robot manipulation [4.717749411286867]
Long-horizon dexterous robot manipulation of deformable objects, such as banana peeling, is a challenging task. This paper presents a goal-conditioned dual-action deep imitation learning (DIL) approach that can learn dexterous manipulation skills.
arXiv Detail & Related papers (2022-03-18T05:17:00Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)