ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
- URL: http://arxiv.org/abs/2512.10946v1
- Date: Thu, 11 Dec 2025 18:59:46 GMT
- Title: ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
- Authors: Wendi Chen, Han Xue, Yi Wang, Fangyuan Zhou, Jun Lv, Yang Jin, Shirun Tang, Chuan Wen, Cewu Lu,
- Abstract summary: We propose a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens. Experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines.
- Score: 52.86018040861575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-level contact-rich manipulation relies on the distinct roles of two key modalities: vision provides spatially rich but temporally slow global context, while force sensing captures rapid, high-frequency local contact dynamics. Integrating these signals is challenging due to their fundamental frequency and informational disparities. In this work, we propose ImplicitRDP, a unified end-to-end visual-force diffusion policy that integrates visual planning and reactive force control within a single network. We introduce Structural Slow-Fast Learning, a mechanism utilizing causal attention to simultaneously process asynchronous visual and force tokens, allowing the policy to perform closed-loop adjustments at the force frequency while maintaining the temporal coherence of action chunks. Furthermore, to mitigate modality collapse where end-to-end models fail to adjust the weights across different modalities, we propose Virtual-target-based Representation Regularization. This auxiliary objective maps force feedback into the same space as the action, providing a stronger, physics-grounded learning signal than raw force prediction. Extensive experiments on contact-rich tasks demonstrate that ImplicitRDP significantly outperforms both vision-only and hierarchical baselines, achieving superior reactivity and success rates with a streamlined training pipeline. Code and videos will be publicly available at https://implicit-rdp.github.io.
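To make the slow-fast mechanism concrete, below is a minimal sketch of causal attention over asynchronous visual and force tokens: each token may only attend to tokens whose timestamps do not exceed its own. The rates, tensor names, and masking scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (PyTorch) of causal attention over asynchronous visual and
# force tokens, in the spirit of Structural Slow-Fast Learning. Token rates
# and dimensions are illustrative assumptions, not the paper's design.
import torch

def async_causal_mask(vis_times, force_times):
    """Boolean mask: entry (i, j) is True when token j is visible to token i,
    i.e. token j's timestamp does not exceed token i's (causality in time)."""
    times = torch.cat([vis_times, force_times])   # merged timeline
    order = torch.argsort(times)                  # interleave tokens by arrival
    t = times[order]
    mask = t.unsqueeze(1) >= t.unsqueeze(0)       # attend only to the past
    return mask, order

# Example: vision at 10 Hz, force at 100 Hz over a 0.3 s window.
vis_t = torch.tensor([0.0, 0.1, 0.2])
force_t = torch.arange(0.0, 0.3, 0.01)
mask, order = async_causal_mask(vis_t, force_t)

# The mask plugs into standard attention, e.g.:
# attn = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```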
Related papers
- AsyncVLA: An Asynchronous VLA for Fast and Robust Navigation on the Edge [49.66156306240961]
High latency breaks the control loop, rendering powerful models unsafe for real-time deployment. We propose AsyncVLA, an asynchronous control framework that decouples semantic reasoning from reactive execution. AsyncVLA achieves a 40% higher success rate than state-of-the-art baselines.
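The decoupling this summary names can be pictured as two loops that share only the latest plan, so the fast loop never blocks on the slow one. A minimal sketch, assuming a 1 Hz reasoner and a 50 Hz controller (rates and function names are hypothetical, not AsyncVLA's API):

```python
# Sketch of decoupled slow reasoning / fast reactive execution.
import threading, time

latest_plan = {"waypoints": None}
lock = threading.Lock()

def reasoning_loop(plan_fn, period=1.0):      # slow: e.g. one VLA forward pass
    while True:
        plan = plan_fn()
        with lock:
            latest_plan["waypoints"] = plan   # publish the freshest plan
        time.sleep(period)

def control_loop(act_fn, period=0.02):        # fast: reactive execution
    while True:
        with lock:
            plan = latest_plan["waypoints"]
        if plan is not None:
            act_fn(plan)                      # always act on the latest plan
        time.sleep(period)
```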
arXiv Detail & Related papers (2026-02-13T21:31:19Z)
- DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition [51.80782323686666]
We introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art.
arXiv Detail & Related papers (2025-12-12T10:39:10Z)
- Inference-Time Gaze Refinement for Micro-Expression Recognition: Enhancing Event-Based Eye Tracking with Motion-Aware Post-Processing [2.5465367830324905]
Event-based eye tracking holds significant promise for fine-grained cognitive state inference. We introduce a model-agnostic, inference-time refinement framework to enhance the output of existing event-based gaze estimation models.
arXiv Detail & Related papers (2025-06-14T14:48:11Z)
- ForceVLA: Enhancing VLA Models with a Force-aware MoE for Contact-rich Manipulation [62.58034332427291]
ForceVLA is a novel end-to-end manipulation framework. It treats external force sensing as a first-class modality within VLA systems.
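One generic way to make force a first-class modality is to condition a mixture-of-experts gate on a force embedding. The sketch below assumes that scheme; its dimensions and fusion choices are illustrative and not ForceVLA's actual design.

```python
# Generic sketch of an MoE layer whose gate also sees a force embedding.
import torch
import torch.nn as nn

class ForceAwareMoE(nn.Module):
    def __init__(self, dim=256, force_dim=32, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim + force_dim, n_experts)  # gate conditioned on force

    def forward(self, x, force_emb):
        # x: (B, dim), force_emb: (B, force_dim)
        w = torch.softmax(self.gate(torch.cat([x, force_emb], dim=-1)), dim=-1)
        out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, dim, E)
        return (out * w.unsqueeze(-2)).sum(dim=-1)               # weighted mixture
```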
arXiv Detail & Related papers (2025-05-28T09:24:25Z)
- Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation [58.95799126311524]
Humans can accomplish contact-rich tasks using vision and touch, with highly reactive capabilities such as fast response to external changes and adaptive control of contact forces. Existing visual imitation learning approaches rely on action chunking to model complex behaviors. We introduce TactAR, a low-cost teleoperation system that provides real-time tactile feedback through Augmented Reality.
arXiv Detail & Related papers (2025-03-04T18:58:21Z)
- STAF: Sinusoidal Trainable Activation Functions for Implicit Neural Representation [7.2888019138115245]
Implicit Neural Representations (INRs) have emerged as a powerful framework for modeling continuous signals. The spectral bias of ReLU-based networks is a well-established limitation, restricting their ability to capture fine-grained details in target signals. We introduce Sinusoidal Trainable Activation Functions (STAF). STAF inherently modulates its frequency components, allowing for self-adaptive spectral learning.
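A trainable sinusoidal activation can be sketched as a small learnable Fourier-like series applied elementwise. STAF's exact parameterization may differ; the module below only illustrates the idea of learnable frequency components.

```python
# Sketch of a trainable sinusoidal activation: sum_k a_k * sin(w_k * x + phi_k)
# with learnable amplitudes, frequencies, and phases.
import torch
import torch.nn as nn

class TrainableSine(nn.Module):
    def __init__(self, n_terms=4):
        super().__init__()
        self.amp = nn.Parameter(torch.ones(n_terms))                # amplitudes a_k
        self.freq = nn.Parameter(torch.arange(1.0, n_terms + 1))    # frequencies w_k
        self.phase = nn.Parameter(torch.zeros(n_terms))             # phases phi_k

    def forward(self, x):
        # Broadcast x against the n_terms components, then sum them.
        x = x.unsqueeze(-1)
        return (self.amp * torch.sin(self.freq * x + self.phase)).sum(-1)
```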
arXiv Detail & Related papers (2025-02-02T18:29:33Z)
- Remote Inference over Dynamic Links via Adaptive Rate Deep Task-Oriented Vector Quantization [24.064287427162345]
We propose Adaptive Rate Task-Oriented Vector Quantization (ARTOVeQ), a learned compression mechanism that is tailored for remote inference over dynamic links. We show that ARTOVeQ extends to support low-latency inference that is gradually refined via successive refinement principles. Numerical results demonstrate that the proposed scheme yields remote deep inference that operates with multiple rates, supports a broad range of bit budgets, and facilitates rapid inference that gradually improves with more bits exchanged.
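Successive refinement with vector quantization is commonly realized as residual VQ, where each stage quantizes the previous stage's residual and truncating stages lowers the bit budget. The sketch below assumes that common scheme, which may differ from ARTOVeQ's actual mechanism.

```python
# Sketch of rate-adaptive residual vector quantization: more stages decoded
# means more bits spent and a better reconstruction.
import torch

def residual_vq(x, codebooks, n_stages):
    """x: (B, D); codebooks: list of (K, D) tensors; use the first n_stages."""
    residual, recon, codes = x, torch.zeros_like(x), []
    for cb in codebooks[:n_stages]:
        d = torch.cdist(residual, cb)     # distances to all codewords
        idx = d.argmin(dim=1)             # nearest codeword per vector
        q = cb[idx]
        codes.append(idx)                 # transmitted indices (log2 K bits each)
        recon = recon + q
        residual = residual - q
    return recon, codes
```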
arXiv Detail & Related papers (2025-01-05T12:38:13Z)
- Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP [34.88916568947695]
We propose a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics. For the vision side, we propose an efficient Dynamic Cross-shot Attention. For the semantic side, we conduct text augmentation by constructing an Action Knowledge Graph.
arXiv Detail & Related papers (2024-12-13T06:30:52Z)
- Leveraging Low-Rank and Sparse Recurrent Connectivity for Robust Closed-Loop Control [63.310780486820796]
We show how a parameterization of recurrent connectivity influences robustness in closed-loop settings.
We find that closed-form continuous-time neural networks (CfCs) with fewer parameters can outperform their full-rank, fully-connected counterparts.
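A low-rank parameterization of recurrent connectivity can be sketched as W = U V^T with rank r much smaller than n. The cell below shows only that idea (CfCs themselves use a closed-form continuous-time update, which is omitted here); names and sizes are illustrative.

```python
# Sketch of a rank-constrained recurrent weight: W = U V^T, rank r << n.
import torch
import torch.nn as nn

class LowRankRNNCell(nn.Module):
    def __init__(self, n=128, r=8):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n, r) / n**0.5)  # rank-r factors of W
        self.V = nn.Parameter(torch.randn(n, r) / n**0.5)
        self.inp = nn.Linear(n, n)

    def forward(self, x, h):
        # Recurrent update through U V^T costs O(n*r) instead of O(n^2).
        return torch.tanh(self.inp(x) + (h @ self.V) @ self.U.T)
```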
arXiv Detail & Related papers (2023-10-05T21:44:18Z)