Related papers: Movement Primitive Diffusion: Learning Gentle Robotic Manipulation of Deformable Objects

Related papers

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos [56.510263910611684]
We tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions.<n>Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors.<n>We present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data.
arXiv Detail & Related papers (2026-02-13T18:59:10Z)
DynaRend: Learning 3D Dynamics via Masked Future Rendering for Robotic Manipulation [52.136378691610524]
We present DynaRend, a representation learning framework that learns 3D-aware and dynamics-informed triplane features.<n>By pretraining on multi-view RGB-D video data, DynaRend jointly captures spatial geometry, future dynamics, and task semantics in a unified triplane representation.<n>We evaluate DynaRend on two challenging benchmarks, RLBench and Colosseum, demonstrating substantial improvements in policy success rate, generalization to environmental perturbations, and real-world applicability across diverse manipulation tasks.
arXiv Detail & Related papers (2025-10-28T10:17:11Z)
FreeAction: Training-Free Techniques for Enhanced Fidelity of Trajectory-to-Video Generation [50.39748673817223]
We introduce two training-free, inference-time techniques that fully exploit explicit action parameters in robot video generation.<n>First, action-scaled classifier-free guidance dynamically modulates guidance strength in proportion to action magnitude, enhancing controllability over motion intensity.<n>Second, action-scaled noise truncation adjusts the distribution of initially sampled noise to better align with the desired motion dynamics.
arXiv Detail & Related papers (2025-09-29T03:30:40Z)
Joint Flow Trajectory Optimization For Feasible Robot Motion Generation from Video Demonstrations [8.133207162076877]
We propose a framework for grasp pose generation and object trajectory imitation under the video-based Learning-from-Demonstration (LfD) paradigm.<n>Rather than directly imitating human hand motions, our method treats demonstrations as object-centric guides.<n>We validate our approach in both simulation and real-world experiments across diverse real-world manipulation tasks.
arXiv Detail & Related papers (2025-09-25T03:11:07Z)
GenFlowRL: Shaping Rewards with Generative Object-Centric Flow in Visual Reinforcement Learning [79.68241687396603]
We propose GenFlowRL, which derives shaped rewards from generated flow trained from diverse cross-embodiment datasets.<n>Experiments on 10 manipulation tasks, both in simulation and real-world cross-embodiment evaluations, demonstrate that GenFlowRL effectively leverages manipulation features extracted from generated object-centric flow.
arXiv Detail & Related papers (2025-08-14T20:19:20Z)
Physical Autoregressive Model for Robotic Manipulation without Action Pretraining [65.8971623698511]
We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR)<n>PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining.<n>Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
arXiv Detail & Related papers (2025-08-13T13:54:51Z)
AMPLIFY: Actionless Motion Priors for Robot Learning from Videos [29.799207502031496]
We introduce AMPLIFY, a novel framework that leverages large-scale video data.<n>We train a forward dynamics model on abundant action-free videos and an inverse dynamics model on a limited set of action-labeled examples.<n>In downstream policy learning, our dynamics predictions enable a 1.2-2.2x improvement in low-data regimes, a 1.4x average improvement by learning from action-free human videos, and the first generalization to LIBERO tasks from zero in-distribution action data.
arXiv Detail & Related papers (2025-06-17T05:31:42Z)
CoMo: Learning Continuous Latent Motion from Internet Videos for Scalable Robot Learning [47.195002937893115]
CoMo aims to learn more informative continuous motion representations from diverse, internet-scale videos.<n>We introduce two new metrics for more robustly and affordably evaluating motion and guiding motion learning methods.<n>CoMo exhibits strong zero-shot generalization, enabling it to generate continuous pseudo actions for previously unseen video domains.
arXiv Detail & Related papers (2025-05-22T17:58:27Z)
VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using learned 3D affordance from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z)
Prediction with Action: Visual Policy Learning via Joint Denoising Process [14.588908033404474]
PAD is a visual policy learning framework that unifies image Prediction and robot Action. DiT seamlessly integrates images and robot states, enabling the simultaneous prediction of future images and robot actions. Pad outperforms previous methods, achieving a significant 26.3% relative improvement on the full Metaworld benchmark.
arXiv Detail & Related papers (2024-11-27T09:54:58Z)
VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency. Our framework outperforms state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving a 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z)
ReinDiffuse: Crafting Physically Plausible Motions with Reinforced Diffusion Model [9.525806425270428]
We present emphReinDiffuse that combines reinforcement learning with motion diffusion model to generate physically credible human motions. Our method adapts Motion Diffusion Model to output a parameterized distribution of actions, making them compatible with reinforcement learning paradigms. Our approach outperforms existing state-of-the-art models on two major datasets, HumanML3D and KIT-ML.
arXiv Detail & Related papers (2024-10-09T16:24:11Z)
DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model [72.66465487508556]
DiffGen is a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model. It can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation. Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time.
arXiv Detail & Related papers (2024-05-12T15:38:17Z)
Unsupervised Learning of Effective Actions in Robotics [0.9374652839580183]
Current state-of-the-art action representations in robotics lack proper effect-driven learning of the robot's actions. We propose an unsupervised algorithm to discretize a continuous motion space and generate "action prototypes" We evaluate our method on a simulated stair-climbing reinforcement learning task.
arXiv Detail & Related papers (2024-04-03T13:28:52Z)
Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame. ATM outperforms strong video pre-training baselines by 80% on average. We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z)
Few-Shot Learning of Force-Based Motions From Demonstration Through Pre-training of Haptic Representation [10.553635668779911]
Existing Learning from Demonstration (LfD) approaches require a large number of costly human demonstrations. Our proposed semi-supervised LfD approach decouples the learnt model into an haptic representation encoder and a motion generation decoder. This enables us to pre-train the first using large amount of unsupervised data, easily accessible, while using few-shot LfD to train the second. We validate the motion generated by our semi-supervised LfD model on the physical robot hardware using the KUKA iiwa robot arm.
arXiv Detail & Related papers (2023-09-08T23:42:59Z)
Dynamic-Resolution Model Learning for Object Pile Manipulation [33.05246884209322]
We investigate how to learn dynamic and adaptive representations at different levels of abstraction to achieve the optimal trade-off between efficiency and effectiveness. Specifically, we construct dynamic-resolution particle representations of the environment and learn a unified dynamics model using graph neural networks (GNNs) We show that our method achieves significantly better performance than state-of-the-art fixed-resolution baselines at the gathering, sorting, and redistribution of granular object piles.
arXiv Detail & Related papers (2023-06-29T05:51:44Z)
MILD: Multimodal Interactive Latent Dynamics for Learning Human-Robot Interaction [34.978017200500005]
We propose Multimodal Interactive Latent Dynamics (MILD) to address the problem of two-party physical Human-Robot Interactions (HRIs) We learn the interaction dynamics from demonstrations, using Hidden Semi-Markov Models (HSMMs) to model the joint distribution of the interacting agents in the latent space of a Variational Autoencoder (VAE) MILD generates more accurate trajectories for the controlled agent (robot) when conditioned on the observed agent's (human) trajectory.
arXiv Detail & Related papers (2022-10-22T11:25:11Z)
One to Many: Adaptive Instrument Segmentation via Meta Learning and Dynamic Online Adaptation in Robotic Surgical Video [71.43912903508765]
MDAL is a dynamic online adaptive learning scheme for instrument segmentation in robot-assisted surgery. It learns the general knowledge of instruments and the fast adaptation ability through the video-specific meta-learning paradigm. It outperforms other state-of-the-art methods on two datasets.
arXiv Detail & Related papers (2021-03-24T05:02:18Z)
Learning to Shift Attention for Motion Generation [55.61994201686024]
One challenge of motion generation using robot learning from demonstration techniques is that human demonstrations follow a distribution with multiple modes for one task query. Previous approaches fail to capture all modes or tend to average modes of the demonstrations and thus generate invalid trajectories. We propose a motion generation model with extrapolation ability to overcome this problem.
arXiv Detail & Related papers (2021-02-24T09:07:52Z)
Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.