Prepare Before You Act: Learning From Humans to Rearrange Initial States
- URL: http://arxiv.org/abs/2509.18043v1
- Date: Mon, 22 Sep 2025 17:18:52 GMT
- Title: Prepare Before You Act: Learning From Humans to Rearrange Initial States
- Authors: Yinlong Dai, Andre Keyser, Dylan P. Losey
- Abstract summary: Imitation learning (IL) has proven effective across a wide range of manipulation tasks. We propose ReSET, an algorithm that takes initial states and autonomously modifies object poses so that the restructured scene is similar to training data.
- Score: 4.637185817866919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Imitation learning (IL) has proven effective across a wide range of manipulation tasks. However, IL policies often struggle when faced with out-of-distribution observations; for instance, when the target object is in a previously unseen position or occluded by other objects. In these cases, extensive demonstrations are needed for current IL methods to reach robust and generalizable behaviors. But when humans are faced with these sorts of atypical initial states, we often rearrange the environment for more favorable task execution. For example, a person might rotate a coffee cup so that it is easier to grasp the handle, or push a box out of the way so they can directly grasp their target object. In this work we seek to equip robot learners with the same capability: enabling robots to prepare the environment before executing their given policy. We propose ReSET, an algorithm that takes initial states -- which are outside the policy's distribution -- and autonomously modifies object poses so that the restructured scene is similar to training data. Theoretically, we show that this two step process (rearranging the environment before rolling out the given policy) reduces the generalization gap. Practically, our ReSET algorithm combines action-agnostic human videos with task-agnostic teleoperation data to i) decide when to modify the scene, ii) predict what simplifying actions a human would take, and iii) map those predictions into robot action primitives. Comparisons with diffusion policies, VLAs, and other baselines show that using ReSET to prepare the environment enables more robust task execution with equal amounts of total training data. See videos at our project website: https://reset2025paper.github.io/
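The prepare-then-act loop described in the abstract (decide whether the scene is in-distribution, predict the simplifying action a human would take, map it to a robot primitive, then roll out the given policy) can be sketched as follows. This is a minimal illustrative sketch only; the callable names (`in_distribution`, `predict_action`, `primitives`) are assumptions, not the paper's actual interfaces:

```python
# Hypothetical sketch of the ReSET prepare-then-act loop; the callable
# interfaces below are illustrative assumptions, not the paper's API.

def reset_rollout(state, policy, in_distribution, predict_action, primitives,
                  max_prep_steps=5):
    """Rearrange the scene until it looks in-distribution, then act."""
    for _ in range(max_prep_steps):
        if in_distribution(state):               # i) decide when to modify the scene
            break
        kind, params = predict_action(state)     # ii) predict a simplifying action
        state = primitives[kind](state, params)  # iii) map it to a robot primitive
    return policy(state)                         # roll out the given IL policy


# Toy 1-D example: push the object back toward the region the policy saw in training.
result = reset_rollout(
    state=3.0,
    policy=lambda s: ("grasp", s),
    in_distribution=lambda s: abs(s) < 1.0,
    predict_action=lambda s: ("push", -1.0 if s > 0 else 1.0),
    primitives={"push": lambda s, p: s + p},
)
# result == ("grasp", 0.0): three pushes bring the state in-distribution first
```

The cap on preparation steps keeps the robot from rearranging forever when no primitive makes the scene look in-distribution.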
Related papers
- DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy [33.18108154271181]
We propose DemoDiffusion, a simple and scalable method for enabling robots to perform manipulation tasks in natural environments. Our approach is based on two key insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context.
arXiv Detail & Related papers (2025-06-25T17:59:01Z) - Efficient Alignment of Unconditioned Action Prior for Language-conditioned Pick and Place in Clutter [59.69563889773648]
We study the task of language-conditioned pick and place in clutter, where a robot should grasp a target object in open clutter and move it to a specified place. Some approaches learn end-to-end policies with features from vision foundation models, requiring large datasets. We propose an action prior alignment method that aligns unconditioned action priors with 3D vision-language priors by learning one attention layer.
arXiv Detail & Related papers (2025-03-12T14:20:33Z) - P3-PO: Prescriptive Point Priors for Visuo-Spatial Generalization of Robot Policies [19.12762500264209]
Prescriptive Point Priors for Policies or P3-PO is a novel framework that constructs a unique state representation of the environment. P3-PO exhibits 58% and 80% gains across tasks for new object instances and more cluttered environments respectively.
arXiv Detail & Related papers (2024-12-09T18:59:42Z) - Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers [41.069074375686164]
We propose Heterogeneous Pre-trained Transformers (HPT), which pre-train a trunk of a policy neural network to learn a task- and embodiment-agnostic shared representation.
We conduct experiments to investigate the scaling behaviors of training objectives across 52 datasets.
HPTs outperform several baselines and enhance the fine-tuned policy performance by over 20% on unseen tasks.
arXiv Detail & Related papers (2024-09-30T17:39:41Z) - Hand-Object Interaction Pretraining from Videos [77.92637809322231]
We learn general robot manipulation priors from 3D hand-object interaction trajectories.
We do so by lifting both the human hand and the manipulated object into a shared 3D space and retargeting human motions to robot actions.
We empirically demonstrate that finetuning this policy, with both reinforcement learning (RL) and behavior cloning (BC), enables sample-efficient adaptation to downstream tasks and simultaneously improves robustness and generalizability compared to prior approaches.
arXiv Detail & Related papers (2024-09-12T17:59:07Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation.
Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal.
We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
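One common way to turn predicted point tracks (as in the Track2Act summary above) into a coarse end-effector motion is to fit a rigid transform between the current and predicted point sets, which a residual policy can then refine. The sketch below uses the standard Kabsch least-squares fit; it is an illustrative assumption, not Track2Act's actual implementation:

```python
import numpy as np

def rigid_transform_from_tracks(p0, p1):
    """Estimate rotation R and translation t mapping points p0 -> p1 (Kabsch)."""
    c0, c1 = p0.mean(axis=0), p1.mean(axis=0)          # centroids
    H = (p0 - c0).T @ (p1 - c1)                        # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))             # guard against reflection
    D = np.diag([1.0] * (p0.shape[1] - 1) + [d])
    R = Vt.T @ D @ U.T
    t = c1 - R @ c0
    return R, t

# Toy 2-D check: recover a known 90-degree rotation and translation.
p0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
R_true = np.array([[0.0, -1.0], [1.0, 0.0]])
t_true = np.array([1.0, 2.0])
p1 = p0 @ R_true.T + t_true
R_est, t_est = rigid_transform_from_tracks(p0, p1)
```

A residual policy, in this framing, would take the scene observation plus the transform-induced motion and output a small correction on top of it.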
arXiv Detail & Related papers (2024-05-02T17:56:55Z) - Policy Adaptation from Foundation Model Feedback [31.5870515250885]
Recent progress on vision-language foundation models has brought significant advancement to building general-purpose robots.
By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks.
In this work, we propose Policy Adaptation from Foundation model Feedback (PAFF)
We show PAFF improves baselines by a large margin in all cases.
arXiv Detail & Related papers (2022-12-14T18:31:47Z) - Learning Representations that Enable Generalization in Assistive Tasks [45.62648124988644]
We focus on enabling generalization in assistive tasks in which the robot is acting to assist a user.
We find that sim2real methods that encode environment (or population) parameters and work well in tasks that robots do in isolation, do not work well in assistance.
arXiv Detail & Related papers (2022-12-05T18:59:16Z) - Learning What To Do by Simulating the Past [76.86449554580291]
We show that by combining a learned feature encoder with learned inverse models, we can enable agents to simulate human actions backwards in time to infer what they must have done.
The resulting algorithm is able to reproduce a specific skill in MuJoCo environments given a single state sampled from the optimal policy for that skill.
arXiv Detail & Related papers (2021-04-08T17:43:29Z) - COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning [78.13740204156858]
We show that we can reuse prior data to extend new skills simply through dynamic programming.
We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets for solving a new task.
We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands.
arXiv Detail & Related papers (2020-10-27T17:57:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.