Learning to Grasp Anything by Playing with Random Toys
- URL: http://arxiv.org/abs/2510.12866v1
- Date: Tue, 14 Oct 2025 17:56:10 GMT
- Title: Learning to Grasp Anything by Playing with Random Toys
- Authors: Dantong Niu, Yuvan Sharma, Baifeng Shi, Rachel Ding, Matteo Gioia, Haoru Xue, Henry Tsai, Konstantinos Kallidromitis, Anirudh Pai, Shankar Shastry, Trevor Darrell, Jitendra Malik, Roei Herzig
- Abstract summary: We show that robots can learn generalizable grasping using randomly assembled objects. We find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation.
- Score: 65.47078295823074
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robotic manipulation policies often struggle to generalize to novel objects, limiting their real-world utility. In contrast, cognitive science suggests that children develop generalizable dexterous manipulation skills by mastering a small set of simple toys and then applying that knowledge to more complex items. Inspired by this, we study if similar generalization capabilities can also be achieved by robots. Our results indicate robots can learn generalizable grasping using randomly assembled objects that are composed from just four shape primitives: spheres, cuboids, cylinders, and rings. We show that training on these "toys" enables robust generalization to real-world objects, yielding strong zero-shot performance. Crucially, we find the key to this generalization is an object-centric visual representation induced by our proposed detection pooling mechanism. Evaluated in both simulation and on physical robots, our model achieves a 67% real-world grasping success rate on the YCB dataset, outperforming state-of-the-art approaches that rely on substantially more in-domain data. We further study how zero-shot generalization performance scales by varying the number and diversity of training toys and the demonstrations per toy. We believe this work offers a promising path to scalable and generalizable learning in robotic manipulation. Demonstration videos, code, checkpoints and our dataset are available on our project page: https://lego-grasp.github.io/ .
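The abstract credits the zero-shot generalization to an object-centric representation induced by a "detection pooling" mechanism but does not spell out the details here. A minimal sketch of one plausible reading, assuming the mechanism pools dense visual features over detected object boxes into per-object tokens (the function name, box format, and mean-pooling choice are illustrative assumptions, not the paper's specification):

```python
import numpy as np

def detection_pooling(feature_map, boxes):
    """Pool dense visual features inside each detected box into a single
    object-centric token (mean over the box region).

    feature_map: (H, W, C) array of per-pixel visual features.
    boxes: list of (y0, x0, y1, x1) integer detections, end-exclusive.
    Returns: (N, C) array with one pooled feature vector per box.
    """
    tokens = []
    for y0, x0, y1, x1 in boxes:
        region = feature_map[y0:y1, x0:x1]                  # (h, w, C) crop
        tokens.append(region.reshape(-1, region.shape[-1]).mean(axis=0))
    return np.stack(tokens)

# Toy example: a 4x4 feature map with 2 channels and two detected objects.
fmap = np.arange(32, dtype=float).reshape(4, 4, 2)
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
tokens = detection_pooling(fmap, boxes)
print(tokens.shape)  # (2, 2): one 2-dim token per detected object
```

Under this reading, the policy would consume the pooled per-object tokens rather than the full feature map, which is one way such a representation could transfer across object appearances.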
Related papers
- Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation [14.013652439013692]
This paper presents a new paradigm, HERO, for object loco-manipulation with humanoid robots. We achieve this by designing an accurate residual-aware EE tracking policy. We use this accurate end-effector tracker to build a modular system for loco-manipulation.
arXiv Detail & Related papers (2026-02-18T18:55:02Z) - $π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization [81.73746512639283]
We describe a new model based on $\pi_{0.5}$ that uses co-training on heterogeneous tasks to enable broad generalization. We demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills.
arXiv Detail & Related papers (2025-04-22T17:31:29Z) - Disentangled Object-Centric Image Representation for Robotic Manipulation [6.775909411692767]
We propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments.
arXiv Detail & Related papers (2025-03-14T16:33:48Z) - Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination [25.62602420895531]
DreMa is a new approach for constructing digital twins using learned explicit representations of the real world and its dynamics. We show that DreMa can successfully learn novel physical tasks from just a single example per task variation.
arXiv Detail & Related papers (2024-12-19T15:38:15Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Scaling Robot Learning with Semantically Imagined Experience [21.361979238427722]
Recent advances in robot learning have shown promise in enabling robots to perform manipulation tasks.
One of the key contributing factors to this progress is the scale of robot data used to train the models.
We propose an alternative route and leverage text-to-image foundation models widely used in computer vision and natural language processing.
arXiv Detail & Related papers (2023-02-22T18:47:51Z) - RT-1: Robotics Transformer for Real-World Control at Scale [98.09428483862165]
We present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties.
We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks.
arXiv Detail & Related papers (2022-12-13T18:55:15Z) - DexTransfer: Real World Multi-fingered Dexterous Grasping with Minimal Human Demonstrations [51.87067543670535]
We propose a robot-learning system that can take a small number of human demonstrations and learn to grasp unseen object poses.
We train a dexterous grasping policy that takes the point clouds of the object as input and predicts continuous actions to grasp objects from different initial robot states.
The policy learned from our dataset can generalize well on unseen object poses in both simulation and the real world.
arXiv Detail & Related papers (2022-09-28T17:51:49Z) - Learning Generalizable Dexterous Manipulation from Human Grasp Affordance [11.060931225148936]
Dexterous manipulation with a multi-finger hand is one of the most challenging problems in robotics.
Recent progress in imitation learning has largely improved the sample efficiency compared to Reinforcement Learning.
We propose to learn dexterous manipulation using large-scale demonstrations with diverse 3D objects in a category.
arXiv Detail & Related papers (2022-04-05T16:26:22Z) - Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning [108.08083976908195]
We show that policies learned by existing reinforcement learning algorithms can in fact be generalist.
We show that a single generalist policy can perform in-hand manipulation of over 100 geometrically-diverse real-world objects.
Interestingly, we find that multi-task learning with object point cloud representations not only generalizes better but even outperforms single-object specialist policies.
arXiv Detail & Related papers (2021-11-04T17:59:56Z) - Attribute-Based Robotic Grasping with One-Grasp Adaptation [9.255994599301712]
We introduce an end-to-end learning method of attribute-based robotic grasping with one-grasp adaptation capability.
Our approach fuses the embeddings of a workspace image and a query text using a gated-attention mechanism and learns to predict instance grasping affordances.
Experimental results in both simulation and the real world demonstrate that our approach achieves over 80% instance grasping success rate on unknown objects.
arXiv Detail & Related papers (2021-04-06T03:40:46Z) - Learning Generalizable Robotic Reward Functions from "In-The-Wild" Human Videos [59.58105314783289]
Domain-agnostic Video Discriminator (DVD) learns multitask reward functions by training a discriminator to classify whether two videos are performing the same task.
DVD can generalize by virtue of learning from a small amount of robot data with a broad dataset of human videos.
DVD can be combined with visual model predictive control to solve robotic manipulation tasks on a real WidowX200 robot in an unseen environment from a single human demo.
arXiv Detail & Related papers (2021-03-31T05:25:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.