Related papers: Is Diversity All You Need for Scalable Robotic Manipulation?

Is Diversity All You Need for Scalable Robotic Manipulation?

URL: http://arxiv.org/abs/2507.06219v1
Date: Tue, 08 Jul 2025 17:52:44 GMT
Title: Is Diversity All You Need for Scalable Robotic Manipulation?
Authors: Modi Shi, Li Chen, Jin Chen, Yuxiang Lu, Chiming Liu, Guanghui Ren, Ping Luo, Di Huang, Maoqing Yao, Hongyang Li,
Abstract summary: We investigate the nuanced role of data diversity in robot learning by examining three critical dimensions-task (what to do), embodiment (which robot to use), and expert (who demonstrates)-challenging the conventional intuition of "more diverse is better"<n>We show that task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios.<n>We propose a distribution debiasing method to mitigate velocity ambiguity, the yielding GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times pre-training data.
Score: 50.747150672933316
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Data scaling has driven remarkable success in foundation models for Natural Language Processing (NLP) and Computer Vision (CV), yet the principles of effective data scaling in robotic manipulation remain insufficiently understood. In this work, we investigate the nuanced role of data diversity in robot learning by examining three critical dimensions-task (what to do), embodiment (which robot to use), and expert (who demonstrates)-challenging the conventional intuition of "more diverse is better". Throughout extensive experiments on various robot platforms, we reveal that (1) task diversity proves more critical than per-task demonstration quantity, benefiting transfer from diverse pre-training tasks to novel downstream scenarios; (2) multi-embodiment pre-training data is optional for cross-embodiment transfer-models trained on high-quality single-embodiment data can efficiently transfer to different platforms, showing more desirable scaling property during fine-tuning than multi-embodiment pre-trained models; and (3) expert diversity, arising from individual operational preferences and stochastic variations in human demonstrations, can be confounding to policy learning, with velocity multimodality emerging as a key contributing factor. Based on this insight, we propose a distribution debiasing method to mitigate velocity ambiguity, the yielding GO-1-Pro achieves substantial performance gains of 15%, equivalent to using 2.5 times pre-training data. Collectively, these findings provide new perspectives and offer practical guidance on how to scale robotic manipulation datasets effectively.

Related papers

RoboGene: Boosting VLA Pre-training via Diversity-Driven Agentic Framework for Real-World Task Generation [37.52152452548065]
RoboGene is an agentic framework designed to automate the generation of diverse, physically plausible manipulation tasks.<n>We conduct extensive quantitative analysis and large-scale real-world experiments, collecting datasets of 18k trajectories.<n>Results demonstrate that RoboGene significantly outperforms state-of-the-art foundation models.
arXiv Detail & Related papers (2026-02-18T13:29:43Z)
H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation [27.585828712261232]
H-RDT (Human to Robotics Diffusion Transformer) is a novel approach that leverages human manipulation data to enhance robot manipulation capabilities.<n>Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies.<n>We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders.
arXiv Detail & Related papers (2025-07-31T13:06:59Z)
Sample Efficient Robot Learning in Supervised Effect Prediction Tasks [0.0]
MUSEL (Model Uncertainty for Sample-Efficient Learning) is a novel AL framework tailored for regression tasks in robotics.<n>We show that MUSEL improves both learning accuracy and sample efficiency, validating its effectiveness in learning action effects selecting informative samples.
arXiv Detail & Related papers (2024-12-03T09:48:28Z)
SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation [82.61572106180705]
This paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates.
arXiv Detail & Related papers (2024-09-26T17:26:16Z)
Scaling Cross-Embodied Learning: One Policy for Manipulation, Navigation, Locomotion and Aviation [49.03165169369552]
By training a single policy across many different kinds of robots, a robot learning method can leverage much broader and more diverse datasets. We propose CrossFormer, a scalable and flexible transformer-based policy that can consume data from any embodiment. We demonstrate that the same network weights can control vastly different robots, including single and dual arm manipulation systems, wheeled robots, quadcopters, and quadrupeds.
arXiv Detail & Related papers (2024-08-21T17:57:51Z)
Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation [16.809190349155525]
We propose a novel adaptation paradigm that leverages readily available paired human-robot video data to bridge the domain gap.<n>Our method employs a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robot domain in a parameter-efficient manner.
arXiv Detail & Related papers (2024-06-20T11:57:46Z)
Active Exploration in Bayesian Model-based Reinforcement Learning for Robot Manipulation [8.940998315746684]
We propose a model-based reinforcement learning (RL) approach for robotic arm end-tasks. We employ Bayesian neural network models to represent, in a probabilistic way, both the belief and information encoded in the dynamic model during exploration. Our experiments show the advantages of our Bayesian model-based RL approach, with similar quality in the results than relevant alternatives.
arXiv Detail & Related papers (2024-04-02T11:44:37Z)
Robot Fine-Tuning Made Easy: Pre-Training Rewards and Policies for Autonomous Real-World Reinforcement Learning [58.3994826169858]
We introduce RoboFuME, a reset-free fine-tuning system for robotic reinforcement learning. Our insights are to utilize offline reinforcement learning techniques to ensure efficient online fine-tuning of a pre-trained policy. Our method can incorporate data from an existing robot dataset and improve on a target task within as little as 3 hours of autonomous real-world experience.
arXiv Detail & Related papers (2023-10-23T17:50:08Z)
Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.<n>Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.<n>Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning [33.88636835443266]
We propose a framework to better scale up robot learning under the lens of multi-task, multi-scene robot manipulation in kitchen environments. Our framework, named CACTI, has four stages that separately handle data collection, data augmentation, visual representation learning, and imitation policy training. In the CACTI framework, we highlight the benefit of adapting state-of-the-art models for image generation as part of the augmentation stage.
arXiv Detail & Related papers (2022-12-12T05:30:08Z)
What Matters in Learning from Offline Human Demonstrations for Robot Manipulation [64.43440450794495]
We conduct an extensive study of six offline learning algorithms for robot manipulation. Our study analyzes the most critical challenges when learning from offline human data. We highlight opportunities for learning from human datasets.
arXiv Detail & Related papers (2021-08-06T20:48:30Z)
Scalable Multi-Task Imitation Learning with Autonomous Improvement [159.9406205002599]
We build an imitation learning system that can continuously improve through autonomous data collection. We leverage the robot's own trials as demonstrations for tasks other than the one that the robot actually attempted. In contrast to prior imitation learning approaches, our method can autonomously collect data with sparse supervision for continuous improvement.
arXiv Detail & Related papers (2020-02-25T18:56:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.