DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
- URL: http://arxiv.org/abs/2505.09603v1
- Date: Wed, 14 May 2025 17:55:10 GMT
- Title: DataMIL: Selecting Data for Robot Imitation Learning with Datamodels
- Authors: Shivin Dass, Alaa Khaddaj, Logan Engstrom, Aleksander Madry, Andrew Ilyas, Roberto Martín-Martín,
- Abstract summary: We introduce DataMIL, a policy-driven data selection framework built on the datamodels paradigm. Unlike standard practices that filter data using human notions of quality, DataMIL directly optimizes data selection for task success. We validate our approach on a suite of more than 60 simulation and real-world manipulation tasks.
- Score: 77.48472034791213
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, the robotics community has amassed ever larger and more diverse datasets to train generalist robot policies. However, while these policies achieve strong mean performance across a variety of tasks, they often underperform on individual, specialized tasks and require further tuning on newly acquired task-specific data. Combining task-specific data with carefully curated subsets of large prior datasets via co-training can produce better specialized policies, but selecting data naively may actually harm downstream performance. To address this, we introduce DataMIL, a policy-driven data selection framework built on the datamodels paradigm that reasons about data selection in an end-to-end manner, using the policy itself to identify which data points will most improve performance. Unlike standard practices that filter data using human notions of quality (e.g., based on semantic or visual similarity), DataMIL directly optimizes data selection for task success, allowing us to select data that enhance the policy while dropping data that degrade it. To avoid performing expensive rollouts in the environment during selection, we use a novel surrogate loss function on task-specific data, allowing us to use DataMIL in the real world without degrading performance. We validate our approach on a suite of more than 60 simulation and real-world manipulation tasks, most notably showing successful data selection from the Open X-Embodiment datasets, demonstrating consistent gains in success rates and superior performance over multiple baselines. Our results underscore the importance of end-to-end, performance-aware data selection for unlocking the potential of large prior datasets in robotics. More information at https://robin-lab.cs.utexas.edu/datamodels4imitation/
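To make the datamodels idea concrete, here is a minimal sketch under stated assumptions (an illustration, not the authors' implementation): train the policy on many random subsets of the prior data, record a surrogate task loss for each run, fit a linear model from subset membership to that loss, and keep the points whose estimated influence on the loss is most negative. The `train_and_eval` callable is a hypothetical stand-in for the expensive train-then-evaluate step that the paper's surrogate loss is designed to keep cheap.

```python
# Hypothetical datamodel-style data selection sketch (not the authors' code).
import numpy as np

def fit_datamodel(n_points, n_subsets, subset_frac, train_and_eval, seed=0):
    """Fit a linear datamodel: surrogate loss ~ <w, inclusion mask> + b."""
    rng = np.random.default_rng(seed)
    masks = (rng.random((n_subsets, n_points)) < subset_frac).astype(float)
    losses = np.array([train_and_eval(np.flatnonzero(m)) for m in masks])
    # Least-squares fit of per-datapoint influence weights (plus intercept).
    X = np.hstack([masks, np.ones((n_subsets, 1))])
    coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
    return coef[:-1]  # w[i] ~ effect of including point i on the task loss

def select_data(weights, k):
    """Keep the k points whose inclusion is estimated to lower the loss most."""
    return np.argsort(weights)[:k]

# Toy usage: pretend each point has a hidden helpful/harmful effect.
true_effect = np.random.default_rng(1).normal(size=500)
toy_eval = lambda idx: true_effect[idx].sum() + np.random.normal(scale=0.1)
w = fit_datamodel(500, n_subsets=2000, subset_frac=0.5, train_and_eval=toy_eval)
keep = select_data(w, k=100)  # co-train on this subset + task-specific data
```

In practice the subset-retraining runs dominate the cost, which is why cheap surrogate losses (rather than environment rollouts) are central to making this workable on real robots.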
Related papers
- Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap [13.89078939095465]
We introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks.
arXiv Detail & Related papers (2025-08-06T07:24:14Z)
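A hedged sketch of what difficulty scoring via the DPO implicit reward might look like (my reading of the abstract, not the authors' code): the implicit reward of a response is beta times the policy-vs-reference log-probability ratio, and a small gap between the chosen and rejected responses suggests a harder, potentially more informative pair. Log-probabilities are assumed precomputed.

```python
# Illustrative DPO implicit-reward-gap selection (assumed reading, not the paper's code).
import numpy as np

def implicit_reward_gap(logp_pol_w, logp_ref_w, logp_pol_l, logp_ref_l, beta=0.1):
    # DPO implicit reward: r(x, y) = beta * (log pi_theta(y|x) - log pi_ref(y|x))
    return beta * ((logp_pol_w - logp_ref_w) - (logp_pol_l - logp_ref_l))

def select_hard_pairs(gaps, k):
    """A small gap between chosen and rejected suggests a harder pair."""
    return np.argsort(np.abs(gaps))[:k]

rng = np.random.default_rng(0)
gaps = implicit_reward_gap(*rng.normal(size=(4, 1000)))
hard_idx = select_hard_pairs(gaps, k=200)
```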
- Data Whisperer: Efficient Data Selection for Task-Specific LLM Fine-Tuning via Few-Shot In-Context Learning [40.19639581728674]
Fine-tuning large language models (LLMs) on task-specific data is essential for their effective deployment. We propose Data Whisperer, an efficient, training-free, attention-based method that leverages few-shot in-context learning with the model to be fine-tuned. Data Whisperer achieves superior performance compared to the full GSM8K dataset on the Llama-3-8B-Instruct model, using just 10% of the data, and outperforms existing methods with a 3.1-point improvement and a 7.4$\times$ speedup.
arXiv Detail & Related papers (2025-05-18T03:10:00Z)
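A rough, hypothetical sketch of training-free, in-context-learning-based scoring in this spirit; it scores candidates by few-shot task performance and ignores the paper's attention-based weighting. `icl_predict` is an assumed callable that runs the to-be-fine-tuned model with the given demonstrations and returns a score on a small probe set.

```python
# Simplified ICL-based data scoring (not Data Whisperer's actual algorithm).
import numpy as np

def score_candidates(candidates, probe_set, icl_predict, n_trials=8, seed=0):
    """Score each candidate by the model's few-shot performance when the
    candidate is included among the in-context demonstrations."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(len(candidates))
    for i, cand in enumerate(candidates):
        for _ in range(n_trials):
            others = rng.choice(len(candidates), size=3, replace=False)
            demos = [cand] + [candidates[j] for j in others]
            scores[i] += icl_predict(demos, probe_set)
        scores[i] /= n_trials
    return scores  # e.g., keep the top-scoring ~10% for fine-tuning
```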
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance. Data selection has shown promise in identifying the most representative samples from the entire dataset. We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
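A minimal sketch of one way CLIP embeddings can drive selection (my simplification, not the paper's framework): score each (image, caption) pair by cross-modal alignment so noisy pairs rank low, then greedily filter near-duplicates. Embeddings are assumed precomputed and L2-normalized.

```python
# Simplified CLIP-embedding-based selection (assumption: normalized embeddings).
import numpy as np

def clip_select(img_emb, txt_emb, k):
    """Rank samples by image-text alignment (noisy pairs score low), then
    greedily add samples dissimilar to those already chosen."""
    align = (img_emb * txt_emb).sum(axis=1)   # cosine similarity per pair
    order = np.argsort(-align)                # most aligned first
    chosen = [order[0]]
    for i in order[1:]:
        if len(chosen) == k:
            break
        sim_to_chosen = img_emb[chosen] @ img_emb[i]
        if sim_to_chosen.max() < 0.95:        # crude redundancy filter
            chosen.append(i)
    return np.array(chosen)                   # may be < k if data is redundant
```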
- Adapt-$\infty$: Scalable Continual Multimodal Instruction Tuning via Dynamic Data Selection [89.42023974249122]
Adapt-$\infty$ is a new multi-way and adaptive data selection approach for lifelong instruction tuning. We construct pseudo-skill clusters by grouping gradient-based sample vectors. We select the best-performing data selector for each skill cluster from a pool of selector experts. This data selector samples a subset of the most important samples from each skill cluster for training.
arXiv Detail & Related papers (2024-10-14T15:48:09Z)
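A hedged sketch of the cluster-then-select pattern described above (not the authors' implementation): cluster gradient-based sample vectors into pseudo-skill groups, then take the most important samples from each cluster. A single `importance` score stands in for the paper's pool of selector experts.

```python
# Simplified pseudo-skill clustering + per-cluster selection (assumptions noted above).
import numpy as np
from sklearn.cluster import KMeans

def adaptive_select(grad_feats, importance, n_clusters=8, per_cluster=50, seed=0):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(grad_feats)
    keep = []
    for c in range(n_clusters):               # one pseudo-skill cluster at a time
        members = np.flatnonzero(labels == c)
        top = members[np.argsort(-importance[members])][:per_cluster]
        keep.extend(top.tolist())
    return np.array(keep)
```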
- Feature Selection from Differentially Private Correlations [35.187113265093615]
High-dimensional regression can leak information about individual datapoints in a dataset.
We employ a correlations-based order statistic to choose important features from a dataset and privatize them.
We find that our method significantly outperforms the established baseline for private feature selection on many datasets.
arXiv Detail & Related papers (2024-08-20T13:54:07Z)
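An illustrative sketch of correlation-based private feature selection (a simplification; the noise scale here is heuristic and does not reproduce the paper's mechanism or its privacy accounting):

```python
# Heuristic private feature selection via noisy correlation scores.
import numpy as np

def private_top_k_features(X, y, k, eps, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Pearson correlation of each feature with the target, in [-1, 1].
    Xc = (X - X.mean(0)) / (X.std(0) + 1e-12)
    yc = (y - y.mean()) / (y.std() + 1e-12)
    corr = np.abs(Xc.T @ yc) / n
    # Report-noisy-max style selection: add Laplace noise and take the k
    # largest scores. The sensitivity used here is illustrative only.
    noisy = corr + rng.laplace(scale=2.0 * k / (n * eps), size=d)
    return np.argsort(-noisy)[:k]
```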
- Better Synthetic Data by Retrieving and Transforming Existing Datasets [63.875064274379824]
We introduce DataTune, a method to make better use of publicly available datasets to improve automatic dataset generation.
On a diverse set of language-based tasks, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49%.
We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks.
arXiv Detail & Related papers (2024-04-22T17:15:32Z)
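A rough sketch of a retrieve-then-transform pipeline in this spirit (not DataTune's actual code); `embed` and `llm` are hypothetical callables for a text-embedding model and an instruction-following LLM:

```python
# Hypothetical retrieve-then-transform pipeline (embed/llm are stand-ins).
import numpy as np

def retrieve_datasets(task_desc, catalog, embed, top_n=3):
    """Rank candidate public datasets by similarity of their descriptions."""
    q = embed(task_desc)
    scores = [float(np.dot(q, embed(d["description"]))) for d in catalog]
    return [catalog[i] for i in np.argsort(scores)[::-1][:top_n]]

def transform_rows(rows, task_desc, llm):
    """Rewrite retrieved examples so they match the target task format."""
    template = ("Rewrite the following example so it fits this task: "
                + task_desc + "\nExample: {row}")
    return [llm(template.format(row=r)) for r in rows]
```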
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface-form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
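A minimal sketch of low-rank gradient-similarity selection in the spirit of LESS (not the released implementation); per-example training gradients and validation gradients are assumed already projected to a low dimension, e.g., via a random projection of LoRA gradients:

```python
# Simplified gradient-similarity selection (assumes precomputed projected grads).
import numpy as np

def less_select(train_grads, val_grads, frac=0.05):
    """Score each training example by cosine similarity between its projected
    gradient and the mean validation gradient; keep the top fraction."""
    g_val = val_grads.mean(axis=0)
    g_val /= np.linalg.norm(g_val) + 1e-12
    tg = train_grads / (np.linalg.norm(train_grads, axis=1, keepdims=True) + 1e-12)
    scores = tg @ g_val
    k = max(1, int(frac * len(train_grads)))
    return np.argsort(-scores)[:k]
```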
- DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z)
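Complementing the fitting sketch after the main abstract above, here is a small illustration of how a fitted linear datamodel can be used to compare candidate training subsets (my illustration of the DsDm idea, not the paper's code); `w` and `b` are the fitted per-example weights and intercept:

```python
# Using a fitted linear datamodel to evaluate and pick subsets (illustrative).
import numpy as np

def predicted_loss(mask, w, b):
    """Datamodel prediction of target-task loss if we train on `mask`."""
    return float(w @ mask + b)

def greedy_k(w, b, k):
    """With a linear datamodel, the loss-minimizing size-k subset is just the
    k most-negative weights; shown via the prediction for clarity."""
    idx = np.argsort(w)[:k]
    mask = np.zeros_like(w)
    mask[idx] = 1.0
    return idx, predicted_loss(mask, w, b)
```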
- Behavior Retrieval: Few-Shot Imitation Learning by Querying Unlabeled Datasets [73.2096288987301]
We propose a simple approach that uses a small amount of downstream expert data to selectively query relevant behaviors from an offline, unlabeled dataset.
We observe that our method learns to query only the relevant transitions to the task, filtering out sub-optimal or task-irrelevant data.
Our simple querying approach outperforms more complex goal-conditioned methods by 20% across simulated and real robotic manipulation tasks from images.
arXiv Detail & Related papers (2023-04-18T05:42:53Z)
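A hedged sketch of similarity-based behavior retrieval (my reading of the abstract): embed expert and offline state-action pairs with a learned encoder and keep the offline transitions closest to the expert data. `encode` is a hypothetical stand-in for such an encoder.

```python
# Simplified behavior retrieval by embedding similarity (encode is assumed).
import numpy as np

def retrieve_behaviors(offline_sa, expert_sa, encode, frac=0.1):
    """Keep the offline transitions closest to any expert transition in the
    learned embedding space; co-train the policy on expert + retrieved data."""
    z_off = encode(offline_sa)                 # (N, d) offline embeddings
    z_exp = encode(expert_sa)                  # (M, d) expert embeddings
    z_off /= np.linalg.norm(z_off, axis=1, keepdims=True) + 1e-12
    z_exp /= np.linalg.norm(z_exp, axis=1, keepdims=True) + 1e-12
    sim = (z_off @ z_exp.T).max(axis=1)        # best expert match per transition
    k = max(1, int(frac * len(z_off)))
    return np.argsort(-sim)[:k]
```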
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.