λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics
- URL: http://arxiv.org/abs/2412.05313v6
- Date: Tue, 04 Mar 2025 17:33:11 GMT
- Title: λ: A Benchmark for Data-Efficiency in Long-Horizon Indoor Mobile Manipulation Robotics
- Authors: Ahmed Jaafar, Shreyas Sundara Raman, Yichen Wei, Sudarshan Harithas, Sofia Juliani, Anneke Wernerfelt, Benedict Quartey, Ifrah Idrees, Jason Xinyu Liu, Stefanie Tellex,
- Abstract summary: We introduce the LAMBDA benchmark (Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities). This benchmark evaluates the data efficiency of models on language-conditioned, long-horizon, multi-room, multi-floor, pick-and-place tasks. Our benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings.
- Score: 11.901933884058021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning to execute long-horizon mobile manipulation tasks is crucial for advancing robotics in household and workplace settings. However, current approaches are typically data-inefficient, underscoring the need for improved models, which in turn require realistically sized benchmarks to evaluate their efficiency. To address this, we introduce the LAMBDA (λ) benchmark (Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities), which evaluates the data efficiency of models on language-conditioned, long-horizon, multi-room, multi-floor, pick-and-place tasks using a dataset of manageable size that is more feasible to collect. Our benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings. Unlike planner-generated data, these trajectories offer natural variability and replay-verifiability, ensuring robust learning and evaluation. We leverage LAMBDA to benchmark current end-to-end learning methods and a modular neuro-symbolic approach that combines foundation models with task and motion planning. We find that end-to-end methods, even when pretrained, yield lower success rates, while neuro-symbolic methods perform significantly better and require less data.
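For illustration only: a data-efficiency benchmark of this kind is typically read as a curve of task success rate versus the number of demonstrations used for training. The minimal sketch below is not the benchmark's released code; `train_fn` and `success_rate_fn` are hypothetical user-supplied callables standing in for whichever learner (end-to-end or neuro-symbolic) and evaluation rollout are being benchmarked.

```python
import random
from typing import Callable, Dict, Iterable, Sequence

def data_efficiency_curve(
    demos: Sequence,
    train_fn: Callable[[Sequence], object],       # hypothetical: trains a policy from a demo subset
    success_rate_fn: Callable[[object], float],   # hypothetical: rolls out the policy, returns success rate in [0, 1]
    budgets: Iterable[int] = (50, 150, 300, 571), # 571 matches the benchmark's full demo count
    seed: int = 0,
) -> Dict[int, float]:
    """Train on progressively larger demonstration subsets and record success rates."""
    rng = random.Random(seed)
    curve = {}
    for n in budgets:
        subset = rng.sample(list(demos), min(n, len(demos)))
        policy = train_fn(subset)
        curve[n] = success_rate_fn(policy)
    return curve  # maps number of demonstrations -> task success rate
```

A more data-efficient method is simply one whose curve rises faster at small demonstration budgets.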
Related papers
- Action Flow Matching for Continual Robot Learning [57.698553219660376]
Continual learning in robotics seeks systems that can constantly adapt to changing environments and tasks.
We introduce a generative framework leveraging flow matching for online robot dynamics model alignment.
We find that by transforming the actions themselves rather than exploring with a misaligned model, the robot collects informative data more efficiently.
arXiv Detail & Related papers (2025-04-25T16:26:15Z) - EMMOE: A Comprehensive Benchmark for Embodied Mobile Manipulation in Open Environments [11.97783742296183]
We introduce Embodied Mobile Manipulation in Open Environments (EMMOE), which requires agents to interpret user instructions and execute long-horizon everyday tasks in continuous space.
EMMOE seamlessly integrates high-level and low-level embodied tasks into a unified framework, along with three new metrics for more diverse assessment.
Furthermore, we design HomieBot, a sophisticated agent system consisting of an LLM trained with Direct Preference Optimization (DPO), lightweight navigation and manipulation models, and multiple error-detection mechanisms.
arXiv Detail & Related papers (2025-03-11T16:42:36Z) - A Dynamic and High-Precision Method for Scenario-Based HRA Synthetic Data Collection in Multi-Agent Collaborative Environments Driven by LLMs [2.50572897318757]
This study introduces a novel, scenario-driven method for workload estimation, leveraging fine-tuned large language models (LLMs).
The method dynamically adapts to changes in operator workload, providing more accurate, flexible, and scalable workload estimates.
arXiv Detail & Related papers (2025-01-16T09:23:48Z) - Affordance-based Robot Manipulation with Flow Matching [6.863932324631107]
We present a framework for assistive robot manipulation.
We tackle two challenges: first, efficiently adapting large-scale models to downstream scene affordance understanding tasks, and second, effectively learning robot action trajectories by grounding the visual affordance model.
We learn robot action trajectories guided by affordances in a supervised flow matching method.
arXiv Detail & Related papers (2024-09-02T09:11:28Z) - DiffGen: Robot Demonstration Generation via Differentiable Physics Simulation, Differentiable Rendering, and Vision-Language Model [72.66465487508556]
DiffGen is a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model.
It can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation.
Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time.
arXiv Detail & Related papers (2024-05-12T15:38:17Z) - Bridging Language, Vision and Action: Multimodal VAEs in Robotic Manipulation Tasks [0.0]
In this work, we focus on unsupervised vision-language-action mapping in the area of robotic manipulation.
We propose a model-invariant training alternative that improves the models' performance in a simulator by up to 55%.
Our work thus also sheds light on the potential benefits and limitations of using the current multimodal VAEs for unsupervised learning of robotic motion trajectories.
arXiv Detail & Related papers (2024-04-02T13:25:16Z) - Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - LocoMuJoCo: A Comprehensive Imitation Learning Benchmark for Locomotion [20.545058017790428]
Imitation Learning holds great promise for enabling agile locomotion in embodied agents.
We present a novel benchmark designed to facilitate rigorous evaluation and comparison of IL algorithms.
This benchmark encompasses a diverse set of environments, including quadrupeds, bipeds, and musculoskeletal human models.
arXiv Detail & Related papers (2023-11-04T19:41:50Z) - Towards Robust Continual Learning with Bayesian Adaptive Moment Regularization [51.34904967046097]
Continual learning seeks to overcome the challenge of catastrophic forgetting, where a model forgets previously learnt information.
We introduce BAdam, a novel prior-based method that better constrains parameter growth, reducing catastrophic forgetting.
Results show that BAdam achieves state-of-the-art performance for prior-based methods on challenging single-headed class-incremental experiments.
arXiv Detail & Related papers (2023-09-15T17:10:51Z) - VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z) - Value function estimation using conditional diffusion models for control [62.27184818047923]
We propose a simple algorithm called Diffused Value Function (DVF).
It learns a joint multi-step model of the environment-robot interaction dynamics using a diffusion model.
We show how DVF can be used to efficiently capture the state visitation measure for multiple controllers.
arXiv Detail & Related papers (2023-06-09T18:40:55Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are available only on the source dataset and unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - Model-Agnostic Multitask Fine-tuning for Few-shot Vision-Language Transfer Learning [59.38343286807997]
We propose Model-Agnostic Multitask Fine-tuning (MAMF) for vision-language models on unseen tasks.
Compared with model-agnostic meta-learning (MAML), MAMF discards the bi-level optimization and uses only first-order gradients.
We show that MAMF consistently outperforms the classical fine-tuning method for few-shot transfer learning on five benchmark datasets.
arXiv Detail & Related papers (2022-03-09T17:26:53Z) - Can Deep Learning be Applied to Model-Based Multi-Object Tracking? [25.464269324261636]
Multi-object tracking (MOT) is the problem of tracking the state of an unknown and time-varying number of objects using noisy measurements.
Deep learning (DL) has been increasingly used in MOT for improving tracking performance.
In this paper, we propose a Transformer-based DL tracker and evaluate its performance in the model-based setting.
arXiv Detail & Related papers (2022-02-16T07:43:08Z) - MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic
Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z) - Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z) - BREEDS: Benchmarks for Subpopulation Shift [98.90314444545204]
We develop a methodology for assessing the robustness of models to subpopulation shift.
We leverage the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions.
Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity.
arXiv Detail & Related papers (2020-08-11T17:04:47Z)
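For the BREEDS entry above, the subpopulation-shift idea can be illustrated with a simplified sketch (not the authors' code; the hierarchy format and helper name are assumptions): fine-grained subclasses are grouped under superclasses, and disjoint subclasses populate the training and test distributions.

```python
# Simplified illustration of a subpopulation-shift split in the spirit of BREEDS:
# within each superclass, one set of subclasses forms the source (training)
# distribution and a disjoint set forms the shifted (test) distribution.
from typing import Dict, List, Tuple

def subpopulation_split(
    hierarchy: Dict[str, List[str]],  # superclass -> list of its subclasses
) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Assign disjoint halves of each superclass's subclasses to train and test."""
    train, test = {}, {}
    for superclass, subclasses in hierarchy.items():
        half = max(1, len(subclasses) // 2)
        train[superclass] = subclasses[:half]  # source subpopulations
        test[superclass] = subclasses[half:]   # shifted subpopulations
    return train, test

# Example: a "dog" superclass is trained on some breeds and tested on unseen breeds.
train, test = subpopulation_split({"dog": ["beagle", "husky", "corgi", "poodle"]})
```

A classifier is then trained on the superclass labels using only the source subpopulations and evaluated on the shifted ones, exposing how much it relies on subclass-specific features.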
This list is automatically generated from the titles and abstracts of the papers on this site.