AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots
- URL: http://arxiv.org/abs/2409.11905v1
- Date: Wed, 18 Sep 2024 12:05:30 GMT
- Title: AlignBot: Aligning VLM-powered Customized Task Planning with User Reminders Through Fine-Tuning for Household Robots
- Authors: Zhaxizhuoma, Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li
- Abstract summary: AlignBot is a novel framework designed to optimize VLM-powered customized task planning for household robots.
In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders.
- Score: 44.47999496605951
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents AlignBot, a novel framework designed to optimize VLM-powered customized task planning for household robots by effectively aligning with user reminders. In domestic settings, aligning task planning with user reminders poses significant challenges due to the limited quantity, diversity, and multimodal nature of the reminders. To address these challenges, AlignBot employs a fine-tuned LLaVA-7B model, functioning as an adapter for GPT-4o. This adapter internalizes diverse forms of user reminders, such as personalized preferences, corrective guidance, and contextual assistance, into structured instruction-formatted cues that prompt GPT-4o to generate customized task plans. Additionally, AlignBot integrates a dynamic retrieval mechanism that selects task-relevant historical successes as prompts for GPT-4o, further enhancing task-planning accuracy. To validate the effectiveness of AlignBot, experiments are conducted in household environments constructed within the laboratory to replicate typical domestic settings. A multimodal dataset with over 1,500 entries derived from volunteer reminders is used for training and evaluation. The results demonstrate that AlignBot significantly improves customized task planning, outperforming existing LLM- and VLM-powered planners by interpreting and aligning with user reminders: it achieves an 86.8% success rate versus 21.6% for the vanilla GPT-4o baseline, an absolute improvement of 65.2 percentage points and more than a fourfold gain in effectiveness. Supplementary materials are available at: https://yding25.com/AlignBot/
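The pipeline described in the abstract has two moving parts: a fine-tuned LLaVA-7B adapter that turns multimodal user reminders into instruction-formatted cues, and a dynamic retrieval step that prepends task-relevant past successes before querying GPT-4o. Below is a minimal sketch of that flow; the helpers `llava_adapter`, `retrieve_successes`, and `call_gpt4o` are hypothetical stand-ins for the paper's actual components, and the keyword-overlap retrieval is only a placeholder for whatever similarity measure the real system uses.

```python
from dataclasses import dataclass

@dataclass
class Reminder:
    text: str                       # e.g. "use the small cups for espresso"
    image_path: str | None = None   # reminders may be multimodal

def call_gpt4o(prompt: str) -> str:
    """Stub for a GPT-4o API call; swap in a real client here."""
    raise NotImplementedError

def llava_adapter(reminders: list[Reminder]) -> list[str]:
    """Hypothetical stand-in for the fine-tuned LLaVA-7B adapter that
    internalizes raw reminders into instruction-formatted cues."""
    return [f"Instruction: {r.text}" for r in reminders]

def retrieve_successes(task: str, history: list[dict], k: int = 3) -> list[dict]:
    """Toy dynamic retrieval: rank past successes by naive keyword
    overlap with the current task and keep the top k."""
    def overlap(h: dict) -> int:
        return len(set(task.split()) & set(h["task"].split()))
    return sorted(history, key=overlap, reverse=True)[:k]

def plan(task: str, reminders: list[Reminder], history: list[dict]) -> str:
    """Assemble cues plus retrieved successes into one planning prompt."""
    cues = llava_adapter(reminders)
    shots = retrieve_successes(task, history)
    prompt = "\n".join(
        cues
        + [f"Past success: {h['task']} -> {h['plan']}" for h in shots]
        + [f"Task: {task}", "Produce a numbered step-by-step plan."]
    )
    return call_gpt4o(prompt)
```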
Related papers
- GRAPE: Generalizing Robot Policy via Preference Alignment [58.419992317452376]
We present GRAPE: Generalizing Robot Policy via Preference Alignment.
We show GRAPE increases success rates on in-domain and unseen manipulation tasks by 51.79% and 58.20%, respectively.
GRAPE can be aligned with various objectives, such as safety and efficiency, reducing collision rates by 37.44% and rollout step-length by 11.15%, respectively.
arXiv Detail & Related papers (2024-11-28T18:30:10Z)
- SELP: Generating Safe and Efficient Task Plans for Robot Agents with Large Language Models [24.22168861692322]
We present three key insights: equivalence voting, constrained decoding, and domain-specific fine-tuning.
Equivalence voting ensures consistency by generating and sampling multiple Linear Temporal Logic (LTL) formulas and voting across their equivalence classes (see the sketch below).
Constrained decoding then uses the selected formula to constrain the autoregressive inference of plans.
Domain-specific fine-tuning customizes LLMs to produce safe and efficient plans within specific task domains.
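A minimal sketch of the equivalence-voting step, under the reading that several sampled LTL translations are grouped by equivalence and the largest class wins. Checking true LTL equivalence requires comparing the formulas' automata (e.g., with a model-checking library); the `canonical` helper below is a deliberately naive syntactic stand-in.

```python
from collections import Counter

def canonical(ltl: str) -> str:
    """Naive placeholder for LTL equivalence: normalize whitespace.
    A faithful check would compare the formulas' automata instead."""
    return ltl.replace(" ", "")

def equivalence_vote(candidates: list[str]) -> str:
    """Group sampled LTL formulas into (approximate) equivalence
    classes and return a representative of the largest class."""
    if not candidates:
        raise ValueError("no candidate formulas supplied")
    winner, _ = Counter(canonical(c) for c in candidates).most_common(1)[0]
    return next(c for c in candidates if canonical(c) == winner)

samples = ["G (clean -> F put_away)", "G(clean -> F put_away)", "F put_away"]
print(equivalence_vote(samples))   # "G (clean -> F put_away)" wins 2-to-1
```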
arXiv Detail & Related papers (2024-09-28T22:33:44Z)
- Enhancing Supermarket Robot Interaction: A Multi-Level LLM Conversational Interface for Handling Diverse Customer Intents [46.623273455512106]
This paper presents the design and evaluation of a novel multi-level LLM interface for supermarket robots.
We compare this approach to a specialized GPT model powered by GPT-4 Turbo.
We find statistically significant improvements in four key areas: performance, user satisfaction, user-agent partnership, and self-image enhancement.
arXiv Detail & Related papers (2024-06-16T19:13:01Z)
- Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V [38.80155683176581]
We present COME-robot, the first closed-loop framework for autonomous robot navigation and manipulation in open environments.
We meticulously construct a library of action primitives for robot exploration, navigation, and manipulation, serving as callable execution modules for GPT-4V in task planning.
We conduct comprehensive analyses to elucidate how COME-robot's design facilitates failure recovery, free-form instruction following, and long-horizon task planning.
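One way to read "a library of action primitives serving as callable execution modules" is a registry of named skills that the VLM planner can invoke via tool calls. Here is a minimal sketch under that assumption; the `@primitive` decorator and the skill names are illustrative, not COME-robot's actual API.

```python
from typing import Callable

PRIMITIVES: dict[str, Callable[..., str]] = {}

def primitive(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a robot skill so a planner (e.g., GPT-4V via tool
    calling) can invoke it by name."""
    PRIMITIVES[fn.__name__] = fn
    return fn

@primitive
def navigate_to(location: str) -> str:
    return f"navigated to {location}"   # stub: a real module drives the base

@primitive
def grasp(obj: str) -> str:
    return f"grasped {obj}"             # stub: a real module runs the gripper

def execute(call: dict) -> str:
    """Dispatch one planner-emitted call, e.g. a tool-use response like
    {"name": "grasp", "args": {"obj": "mug"}}."""
    return PRIMITIVES[call["name"]](**call["args"])

print(execute({"name": "navigate_to", "args": {"location": "kitchen"}}))
```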
arXiv Detail & Related papers (2024-04-16T02:01:56Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents that perform open-vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation [8.158152532619576]
This paper introduces MobileGPT, an innovative mobile task automator equipped with a human-like app memory.
MobileGPT emulates the cognitive process of humans interacting with a mobile app -- explore, select, derive, and recall.
We implement MobileGPT using online LLM services (GPT-3.5 and GPT-4) and evaluate its performance on a dataset of 185 tasks across 18 mobile apps.
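The explore-select-derive-recall loop suggests a task memory keyed by app and task, so that a previously derived action trace can be replayed without new LLM calls. Below is a toy sketch under that reading; the structure and names are our assumption, not MobileGPT's actual design.

```python
class AppMemory:
    """Toy hierarchical memory: (app, task) -> recorded action trace."""
    def __init__(self) -> None:
        self.traces: dict[tuple[str, str], list[str]] = {}

    def recall(self, app: str, task: str) -> list[str] | None:
        return self.traces.get((app, task))

    def remember(self, app: str, task: str, actions: list[str]) -> None:
        self.traces[(app, task)] = actions

def run_task(memory: AppMemory, app: str, task: str) -> list[str]:
    cached = memory.recall(app, task)
    if cached is not None:          # recall: replay without the LLM
        return cached
    # explore/select/derive would query an LLM here; we stub the result
    actions = [f"open {app}", f"do '{task}'"]
    memory.remember(app, task, actions)
    return actions

mem = AppMemory()
print(run_task(mem, "mail", "archive newsletters"))  # derived, then stored
print(run_task(mem, "mail", "archive newsletters"))  # recalled from memory
```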
arXiv Detail & Related papers (2023-12-04T06:13:35Z)
- AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation [50.737355245505334]
We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks.
The resulting dataset, AlphaBlock, consists of 35 comprehensive high-level tasks with multi-step text plans and paired observations.
arXiv Detail & Related papers (2023-05-30T09:54:20Z)
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought [95.37585041654535]
Embodied AI is capable of planning and executing action sequences for robots to accomplish long-horizon tasks in physical environments.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI.
Experiments show the effectiveness of EmbodiedGPT on embodied tasks, including embodied planning, embodied control, visual captioning, and visual question answering.
arXiv Detail & Related papers (2023-05-24T11:04:30Z)
- AutoML-GPT: Automatic Machine Learning with GPT [74.30699827690596]
We propose developing task-oriented prompts and automatically utilizing large language models (LLMs) to automate the training pipeline.
We present AutoML-GPT, which employs GPT as the bridge to diverse AI models and dynamically trains models with optimized hyperparameters.
This approach achieves remarkable results in computer vision, natural language processing, and other challenging areas.
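A minimal sketch of what a "task-oriented prompt" bridging GPT to a training pipeline could look like; the template fields are a guess at the kind of metadata such a system passes, not AutoML-GPT's actual format.

```python
def task_prompt(task: str, dataset: str, metric: str, budget_gpu_hours: float) -> str:
    """Assemble a task-oriented prompt asking an LLM to propose a
    model and hyperparameters for an automated training pipeline."""
    return (
        f"Task: {task}\n"
        f"Dataset: {dataset}\n"
        f"Optimize for: {metric}\n"
        f"Compute budget: {budget_gpu_hours} GPU-hours\n"
        'Respond with JSON: {"model": ..., "hyperparameters": {...}}'
    )

print(task_prompt("image classification", "CIFAR-10", "top-1 accuracy", 8.0))
```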
arXiv Detail & Related papers (2023-05-04T02:09:43Z)
- GAT: Guided Adversarial Training with Pareto-optimal Auxiliary Tasks [73.88590165742721]
We propose a novel adversarial training technique that exploits auxiliary tasks under a limited set of training data.
Our approach extends single-task models into multi-task models during the min-max optimization of adversarial training.
We demonstrate that guided multi-task learning is an actionable and promising avenue for further pushing the boundaries of model robustness.
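A schematic single training step for the min-max objective with one auxiliary task, assuming a model that returns both main and auxiliary logits. The fixed `aux_weight` and the one-step FGSM inner maximizer are simplifications of our own; the paper instead searches for Pareto-optimal task weightings.

```python
import torch
import torch.nn.functional as F

def combined_loss(model, x, y_main, y_aux, aux_weight):
    main_logits, aux_logits = model(x)   # model is assumed to return two heads
    return (F.cross_entropy(main_logits, y_main)
            + aux_weight * F.cross_entropy(aux_logits, y_aux))

def adversarial_multitask_step(model, x, y_main, y_aux,
                               eps=8 / 255, aux_weight=0.5):
    """One sketched min-max step: craft an FGSM perturbation against the
    combined multi-task loss (inner max), then return the training loss
    on the perturbed input (outer min)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = combined_loss(model, x_adv, y_main, y_aux, aux_weight)
    grad, = torch.autograd.grad(loss, x_adv)       # gradient w.r.t. input only
    x_adv = (x_adv + eps * grad.sign()).detach()   # inner maximization step
    return combined_loss(model, x_adv, y_main, y_aux, aux_weight)
```

The returned loss would then be backpropagated through the model parameters by the usual optimizer step.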
arXiv Detail & Related papers (2023-02-06T16:23:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.