Enhancing Robotic Manipulation with AI Feedback from Multimodal Large
Language Models
- URL: http://arxiv.org/abs/2402.14245v1
- Date: Thu, 22 Feb 2024 03:14:03 GMT
- Title: Enhancing Robotic Manipulation with AI Feedback from Multimodal Large
Language Models
- Authors: Jinyi Liu, Yifu Yuan, Jianye Hao, Fei Ni, Lingzhi Fu, Yibin Chen, Yan
Zheng
- Abstract summary: Large language models (LLMs) can provide automated preference feedback solely from image inputs to guide decision-making.
In this study, we train a multimodal LLM, termed CriticGPT, capable of understanding trajectory videos in robot manipulation tasks.
Experimental evaluation of the algorithm's preference accuracy demonstrates its effective generalization ability to new tasks.
Performance on Meta-World tasks reveals that CriticGPT's reward model efficiently guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.
- Score: 41.38520841504846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, there has been considerable attention towards leveraging large
language models (LLMs) to enhance decision-making processes. However, aligning
the natural language text instructions generated by LLMs with the vectorized
operations required for execution presents a significant challenge, often
necessitating task-specific details. To circumvent the need for such
task-specific granularity, inspired by preference-based policy learning
approaches, we investigate the utilization of multimodal LLMs to provide
automated preference feedback solely from image inputs to guide
decision-making. In this study, we train a multimodal LLM, termed CriticGPT,
capable of understanding trajectory videos in robot manipulation tasks, serving
as a critic to offer analysis and preference feedback. Subsequently, we
validate the effectiveness of preference labels generated by CriticGPT from a
reward modeling perspective. Experimental evaluation of the algorithm's
preference accuracy demonstrates its effective generalization ability to new
tasks. Furthermore, performance on Meta-World tasks reveals that CriticGPT's
reward model efficiently guides policy learning, surpassing rewards based on
state-of-the-art pre-trained representation models.
Related papers
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark.
We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z) - Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning [53.6472920229013]
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks.
LLMs are prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning.
We introduce Q*, a framework for guiding LLMs decoding process with deliberative planning.
arXiv Detail & Related papers (2024-06-20T13:08:09Z) - Meta Reasoning for Large Language Models [58.87183757029041]
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs)
MRP guides LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task.
We evaluate the effectiveness of MRP through comprehensive benchmarks.
arXiv Detail & Related papers (2024-06-17T16:14:11Z) - Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL)
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z) - Large Language Model Agent as a Mechanical Designer [7.136205674624813]
In this study, we present a novel approach that integrates pre-trained LLMs with a FEM module.
The FEM module evaluates each design and provides essential feedback, guiding the LLMs to continuously learn, plan, generate, and optimize designs without the need for domain-specific training.
Our results reveal that these LLM-based agents can successfully generate truss designs that comply with natural language specifications with a success rate of up to 90%, which varies according to the applied constraints.
arXiv Detail & Related papers (2024-04-26T16:41:24Z) - On the Performance of Multimodal Language Models [4.677125897916577]
This study conducts a comparative analysis of different multimodal instruction tuning approaches.
We reveal key insights for guiding architectural choices when incorporating multimodal capabilities into large language models.
arXiv Detail & Related papers (2023-10-04T23:33:36Z) - Unlocking the Potential of User Feedback: Leveraging Large Language
Model as User Simulator to Enhance Dialogue System [65.93577256431125]
We propose an alternative approach called User-Guided Response Optimization (UGRO) to combine it with a smaller task-oriented dialogue model.
This approach uses LLM as annotation-free user simulator to assess dialogue responses, combining them with smaller fine-tuned end-to-end TOD models.
Our approach outperforms previous state-of-the-art (SOTA) results.
arXiv Detail & Related papers (2023-06-16T13:04:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.