Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model
- URL: http://arxiv.org/abs/2404.04619v1
- Date: Sat, 6 Apr 2024 12:51:00 GMT
- Title: Do We Really Need a Complex Agent System? Distill Embodied Agent into a Single Model
- Authors: Zhonghan Zhao, Ke Ma, Wenhao Chai, Xuan Wang, Kewei Chen, Dongxu Guo, Yanting Zhang, Hongwei Wang, Gaoang Wang,
- Abstract summary: We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks.
After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance.
- Score: 15.558269067931374
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the power of large language models (LLMs), open-ended embodied agents can flexibly understand human instructions, generate interpretable guidance strategies, and output executable actions. Nowadays, Multi-modal Language Models~(MLMs) integrate multi-modal signals into LLMs, further bringing richer perception to entity agents and allowing embodied agents to perceive world-understanding tasks more delicately. However, existing works: 1) operate independently by agents, each containing multiple LLMs, from perception to action, resulting in gaps between complex tasks and execution; 2) train MLMs on static data, struggling with dynamics in open-ended scenarios; 3) input prior knowledge directly as prompts, suppressing application flexibility. We propose STEVE-2, a hierarchical knowledge distillation framework for open-ended embodied tasks, characterized by 1) a hierarchical system for multi-granular task division, 2) a mirrored distillation method for parallel simulation data, and 3) an extra expert model for bringing additional knowledge into parallel simulation. After distillation, embodied agents can complete complex, open-ended tasks without additional expert guidance, utilizing the performance and knowledge of a versatile MLM. Extensive evaluations on navigation and creation tasks highlight the superior performance of STEVE-2 in open-ended tasks, with $1.4 \times$ - $7.3 \times$ in performance.
Related papers
- Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications [42.945657927971]
This paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks.<n>To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs.<n>Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear.
arXiv Detail & Related papers (2025-07-28T09:33:12Z) - Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration [63.90193684394165]
We introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation.<n>During the experiential learning phase, we quantify the quality for each step in the task-solving workflow and store the resulting rewards.<n>During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step.
arXiv Detail & Related papers (2025-05-29T07:24:37Z) - EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM [8.3321872381107]
We introduce EMAC+, an Embodied Multimodal Agent that collaboratively integrates LLM and VLM.<n>Unlike existing methods, EMAC+ dynamically refines high-level textual plans using real-time feedback from a VLM executing low-level visual control tasks.<n>EMAC+ achieves superior task performance, against noisy observations, and efficient learning.
arXiv Detail & Related papers (2025-05-26T12:34:16Z) - AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications.<n>AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z) - OmniNova:A General Multimodal Agent Framework [0.5439020425819]
Large Language Models (LLMs) with specialized tools presents new opportunities for intelligent automation systems.
We present OmniNova, a modular multi-agent automation framework that combines language models with specialized tools such as web search, crawling, and code execution capabilities.
arXiv Detail & Related papers (2025-03-25T19:21:01Z) - Enhancing Multi-Agent Systems via Reinforcement Learning with LLM-based Planner and Graph-based Policy [31.041340552853004]
Graph Collaboration MARL (LGC-MARL) is a framework that efficiently combines Large Language Models (LLMs) and Multi-Agent Reinforcement Learning (MARL)
LGC-MARL decomposes complex tasks into executable subtasks and achieves efficient collaboration among multiple agents through graph-based coordination.
Experimental results on the AI2-THOR simulation platform demonstrate the superior performance and scalability of LGC-MARL.
arXiv Detail & Related papers (2025-03-13T05:02:49Z) - UnifiedMLLM: Enabling Unified Representation for Multi-modal Multi-tasks With Large Language Model [11.885204227946549]
We propose a comprehensive model designed to represent various tasks using a unified representation.
Our model exhibits strong capabilities in comprehending the implicit intent of user instructions.
Our approach exhibits exceptional scalability and generality.
arXiv Detail & Related papers (2024-08-05T14:27:39Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [60.17448025069594]
We investigate the potential of Large Language Models to enhance multimodal representation in multimodal item-to-item recommendations.
One feasible method is the transfer of Multimodal Large Language Models (MLLMs) for representation tasks.
We propose a novel training framework, NoteLLM-2, specifically designed for multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - Meta-Task Planning for Language Agents [13.550774629515843]
Large language model-based agents (LLM agents) have emerged as a promising paradigm for achieving artificial general intelligence (AGI)
This paper introduces Meta-Task Planning (MTP), a zero-shot methodology for collaborative LLM-based multi-agent systems.
MTP achieved an average $sim40%$ success rate on TravelPlanner, significantly higher than the state-of-the-art (SOTA) baseline.
arXiv Detail & Related papers (2024-05-26T10:33:17Z) - Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning [14.635361844362794]
Smurfs' is a cutting-edge multi-agent framework designed to revolutionize the application of large language models.
Smurfs can enhance the model's ability to solve complex tasks at no additional cost.
arXiv Detail & Related papers (2024-05-09T17:49:04Z) - LLMBind: A Unified Modality-Task Integration Framework [38.95771765322677]
We introduce textbfLLMBind, a novel framework designed to unify a diverse array of multi-modal tasks.
By harnessing a Mixture-of-Experts (MoE) Large Language Model (LLM), LLMBind processes multi-modal inputs and generates task-specific tokens, enabling the invocation of corresponding models to accomplish tasks.
arXiv Detail & Related papers (2024-02-22T12:36:31Z) - Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z) - Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub [79.31134731122462]
We introduce OpenAct benchmark to evaluate the open-domain task-solving capability, built on human expert consultation and repositories in GitHub.<n>We present OpenAgent, a novel LLM-based agent system that can tackle evolving queries in open domains through autonomously integrating specialized tools from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z) - TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z) - Agent Lumos: Unified and Modular Training for Open-Source Language Agents [89.78556964988852]
We introduce LUMOS, one of the first frameworks for training open-source LLM-based agents.
LUMOS features a learnable, unified, and modular architecture with a planning module that learns high-level subgoal generation.
We collect large-scale, unified, and high-quality training annotations derived from diverse ground-truth reasoning rationales.
arXiv Detail & Related papers (2023-11-09T00:30:13Z) - Recommender AI Agent: Integrating Large Language Models for Interactive
Recommendations [53.76682562935373]
We introduce an efficient framework called textbfInteRecAgent, which employs LLMs as the brain and recommender models as tools.
InteRecAgent achieves satisfying performance as a conversational recommender system, outperforming general-purpose LLMs.
arXiv Detail & Related papers (2023-08-31T07:36:44Z) - LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset,
Framework, and Benchmark [81.42376626294812]
We present Language-Assisted Multi-Modal instruction tuning dataset, framework, and benchmark.
Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs.
We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.