m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
- URL: http://arxiv.org/abs/2403.11085v4
- Date: Sun, 22 Sep 2024 06:08:23 GMT
- Title: m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
- Authors: Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, Ranjay Krishna,
- Abstract summary: We introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools.
For each of these task queries, we provide automatically generated plans using this realistic toolset.
We provide a high-quality subset of 1,565 task plans that are human-verified and correctly.
- Score: 31.031053149807857
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world multi-modal problems are rarely solved by a single machine learning model, and often require multi-step computational plans that involve stitching several models. Tool-augmented LLMs hold tremendous promise for automating the generation of such computational plans. However, the lack of standardized benchmarks for evaluating LLMs as planners for multi-step multi-modal tasks has prevented a systematic study of planner design decisions. Should LLMs generate a full plan in a single shot or step-by-step? Should they invoke tools directly with Python code or through structured data formats like JSON? Does feedback improve planning? To answer these questions and more, we introduce m&m's: a benchmark containing 4K+ multi-step multi-modal tasks involving 33 tools that include multi-modal models, (free) public APIs, and image processing modules. For each of these task queries, we provide automatically generated plans using this realistic toolset. We further provide a high-quality subset of 1,565 task plans that are human-verified and correctly executable. With m&m's, we evaluate 10 popular LLMs with 2 planning strategies (multi-step vs. step-by-step planning), 2 plan formats (JSON vs. code), and 3 types of feedback (parsing/verification/execution). Finally, we summarize takeaways from our extensive experiments. Our dataset and code are available on HuggingFace (https://huggingface.co/datasets/zixianma/mnms) and Github (https://github.com/RAIVNLab/mnms).
Related papers
- Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering [48.55956886819481]
We introduce a modular multi-LMM agent framework based on several agents with different roles.
Specifically, we propose TraveLER, a method that can create a plan to "Traverse" through the video.
We find that the proposed TraveLER approach improves performance on several VideoQA benchmarks without the need to fine-tune on specific datasets.
arXiv Detail & Related papers (2024-04-01T20:58:24Z) - Graph-enhanced Large Language Models in Asynchronous Plan Reasoning [18.402877904882107]
We find that large language models (LLMs) behave poorly when not supplied with illustrations about the task-solving process in our benchmark AsyncHow.
We propose a novel technique called Plan Like a Graph (PLaG) that combines graphs with natural language prompts and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-02-05T08:26:33Z) - Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the aforementioned capabilities into a planner, caller, and summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
arXiv Detail & Related papers (2024-01-14T16:17:07Z) - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task
Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z) - Improving Planning with Large Language Models: A Modular Agentic Architecture [7.63815864256878]
Large language models (LLMs) often struggle with tasks that require multi-step reasoning or goal-directed planning.
We propose an agentic architecture, the Modular Agentic Planner (MAP), in which planning is accomplished via the recurrent interaction of specialized modules.
We find that MAP yields significant improvements over both standard LLM methods.
arXiv Detail & Related papers (2023-09-30T00:10:14Z) - SheetCopilot: Bringing Software Productivity to the Next Level through
Large Language Models [60.171444066848856]
We propose a SheetCopilot agent that takes natural language task and control spreadsheet to fulfill the requirements.
We curate a representative dataset containing 221 spreadsheet control tasks and establish a fully automated evaluation pipeline.
Our SheetCopilot correctly completes 44.3% of tasks for a single generation, outperforming the strong code generation baseline by a wide margin.
arXiv Detail & Related papers (2023-05-30T17:59:30Z) - AdaPlanner: Adaptive Planning from Feedback with Language Models [56.367020818139665]
Large language models (LLMs) have recently demonstrated the potential in acting as autonomous agents for sequential decision-making tasks.
We propose a closed-loop approach, AdaPlanner, which allows the LLM agent to refine its self-generated plan adaptively in response to environmental feedback.
To mitigate hallucination, we develop a code-style LLM prompt structure that facilitates plan generation across a variety of tasks, environments, and agent capabilities.
arXiv Detail & Related papers (2023-05-26T05:52:27Z) - AutoPlan: Automatic Planning of Interactive Decision-Making Tasks With
Large Language Models [11.895111124804503]
AutoPlan is an approach to guide LLM-based agents to accomplish interactive decision-making tasks.
Our experiments show that AutoPlan achieves success rates on par with the baselines.
arXiv Detail & Related papers (2023-05-24T11:52:23Z) - Describe, Explain, Plan and Select: Interactive Planning with Large Language Models Enables Open-World Multi-Task Agents [26.78244595330595]
"$underlineD$escribe" is an interactive planning approach based on Large Language Models (LLMs)
DEPS facilitates better error correction on initial LLM-generated $textitplan$ by integrating $textitdescription$ of the plan execution process.
Experiments mark the milestone of the first zero-shot multi-task agent that can robustly accomplish 70+ Minecraft tasks.
arXiv Detail & Related papers (2023-02-03T06:06:27Z) - LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large
Language Models [27.318186938382233]
This study focuses on using large language models (LLMs) as a planner for embodied agents.
We propose a novel method, LLM-Planner, that harnesses the power of large language models to do few-shot planning.
arXiv Detail & Related papers (2022-12-08T05:46:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.