MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
- URL: http://arxiv.org/abs/2401.10727v2
- Date: Wed, 24 Jan 2024 03:34:41 GMT
- Title: MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
- Authors: Chenyu Wang, Weixin Luo, Qianyu Chen, Haonan Mai, Jindi Guo, Sixun
Dong, Xiaohua (Michael) Xuan, Zhengxin Li, Lin Ma, Shenghua Gao
- Abstract summary: We propose MLLM-Tool, a system incorporating open-source large language models and multi-modal encoders.
The trained LLMs can perceive multi-modal input instructions and then correctly select the function-matched tool.
Experiments reveal that MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions.
- Score: 38.610185966889226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the astonishing performance of large language models (LLMs) in
natural language comprehension and generation tasks has triggered extensive
exploration of using them as central controllers for building agent systems.
Multiple studies focus on bridging LLMs to external tools to extend the
application scenarios. However, current LLMs perceive tool-use instructions
only through a single text query, which can leave the user's real intention
ambiguous. LLMs are expected to resolve that ambiguity by perceiving the
information in visually or auditorily grounded instructions. Therefore, in
this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and
multi-modal encoders, so that the trained LLMs can perceive multi-modal input
instructions and then correctly select the function-matched tool. To
facilitate evaluation of the model's capability, we collect a dataset of
tools with multi-modal inputs from HuggingFace. Another important feature of
our dataset is that it contains multiple potential choices for the same
instruction, owing to the existence of identical and synonymous functions,
which provides more potential solutions for the same query. The experiments
reveal that MLLM-Tool is capable of recommending appropriate tools for
multi-modal instructions. Codes and data are available at
https://github.com/MLLM-Tool/MLLM-Tool.
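The core idea in the abstract, matching a fused multi-modal instruction embedding against tool descriptions, can be sketched minimally as below. This is an illustrative assumption, not the paper's actual architecture: the embeddings here are toy vectors, and in MLLM-Tool the selection is done by the trained LLM itself rather than by cosine similarity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def select_tool(query_vec, tool_embeddings):
    """Return the tool whose description embedding is closest to the
    fused (text + image/audio) query embedding."""
    return max(tool_embeddings, key=lambda name: cosine(query_vec, tool_embeddings[name]))

# Toy example: a fused multi-modal query embedding and three candidate tools
# (tool names and vectors are made up for illustration).
query = [0.9, 0.1, 0.2]
tools = {
    "image-captioning":   [1.0, 0.0, 0.1],
    "speech-recognition": [0.0, 1.0, 0.0],
    "text-to-image":      [0.1, 0.2, 1.0],
}
print(select_tool(query, tools))  # → image-captioning
```

A real system would produce `query` by passing the instruction and its attached image or audio through the multi-modal encoders described in the paper.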
Related papers
- Chain of Tools: Large Language Model is an Automatic Multi-tool Learner [54.992464510992605]
Automatic Tool Chain (ATC) is a framework that enables the large language models (LLMs) to act as a multi-tool user.
To scale up the scope of the tools, we next propose a black-box probing method.
For a comprehensive evaluation, we build a challenging benchmark named ToolFlow.
arXiv Detail & Related papers (2024-05-26T11:40:58Z)
- LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [54.954211216847135]
Existing large language models (LLMs) reach a tool-use correctness rate of only 30% to 60%.
We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE).
STE orchestrates three key mechanisms behind successful tool-use behaviors in biological systems: trial and error, imagination, and memory.
arXiv Detail & Related papers (2024-03-07T18:50:51Z)
- Look Before You Leap: Towards Decision-Aware and Generalizable Tool-Usage for Large Language Models [26.28459880766842]
We propose a decision-aware and generalizable tool-usage framework (DEER).
Specifically, we first construct the tool-usage samples with multiple decision branches via an automatic generation pipeline.
Our proposed DEER is effective and significantly outperforms baselines across various datasets.
arXiv Detail & Related papers (2024-02-26T16:11:03Z)
- Small LLMs Are Weak Tool Learners: A Multi-LLM Agent [73.54562551341454]
Large Language Model (LLM) agents significantly extend the capabilities of standalone LLMs.
We propose a novel approach that decomposes the required capabilities into three components: a planner, a caller, and a summarizer.
This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
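The planner/caller/summarizer decomposition above can be sketched as three stubbed roles wired together by an orchestrator. The function names, message format, and canned tool results below are hypothetical illustrations, not the paper's actual interface; in practice each role would be served by its own (possibly small) LLM.

```python
# Illustrative decomposition of a tool-using agent into three roles.
# Each stub stands in for a separate LLM call.

def planner(task):
    """Break the task into tool-call steps (stub: a fixed two-step plan)."""
    return [{"tool": "search", "args": {"query": task}},
            {"tool": "calculator", "args": {"expr": "2 + 2"}}]

def caller(step):
    """Format and execute one tool call (stub: canned results)."""
    results = {"search": "top hit for " + step["args"].get("query", ""),
               "calculator": "4"}
    return results[step["tool"]]

def summarizer(task, observations):
    """Fuse the tool observations into a final answer (stub: join them)."""
    return f"{task}: " + "; ".join(observations)

def run_agent(task):
    # Orchestrator: plan, execute each step, then summarize.
    observations = [caller(step) for step in planner(task)]
    return summarizer(task, observations)

print(run_agent("population of France"))
```

The modularity claimed in the summary falls out naturally here: any one of the three stubs can be swapped for a stronger model without touching the others.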
arXiv Detail & Related papers (2024-01-14T16:17:07Z)
- EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework that transforms diverse and lengthy tool documentation into unified, concise tool instructions.
It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [82.24774504584066]
Large language models (LLMs) have garnered significant attention due to their impressive natural language processing (NLP) capabilities.
We introduce MetaTool, a benchmark designed to evaluate whether LLMs have tool usage awareness and can correctly choose tools.
We conduct experiments involving eight popular LLMs and find that the majority of them still struggle to effectively select tools.
arXiv Detail & Related papers (2023-10-04T19:39:26Z)
- GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [41.36474802204914]
GPT4Tools is based on self-instruct to enable open-source LLMs, such as LLaMA and OPT, to use tools.
It generates an instruction-following dataset by prompting an advanced teacher with various multi-modal contexts.
arXiv Detail & Related papers (2023-05-30T05:27:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.