Empowering Multimodal LLMs with External Tools: A Comprehensive Survey
- URL: http://arxiv.org/abs/2508.10955v1
- Date: Thu, 14 Aug 2025 07:25:45 GMT
- Title: Empowering Multimodal LLMs with External Tools: A Comprehensive Survey
- Authors: Wenbin An, Jiahao Nie, Yaqiang Wu, Feng Tian, Shijian Lu, Qinghua Zheng,
- Abstract summary: Multimodal Large Language Models (MLLMs) have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. However, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols hinder the reliability and broader applicability of MLLMs. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools offers a promising strategy to overcome these challenges.
- Score: 61.66069828956139
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: By integrating the perception capabilities of multimodal encoders with the generative power of Large Language Models (LLMs), Multimodal Large Language Models (MLLMs), exemplified by GPT-4V, have achieved great success in various multimodal tasks, pointing toward a promising pathway to artificial general intelligence. Despite this progress, the limited quality of multimodal data, poor performance on many complex downstream tasks, and inadequate evaluation protocols continue to hinder the reliability and broader applicability of MLLMs across diverse domains. Inspired by the human ability to leverage external tools for enhanced reasoning and problem-solving, augmenting MLLMs with external tools (e.g., APIs, expert models, and knowledge bases) offers a promising strategy to overcome these challenges. In this paper, we present a comprehensive survey on leveraging external tools to enhance MLLM performance. Our discussion is structured along four key dimensions about external tools: (1) how they can facilitate the acquisition and annotation of high-quality multimodal data; (2) how they can assist in improving MLLM performance on challenging downstream tasks; (3) how they enable comprehensive and accurate evaluation of MLLMs; (4) the current limitations and future directions of tool-augmented MLLMs. Through this survey, we aim to underscore the transformative potential of external tools in advancing MLLM capabilities, offering a forward-looking perspective on their development and applications. The project page of this paper is publicly available at https://github.com/Lackel/Awesome-Tools-for-MLLMs.
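The tool-augmentation pattern the survey organizes its discussion around can be pictured as a request-dispatch-feedback loop: the MLLM emits a structured tool request, an external tool (API, expert model, or knowledge base) answers it, and the result is fed back into the model's context. Below is a minimal sketch of that loop; the function names, the JSON request format, and the stub tools are hypothetical placeholders, not an interface from the paper.

```python
# Minimal sketch of a tool-augmented MLLM loop. All names here
# (query_mllm, TOOLS, the JSON request format) are hypothetical.
import json

def ocr_tool(image_path: str) -> str:
    """Stand-in for an expert OCR model."""
    return f"<text extracted from {image_path}>"

def kb_lookup(entity: str) -> str:
    """Stand-in for a knowledge-base query."""
    return f"<facts about {entity}>"

TOOLS = {"ocr": ocr_tool, "kb_lookup": kb_lookup}

def query_mllm(prompt: str) -> str:
    """Stand-in for the MLLM; a real system would call the model here.
    The model either requests a tool as JSON or returns a final answer."""
    if "<text extracted" not in prompt:
        return json.dumps({"tool": "ocr", "arg": "receipt.png"})
    return "final answer: the receipt total is ..."

def tool_augmented_answer(question: str, max_steps: int = 3) -> str:
    context = question
    for _ in range(max_steps):
        reply = query_mllm(context)
        try:
            request = json.loads(reply)           # structured => tool request
        except json.JSONDecodeError:
            return reply                          # plain text => final answer
        result = TOOLS[request["tool"]](request["arg"])
        context += f"\n[{request['tool']} result] {result}"  # feed result back
    return reply

print(tool_augmented_answer("What is the total on this receipt? <receipt.png>"))
```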
Related papers
- Multi-Agent Evolve: LLM Self-Improve through Co-evolution [53.00458074754831]
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs).
Recent Self-Play RL methods, inspired by the success of the paradigm in games such as Go, aim to enhance LLM reasoning capabilities without human-annotated data.
We propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A.
arXiv Detail & Related papers (2025-10-27T17:58:02Z)
- VLM Q-Learning: Aligning Vision-Language Models for Interactive Decision-Making [45.02997774119763]
Vision-language models (VLMs) extend large language models (LLMs) to multi-modal data.
Our work approaches the challenges of adapting VLMs to interactive decision-making from an offline-to-online reinforcement learning (RL) perspective.
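For context, the Q-learning the title refers to builds on the standard temporal-difference backup sketched below. This is the textbook tabular form, not the paper's VLM-specific offline-to-online objective.

```python
# Textbook tabular Q-learning backup: regress Q(s, a) toward a
# bootstrapped target r + gamma * max_a' Q(s', a'). The toy state and
# action names are invented for the sketch.
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One temporal-difference backup on a dict-of-dicts Q-table."""
    target = r + gamma * max(Q[s_next][a2] for a2 in actions)
    Q[s][a] += alpha * (target - Q[s][a])

actions = ["click", "scroll"]
Q = {"s0": {a: 0.0 for a in actions}, "s1": {a: 0.0 for a in actions}}
q_update(Q, "s0", "click", r=1.0, s_next="s1", actions=actions)
print(Q["s0"]["click"])  # 0.1 after one update
```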
arXiv Detail & Related papers (2025-05-06T04:51:57Z)
- Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework [23.42251949130555]
Multimodal large language models (MLLMs) show promise in tasks like visual question answering (VQA).
Recent works adapt agentic frameworks or chain-of-thought (CoT) reasoning to improve performance.
We propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework.
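The uncertainty-aware idea can be illustrated with a simple confidence gate: tool evidence is handed to the MLLM only when a calibrated confidence clears a threshold, otherwise the model answers from the image directly. The tool, threshold, and calibration below are invented for the sketch and are not SRICE's actual mechanism.

```python
# Generic confidence-gated tool use; all values are illustrative.
from dataclasses import dataclass

@dataclass
class ToolOutput:
    answer: str
    confidence: float  # assumed calibrated to [0, 1]

def detector(image: str, query: str) -> ToolOutput:
    """Stand-in for an object detector / visual grounding tool."""
    return ToolOutput(answer="2 dogs", confidence=0.91)

def answer_with_gating(image: str, query: str, tau: float = 0.8) -> str:
    out = detector(image, query)
    if out.confidence >= tau:
        # High confidence: condition the MLLM's reasoning on tool evidence.
        return f"(using tool evidence '{out.answer}') => {out.answer}"
    # Low confidence: discard the tool output and fall back to direct VQA.
    return "(tool evidence discarded) => answer from the image directly"

print(answer_with_gating("park.jpg", "How many dogs are in the image?"))
```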
arXiv Detail & Related papers (2025-03-11T11:18:53Z)
- Benchmarking Large and Small MLLMs [71.78055760441256]
Large multimodal language models (MLLMs) have achieved remarkable advancements in understanding and generating multimodal content.
However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications.
Small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offer promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios.
arXiv Detail & Related papers (2025-01-04T07:44:49Z)
- A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks spanning natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z)
- The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and data does not follow two separate paths but is instead interconnected.
On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data.
To promote data-model co-development in the MLLM community, we systematically review existing works on MLLMs from this co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z)
- From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities [111.44485171421535]
We study the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities.
We regard these properties as representative factors that define the reliability of MLLMs.
We uncover 14 empirical findings that are useful to understand the capabilities and limitations of both proprietary and open-source MLLMs.
arXiv Detail & Related papers (2024-01-26T18:53:03Z)
- MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [40.32823306537386]
We propose MLLM-Tool, a system incorporating open-source large language models and multi-modal encoders.
Our dataset features multi-modal input tools from HuggingFace.
Experiments reveal that our MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions.
arXiv Detail & Related papers (2024-01-19T14:44:37Z)
- How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model [12.358079352117699]
We explore Multimodal Large Language Models (MLLMs), which integrate LLMs to handle multimodal data, including text, images, audio, and more.
MLLMs face challenges in addressing the semantic gap in multimodal data, which may lead to erroneous outputs.
Implementing effective modality alignment can help LLMs address environmental issues and enhance accessibility.
arXiv Detail & Related papers (2023-11-10T09:51:24Z)
- MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models [73.86954509967416]
Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks.
This paper presents MME, the first comprehensive evaluation benchmark for MLLMs.
It measures both perception and cognition abilities on a total of 14 subtasks.
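MME poses its questions in yes/no form, so scoring a model reduces to per-subtask accuracy. A minimal scoring loop under that framing is sketched below; the sample data and the predict() stub are placeholders, not MME's release code.

```python
# Per-subtask accuracy over yes/no questions, in the spirit of MME's
# protocol. The two samples and the model stub are invented.
from collections import defaultdict

samples = [  # (subtask, question, ground_truth)
    ("existence", "Is there a dog in the image? Please answer yes or no.", "yes"),
    ("ocr",       "Is the word 'exit' in the image? Please answer yes or no.", "no"),
]

def predict(question: str) -> str:
    """Stand-in for the MLLM under evaluation."""
    return "yes"

hits, totals = defaultdict(int), defaultdict(int)
for subtask, question, truth in samples:
    pred = predict(question).strip().lower()  # normalize the model's reply
    totals[subtask] += 1
    hits[subtask] += int(pred == truth)

for subtask in totals:
    print(f"{subtask}: acc = {hits[subtask] / totals[subtask]:.2f}")
```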
arXiv Detail & Related papers (2023-06-23T09:22:36Z)
- CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [74.22729793816451]
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability.
We propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization.
We evaluate CREATOR on the MATH and TabMWP benchmarks, which consist of challenging math competition problems and diverse tabular contents, respectively.
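The create-then-execute loop CREATOR describes can be sketched in a few lines: the LLM first emits a reusable, documented tool (a Python function) for the problem class, which is then materialized and applied to the concrete instance. The hard-coded source below stands in for the actual LLM call.

```python
# Condensed sketch of tool creation: generate tool source, then run it.
# llm_create_tool is a stub for the LLM; real systems sandbox the exec.
def llm_create_tool(problem: str) -> str:
    """Stand-in for the LLM's creation step: emit tool source code."""
    return (
        "def solve(a, b):\n"
        '    """Return the sum of two integers."""\n'
        "    return a + b\n"
    )

def run_created_tool(source: str, *args):
    """Execution step: materialize the generated tool and apply it."""
    namespace: dict = {}
    exec(source, namespace)      # in practice this runs in a sandbox
    return namespace["solve"](*args)

tool_src = llm_create_tool("What is 17 + 25?")
print(run_created_tool(tool_src, 17, 25))  # -> 42
```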
arXiv Detail & Related papers (2023-05-23T17:51:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.