Benchmarking Large and Small MLLMs
- URL: http://arxiv.org/abs/2501.04150v1
- Date: Sat, 04 Jan 2025 07:44:49 GMT
- Title: Benchmarking Large and Small MLLMs
- Authors: Xuelu Feng, Yunsheng Li, Dongdong Chen, Mei Gao, Mengchen Liu, Junsong Yuan, Chunming Qiao
- Abstract summary: Large multimodal language models (MLLMs) have achieved remarkable advancements in understanding and generating multimodal content. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. Small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offer promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios.
- Score: 71.78055760441256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large multimodal language models (MLLMs) such as GPT-4V and GPT-4o have achieved remarkable advancements in understanding and generating multimodal content, showcasing superior quality and capabilities across diverse tasks. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. In contrast, the emergence of small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offers promising alternatives with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios. Despite their growing presence, the capability boundaries between large and small MLLMs remain underexplored. In this work, we conduct a systematic and comprehensive evaluation to benchmark both small and large MLLMs, spanning general capabilities such as object recognition, temporal reasoning, and multimodal comprehension, as well as real-world applications in domains like industry and automotive. Our evaluation reveals that small MLLMs can achieve comparable performance to large models in specific scenarios but lag significantly in complex tasks requiring deeper reasoning or nuanced understanding. Furthermore, we identify common failure cases in both small and large MLLMs, highlighting domains where even state-of-the-art models struggle. We hope our findings will guide the research community in pushing the quality boundaries of MLLMs, advancing their usability and effectiveness across diverse applications.
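The evaluation described in the abstract compares small and large MLLMs per task category and per application domain. The paper's actual harness is not reproduced here; the snippet below is a minimal, hypothetical sketch of such a comparison loop, where `query_model` is an assumed callable wrapping each model's inference API and exact-match scoring stands in for whatever metric a given benchmark uses.

```python
from collections import defaultdict

def exact_match(prediction: str, reference: str) -> bool:
    """Simple answer check; real benchmarks often use more tolerant matching."""
    return prediction.strip().lower() == reference.strip().lower()

def evaluate(models: dict, benchmark: list) -> dict:
    """Score each model per task category.

    models: maps a model name to a callable query_model(image, question) -> str.
    benchmark: list of dicts with "image", "question", "answer", "category".
    Both the sample format and the query interface are illustrative assumptions.
    """
    correct = defaultdict(lambda: defaultdict(int))
    total = defaultdict(int)
    for sample in benchmark:
        total[sample["category"]] += 1
        for name, query_model in models.items():
            answer = query_model(sample["image"], sample["question"])
            if exact_match(answer, sample["answer"]):
                correct[name][sample["category"]] += 1
    # Per-category accuracy lets small and large models be compared domain by domain.
    return {
        name: {cat: correct[name][cat] / total[cat] for cat in total}
        for name in models
    }
```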
Related papers
- Grounded Chain-of-Thought for Multimodal Large Language Models [66.04061083611863]
We propose a new learning task for multimodal large language models (MLLMs) called Grounded Chain-of-Thought (GCoT).
GCoT is designed to help MLLMs recognize and ground the relevant visual cues step by step, thereby predicting the correct answer with grounding coordinates as the intuitive basis.
To facilitate this task, we also carefully design and construct a dataset called multimodal grounded chain-of-thought (MM-GCoT) consisting of 24,022 GCoT examples for 5,033 images.
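The MM-GCoT entries pair reasoning steps with grounding coordinates. The abstract does not specify the dataset schema, so the following is a purely hypothetical illustration of what one grounded chain-of-thought example might look like; all field names and the coordinate convention are assumptions.

```python
# Hypothetical illustration of a grounded chain-of-thought example;
# field names and box format are assumptions, not the MM-GCoT schema.
gcot_example = {
    "image": "kitchen_0123.jpg",
    "question": "What is the person about to pick up?",
    "reasoning_steps": [
        {"thought": "Locate the person in the scene.", "box": [120, 45, 310, 400]},
        {"thought": "Their hand is reaching toward the counter.", "box": [250, 210, 330, 290]},
        {"thought": "The object under the hand is a red mug.", "box": [265, 240, 320, 300]},
    ],
    "answer": "a red mug",
    "answer_box": [265, 240, 320, 300],  # grounding coordinates backing the final answer
}
```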
arXiv Detail & Related papers (2025-03-17T04:07:47Z) - PEBench: A Fictitious Dataset to Benchmark Machine Unlearning for Multimodal Large Language Models [30.909294336713845]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable advancements in tasks such as visual question answering, visual understanding, and reasoning.
However, this impressive progress relies on vast amounts of data collected from the internet, raising significant concerns about privacy and security.
Machine unlearning (MU) has emerged as a promising solution, enabling the removal of specific knowledge from an already trained model without requiring retraining from scratch.
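One common machine-unlearning baseline is gradient ascent on the data to be forgotten. The sketch below illustrates that generic idea only; it is an assumption for exposition and not necessarily the method PEBench benchmarks.

```python
import torch

def gradient_ascent_unlearn(model, forget_loader, lr=1e-5, steps=100):
    """Minimal sketch of a gradient-ascent unlearning baseline:
    increase the loss on the forget set so the model no longer fits it.
    Generic illustration, not the specific approach evaluated in PEBench."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    it = iter(forget_loader)
    for _ in range(steps):
        try:
            inputs, labels = next(it)
        except StopIteration:
            it = iter(forget_loader)
            inputs, labels = next(it)
        loss = torch.nn.functional.cross_entropy(model(inputs), labels)
        optimizer.zero_grad()
        (-loss).backward()  # negate the loss: gradient *ascent* on the forget set
        optimizer.step()
    return model
```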
arXiv Detail & Related papers (2025-03-16T15:26:20Z) - DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving [20.644133177870852]
Multimodal large language models (MLLMs) can combine multiple modalities such as images, video, audio, and text.
Most MLLMs require very high computational resources, which poses a major challenge for most researchers and developers.
In this paper, we explored the utility of small-scale MLLMs and applied small-scale MLLMs to the field of autonomous driving.
arXiv Detail & Related papers (2025-01-09T09:02:41Z) - A Comprehensive Survey of Small Language Models in the Era of Large Language Models: Techniques, Enhancements, Applications, Collaboration with LLMs, and Trustworthiness [31.758459020683574]
Small Language Models (SLMs) are increasingly favored for their low inference latency, cost-effectiveness, efficient development, and easy customization and adaptability. These models are particularly well-suited for resource-limited environments and domain knowledge acquisition. We propose defining SLMs by their capability to perform specialized tasks and suitability for resource-constrained settings.
arXiv Detail & Related papers (2024-11-04T04:43:01Z) - Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieve 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from a large MLLM (l-MLLM) to a small MLLM (s-MLLM).
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of the l-MLLM and the s-MLLM, as sketched below.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
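MDist, as described above, minimizes the divergence between the teacher (l-MLLM) and student (s-MLLM) output distributions. The exact LLaVA-KD objective is not reproduced here; the sketch below shows a generic KL-divergence distillation term over next-token logits, with the temperature and scaling chosen as in standard knowledge distillation rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KL-divergence distillation over next-token logits.

    student_logits / teacher_logits: tensors of shape (batch, seq_len, vocab).
    The temperature and exact formulation are assumptions; they may differ
    from the MDist objective used in LLaVA-KD.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    # batchmean KL, rescaled by T^2 as in standard knowledge distillation
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2
```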
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically surveys the applications of MLLMs in multimodal tasks spanning natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - Efficient Multimodal Large Language Models: A Survey [60.7614299984182]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning.
However, their large model sizes and high training and inference costs have hindered the widespread application of MLLMs in academia and industry.
This survey provides a comprehensive and systematic review of the current state of efficient MLLMs.
arXiv Detail & Related papers (2024-05-17T12:37:10Z) - A Survey on Multimodal Large Language Models [71.63375558033364]
Multimodal Large Language Models (MLLMs), represented by GPT-4V, have become a new rising research hotspot. This paper aims to trace and summarize the recent progress of MLLMs.
arXiv Detail & Related papers (2023-06-23T15:21:52Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.