Grounding Multimodal Large Language Models in Actions
- URL: http://arxiv.org/abs/2406.07904v1
- Date: Wed, 12 Jun 2024 06:12:04 GMT
- Title: Grounding Multimodal Large Language Models in Actions
- Authors: Andrew Szot, Bogdan Mazoure, Harsh Agrawal, Devon Hjelm, Zsolt Kira, Alexander Toshev,
- Abstract summary: We study how to best ground a MLLM into different embodiments and their associated action spaces.
For continuous actions, we show that a learned tokenization allows for sufficient modeling precision.
For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance.
- Score: 65.88208317380793
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated a wide range of capabilities across many domains, including Embodied AI. In this work, we study how to best ground a MLLM into different embodiments and their associated action spaces, with the goal of leveraging the multimodal world knowledge of the MLLM. We first generalize a number of methods through a unified architecture and the lens of action space adaptors. For continuous actions, we show that a learned tokenization allows for sufficient modeling precision, yielding the best performance on downstream tasks. For discrete actions, we demonstrate that semantically aligning these actions with the native output token space of the MLLM leads to the strongest performance. We arrive at these lessons via a thorough study of seven action space adapters on five different environments, encompassing over 114 embodied tasks.
Related papers
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of MLLMs with parameters ranging from 1B to 4B, which achieves 90% of the performance with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks [74.52259252807191]
Multimodal Large Language Models (MLLMs) address the complexities of real-world applications far beyond the capabilities of single-modality systems.
This paper systematically sorts out the applications of MLLM in multimodal tasks such as natural language, vision, and audio.
arXiv Detail & Related papers (2024-08-02T15:14:53Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - SilverSight: A Multi-Task Chinese Financial Large Language Model Based on Adaptive Semantic Space Learning [4.540505713937026]
This study introduces an Adaptive Semantic Space Learning (ASSL) framework to enhance the performance and selection efficacy of multi-expert models.
Our research findings demonstrate that our framework can achieve results close to those obtained with full data training using only 10% of the data, while also exhibiting strong generalization capabilities.
arXiv Detail & Related papers (2024-04-07T13:02:21Z) - Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models [25.724995114710165]
We investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha.
Our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks.
arXiv Detail & Related papers (2024-03-10T12:43:27Z) - u-LLaVA: Unifying Multi-Modal Tasks via Large Language Model [17.3535277338312]
u-LLaVA is an innovative unifying multi-task framework that integrates pixel, regional, and global features to refine the perceptual faculties of MLLMs.
This work contributes a novel mask-based multi-task dataset comprising 277K samples, crafted to challenge and assess the fine-grained perception capabilities of MLLMs.
arXiv Detail & Related papers (2023-11-09T13:18:27Z) - Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE [85.76186554492543]
Large Language Models (LLMs) can extend their zero-shot capabilities to multimodal learning through instruction tuning.
negative conflicts and interference may have a worse impact on performance.
We propose a novel framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with MLLMs.
arXiv Detail & Related papers (2023-11-05T15:48:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.