Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage
and Sharing in LLMs
- URL: http://arxiv.org/abs/2311.15759v1
- Date: Mon, 27 Nov 2023 12:29:20 GMT
- Title: Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage
and Sharing in LLMs
- Authors: Yunxin Li, Baotian Hu, Wei Wang, Xiaochun Cao, Min Zhang
- Abstract summary: We propose an approach called MKS2, aimed at enhancing LLMs through multimodal knowledge storage and sharing.
Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently.
Our experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge.
- Score: 72.49064988035126
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advancements in multimodal large language models (MLLMs) have achieved
significant multimodal generation capabilities, akin to GPT-4. These models
predominantly map visual information into language representation space,
leveraging the vast knowledge and powerful text generation abilities of LLMs to
produce multimodal instruction-following responses. This paradigm can be termed
LLMs for Vision, since it employs LLMs for visual-language understanding; yet we
observe that these MLLMs neglect the potential of harnessing visual knowledge to
enhance the overall capabilities of LLMs, which could be regarded as Vision
Enhancing LLMs. In this paper, we propose an approach called
MKS2, aimed at enhancing LLMs through empowering Multimodal Knowledge Storage
and Sharing in LLMs. Specifically, we introduce the Modular Visual Memory, a
component integrated into the internal blocks of LLMs, designed to store
open-world visual information efficiently. Additionally, we present a soft
Mixtures-of-Multimodal Experts architecture in LLMs to invoke multimodal
knowledge collaboration during generation. Our comprehensive experiments
demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs
in contexts necessitating physical or commonsense knowledge. It also delivers
competitive results on multimodal benchmarks.
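To make the two components concrete, below is a minimal PyTorch sketch of how a Modular Visual Memory and a soft mixture of a text expert and a visual-memory expert could be wired into a transformer block. All module names, sizes, and routing details are illustrative assumptions for exposition, not the paper's actual MKS2 implementation.

```python
# Minimal sketch (assumptions, not the paper's code): the Modular Visual Memory
# is modeled as learnable key-value slots queried by attention, and the soft
# Mixtures-of-Multimodal Experts as a learned gate that blends a textual FFN
# expert with a visual-memory expert inside a transformer block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModularVisualMemory(nn.Module):
    """Learnable key-value slots meant to store open-world visual knowledge."""

    def __init__(self, d_model: int, num_slots: int = 256):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model); attend over memory slots and read them out.
        attn = F.softmax(h @ self.keys.t() / self.keys.size(-1) ** 0.5, dim=-1)
        return attn @ self.values


class SoftMultimodalExpertsBlock(nn.Module):
    """FFN replaced by a soft mixture of a text expert and a visual-memory
    expert, weighted per token by a learned gate."""

    def __init__(self, d_model: int, d_ff: int = 2048, num_slots: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.text_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.visual_expert = ModularVisualMemory(d_model, num_slots)
        self.gate = nn.Linear(d_model, 2)  # soft routing over the two experts

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        x = self.norm(h)
        w = F.softmax(self.gate(x), dim=-1)      # (batch, seq, 2)
        text_out = self.text_expert(x)
        vis_out = self.visual_expert(x)
        mixed = w[..., :1] * text_out + w[..., 1:] * vis_out
        return h + mixed                         # residual connection


if __name__ == "__main__":
    block = SoftMultimodalExpertsBlock(d_model=512)
    hidden = torch.randn(2, 16, 512)             # (batch, seq, d_model)
    print(block(hidden).shape)                   # torch.Size([2, 16, 512])
```

In this sketch the gate produces per-token soft weights over the two experts, so visual knowledge held in the memory slots can be blended into generation without hard routing; the paper's actual gating and memory design may differ.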
Related papers
- LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation [60.02145113467427]
LLMs' strong textual understanding can improve CLIP's ability to handle image captions.
We propose LLM2CLIP, a novel approach that embraces the power of LLMs to unlock CLIP's potential.
Our experiments demonstrate that this approach brings substantial improvements in cross-modal tasks.
arXiv Detail & Related papers (2024-11-07T18:59:16Z)
- LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [70.19607283302712]
We propose a novel framework to transfer knowledge from l-MLLM to s-MLLM.
Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of l-MLLM and s-MLLM.
We also propose a three-stage training scheme to fully exploit the potential of s-MLLM.
arXiv Detail & Related papers (2024-10-21T17:41:28Z)
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs [40.74693126923826]
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities.
Training adapters with image-level supervision often results in significant misalignment.
We introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models.
arXiv Detail & Related papers (2024-08-21T17:58:02Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information.
We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs.
Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
- Knowledge Fusion of Large Language Models [73.28202188100646]
This paper introduces the notion of knowledge fusion for large language models (LLMs)
We externalize their collective knowledge and unique strengths, thereby elevating the capabilities of the target model beyond those of any individual source LLM.
Our findings confirm that the fusion of LLMs can improve the performance of the target model across a range of capabilities such as reasoning, commonsense, and code generation.
arXiv Detail & Related papers (2024-01-19T05:02:46Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.