LION : Empowering Multimodal Large Language Model with Dual-Level Visual
Knowledge
- URL: http://arxiv.org/abs/2311.11860v2
- Date: Sun, 26 Nov 2023 10:10:55 GMT
- Title: LION : Empowering Multimodal Large Language Model with Dual-Level Visual
Knowledge
- Authors: Gongwei Chen, Leyang Shen, Rui Shao, Xiang Deng, Liqiang Nie
- Abstract summary: Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals.
Most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs, leading to insufficient extraction and reasoning of visual knowledge.
We propose a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION), which empowers the MLLM by injecting visual knowledge in two levels.
- Score: 58.82222646803248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability
to perceive and understand multi-modal signals. However, most of the existing
MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text
pairs, leading to insufficient extraction and reasoning of visual knowledge. To
address this issue, we devise a dual-Level vIsual knOwledge eNhanced Multimodal
Large Language Model (LION), which empowers the MLLM by injecting visual
knowledge in two levels. 1) Progressive incorporation of fine-grained
spatial-aware visual knowledge. We design a vision aggregator that cooperates with
region-level vision-language (VL) tasks to incorporate fine-grained
spatial-aware visual knowledge into the MLLM. To alleviate the conflict between
image-level and region-level VL tasks during incorporation, we devise a
dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This
progressive incorporation scheme contributes to the mutual promotion between
these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual
evidence. We facilitate the MLLM with high-level semantic visual evidence by
leveraging diverse image tags. To mitigate the potential influence caused by
imperfect predicted tags, we propose a soft prompting method by embedding a
learnable token into the tailored text instruction. Comprehensive experiments
on several multi-modal benchmarks demonstrate the superiority of our model
(e.g., improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over
InstructBLIP, 5% accuracy on RefCOCOg over Kosmos-2).
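To make the two mechanisms above more concrete, here is a minimal sketch of a mixture-of-adapters layer of the kind the stage-wise instruction tuning could route between: one lightweight adapter for image-level VL tasks, one for region-level VL tasks, and a small router that mixes them. The adapter structure, the router, and all dimensions are illustrative assumptions, not the released LION code.

```python
# Hedged sketch of a mixture-of-adapters for stage-wise instruction tuning:
# separate lightweight adapters for image-level and region-level VL tasks are
# combined by a learned router so the two task types interfere less.
# Everything here (bottleneck size, router, dims) is an assumption.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))   # residual adapter


class MixtureOfAdapters(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.adapters = nn.ModuleList([Adapter(dim), Adapter(dim)])  # image / region
        self.router = nn.Linear(dim, len(self.adapters))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim) activations inside the frozen LLM block.
        weights = self.router(hidden).softmax(dim=-1)                 # (b, s, 2)
        outs = torch.stack([a(hidden) for a in self.adapters], dim=-1)
        return (outs * weights.unsqueeze(-2)).sum(dim=-1)
```

And a similarly hedged sketch of the soft prompting of tag-based visual evidence: predicted image tags are written into a tailored text instruction, and a single learnable token is prepended to the embedded instruction so the model can down-weight imperfect tags. The tag template and module names are hypothetical.

```python
# Hedged sketch of soft prompting with image tags; not the authors' code.
import torch
import torch.nn as nn


class SoftTagPrompt(nn.Module):
    def __init__(self, embed_dim: int = 4096):
        super().__init__()
        # Learnable soft token that modulates reliance on possibly noisy tags.
        self.soft_token = nn.Parameter(torch.randn(1, 1, embed_dim) * 0.02)

    @staticmethod
    def build_instruction(tags: list[str], question: str) -> str:
        # Hypothetical template exposing predicted tags as textual evidence.
        return f"Tags: {', '.join(tags)}. According to the tags, {question}"

    def forward(self, instruction_embeds: torch.Tensor) -> torch.Tensor:
        # instruction_embeds: (batch, seq, embed_dim) from the LLM's embedding layer.
        soft = self.soft_token.expand(instruction_embeds.size(0), -1, -1)
        return torch.cat([soft, instruction_embeds], dim=1)


if __name__ == "__main__":
    prompt = SoftTagPrompt(embed_dim=64)
    print(prompt.build_instruction(["dog", "frisbee"], "what is the dog doing?"))
    print(prompt(torch.randn(2, 16, 64)).shape)          # torch.Size([2, 17, 64])
```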
Related papers
- SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs [40.74693126923826]
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities.
Training adapters with image-level supervision often results in significant misalignment.
We introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models.
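The summary gives no implementation detail; as a rough sketch under stated assumptions, token-level alignment could amount to pulling each adapter output token toward a paired target embedding from a vision-language pre-trained model with a cosine loss. The pairing scheme and names below are hypothetical, not the paper's published procedure.

```python
# Hedged sketch of a token-level alignment loss in the spirit of SEA's summary.
import torch
import torch.nn.functional as F


def token_alignment_loss(adapter_tokens: torch.Tensor,
                         target_embeds: torch.Tensor) -> torch.Tensor:
    # Both tensors: (batch, num_tokens, dim), paired token by token.
    a = F.normalize(adapter_tokens, dim=-1)
    t = F.normalize(target_embeds, dim=-1)
    return (1.0 - (a * t).sum(dim=-1)).mean()   # 1 - cosine similarity per token


if __name__ == "__main__":
    print(token_alignment_loss(torch.randn(2, 32, 512), torch.randn(2, 32, 512)).item())
```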
arXiv Detail & Related papers (2024-08-21T17:58:02Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
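Purely as an illustration of the general idea (the specific prompting scheme is not described in this summary), fine-grained knowledge such as a mask from a specialized segmentation model could be rendered onto the image before it is fed to the MLLM; the tensor layout and blending rule below are assumptions.

```python
# Illustrative-only: overlay an externally predicted mask on the image as a
# visual prompt for an MLLM. Layout and blending are assumptions.
import torch


def overlay_mask(image: torch.Tensor, mask: torch.Tensor,
                 color=(1.0, 0.0, 0.0), alpha: float = 0.4) -> torch.Tensor:
    # image: (3, H, W) in [0, 1]; mask: (H, W) bool from a segmentation model.
    color_t = torch.tensor(color).view(3, 1, 1)
    m = mask.unsqueeze(0).float()
    return image * (1 - alpha * m) + color_t * (alpha * m)


if __name__ == "__main__":
    msk = torch.zeros(224, 224, dtype=torch.bool)
    msk[64:160, 64:160] = True
    print(overlay_mask(torch.rand(3, 224, 224), msk).shape)   # (3, 224, 224)
```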
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
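As a hedged sketch of what a dense, plug-and-play connector could look like (layer selection, fusion by concatenation, the token-reduction rule, and all dimensions are assumptions): features from several vision-encoder layers are fused and projected into the LLM embedding space, with token pooling standing in for the efficient variant.

```python
# Hedged sketch of a dense vision-language connector; not the paper's code.
import torch
import torch.nn as nn


class DenseConnectorSketch(nn.Module):
    def __init__(self, vis_dim: int = 1024, num_layers: int = 3,
                 llm_dim: int = 4096, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio                    # e.g. keep ~25% of tokens
        self.proj = nn.Sequential(nn.Linear(vis_dim * num_layers, llm_dim),
                                  nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, layer_feats: list[torch.Tensor]) -> torch.Tensor:
        # layer_feats: per-layer features, each (batch, num_tokens, vis_dim).
        fused = torch.cat(layer_feats, dim=-1)          # channel-wise fusion
        b, n, d = fused.shape
        stride = max(1, round(1 / self.keep_ratio))     # crude token reduction
        n_keep = n // stride * stride
        fused = fused[:, :n_keep].reshape(b, n_keep // stride, stride, d).mean(2)
        return self.proj(fused)                         # into LLM embedding space


if __name__ == "__main__":
    feats = [torch.randn(2, 256, 1024) for _ in range(3)]
    print(DenseConnectorSketch()(feats).shape)          # torch.Size([2, 64, 4096])
```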
arXiv Detail & Related papers (2024-05-22T16:25:03Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short in comprehending context that involves multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- Large Model Based Referring Camouflaged Object Detection [51.80619142347807]
Referring camouflaged object detection (Ref-COD) is a recently-proposed problem aiming to segment out specified camouflaged objects matched with a textual or visual reference.
Our motivation is to make full use of the semantic intelligence and intrinsic knowledge of recent Multimodal Large Language Models (MLLMs) to decompose this complex task in a human-like way.
We propose a large-model-based Multi-Level Knowledge-Guided multimodal method for Ref-COD termed MLKG.
arXiv Detail & Related papers (2023-11-28T13:45:09Z)
- Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics [32.123919380959485]
Multi-modal large language models (MLLMs) are trained on top of large language models (LLMs).
While they excel in multi-modal tasks, the pure NLP abilities of MLLMs are often underestimated and left untested.
We show that visual instruction tuning, a prevailing strategy for transitioning LLMs into MLLMs, unexpectedly and interestingly helps models attain both improved truthfulness and ethical alignment.
arXiv Detail & Related papers (2023-09-13T17:57:21Z)
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
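The abstract only says that the attention spaces for vision and language are separated; one way to illustrate the idea (head counts, dimensions, and the additive fusion are assumptions, not DiMBERT's published architecture) is to let the same queries attend to vision tokens and language tokens through two independent attention modules:

```python
# Hedged illustration of disentangled multimodal attention; not DiMBERT's code.
import torch
import torch.nn as nn


class DisentangledAttentionSketch(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, queries, lang_tokens, vis_tokens):
        # Each modality gets its own attention space ...
        lang_out, _ = self.lang_attn(queries, lang_tokens, lang_tokens)
        vis_out, _ = self.vis_attn(queries, vis_tokens, vis_tokens)
        # ... and the two context vectors are merged afterwards.
        return lang_out + vis_out


if __name__ == "__main__":
    m = DisentangledAttentionSketch()
    q = torch.randn(2, 20, 768)
    print(m(q, torch.randn(2, 20, 768), torch.randn(2, 50, 768)).shape)
```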
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.