MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large
Language Models
- URL: http://arxiv.org/abs/2304.10592v2
- Date: Mon, 2 Oct 2023 16:38:35 GMT
- Title: MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large
Language Models
- Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
- Abstract summary: GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text.
We present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced large language model.
We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images.
- Score: 41.84885546518666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent GPT-4 has demonstrated extraordinary multi-modal abilities,
such as directly generating websites from handwritten text and identifying
humorous elements within images. These features are rarely observed in previous
vision-language models. However, the technical details behind GPT-4 remain
undisclosed. We believe that the enhanced multi-modal generation capabilities of
GPT-4 stem from the use of sophisticated large language models (LLMs). To
examine this phenomenon, we present MiniGPT-4, which aligns a frozen visual
encoder with a frozen advanced LLM, Vicuna, using a single projection layer.
Our work uncovers, for the first time, that properly aligning the visual
features with an advanced large language model can yield numerous advanced
multi-modal abilities demonstrated by GPT-4, such as detailed image description
generation and website creation from hand-drawn drafts. We also observe other
emerging capabilities in MiniGPT-4, including writing stories and poems
inspired by given images and teaching users how to cook based on food photos.
In our experiments, we found that a model trained only on short image-caption
pairs can produce unnatural language outputs (e.g., repetition and
fragmentation). To address this problem, we curate a detailed image-description
dataset for a second finetuning stage, which consequently improves the model's
generation reliability and overall usability. Our code, pre-trained model, and
collected dataset are available at https://minigpt-4.github.io/.
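The alignment recipe described in the abstract (a frozen visual encoder, a frozen Vicuna LLM, and a single trainable projection layer between them) can be illustrated with a minimal PyTorch sketch. The tiny encoder and transformer below are toy stand-ins for the paper's pretrained components, and the names and dimensions (VisionLLMAligner, ToyEncoder, vision_dim, llm_dim) are illustrative assumptions, not the released MiniGPT-4 implementation.

```python
# Minimal sketch of the MiniGPT-4-style alignment idea (illustrative only):
# a single trainable linear projection maps frozen visual-encoder features
# into the frozen LLM's input-embedding space. The projected image tokens are
# prepended to the text embeddings, and only the projection layer is trained.
import torch
import torch.nn as nn


class VisionLLMAligner(nn.Module):  # hypothetical name, not from the paper's code
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.llm = llm
        # Freeze both pretrained components; only self.proj receives gradients.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.llm.parameters():
            p.requires_grad = False
        self.proj = nn.Linear(vision_dim, llm_dim)  # the single projection layer

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            vis_feats = self.vision_encoder(images)   # (B, N_vis, vision_dim)
        vis_tokens = self.proj(vis_feats)             # (B, N_vis, llm_dim)
        # Condition the frozen LLM on the projected image tokens.
        return self.llm(torch.cat([vis_tokens, text_embeds], dim=1))


class ToyEncoder(nn.Module):
    """Toy stand-in that maps an image to one visual token (not a real ViT)."""
    def __init__(self, vision_dim: int):
        super().__init__()
        self.fc = nn.Linear(3 * 16 * 16, vision_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc(x.flatten(1)).unsqueeze(1)     # (B, 1, vision_dim)


# Small toy dimensions so the sketch runs end to end; the real model uses a
# pretrained ViT-based encoder and Vicuna (hidden size 4096) instead.
vision_dim, llm_dim = 64, 128
toy_llm = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=llm_dim, nhead=4, batch_first=True),
    num_layers=1,
)

model = VisionLLMAligner(ToyEncoder(vision_dim), toy_llm, vision_dim, llm_dim)
images = torch.randn(2, 3, 16, 16)        # fake image batch
text_embeds = torch.randn(2, 5, llm_dim)  # fake caption-token embeddings
print(model(images, text_embeds).shape)   # torch.Size([2, 6, 128])
```

In the two-stage training the abstract describes, the same frozen components would be reused in the second stage, with the projection layer further finetuned on the curated detailed image-description dataset to reduce repetition and fragmentation.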
Related papers
- MIO: A Foundation Model on Multimodal Tokens [74.85153216521945] (2024-09-26)
We introduce MIO, a novel foundation model built on multimodal tokens.
MIO is capable of understanding and generating speech, text, images, and videos in an end-to-end, autoregressive manner.
- FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs [58.95386070800286] (2024-09-20)
FullAnno is a data engine that generates large-scale, high-quality, and fine-grained image annotations.
We re-annotated the COCO and Visual Genome datasets using our FullAnno system.
Experiments show that the regenerated annotations can significantly enhance the capabilities of LLaVA-v1.5 on several benchmarks.
- Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models [38.52953013858373] (2024-09-16)
We introduce Playground v3 (PGv3), our latest text-to-image model.
It achieves state-of-the-art (SoTA) performance across multiple testing benchmarks.
It excels in text prompt adherence, complex reasoning, and accurate text rendering.
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228] (2024-06-14)
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
- Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models [16.524244395901356] (2024-04-26)
We study how models like Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between similar objects and accurately describe visual features.
We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which allows us to delve deeper into analyzing fine-grained visual description generation.
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens [36.02433030551474] (2024-04-04)
MiniGPT4-Video is a multimodal Large Language Model (LLM) designed specifically for video understanding.
MiniGPT4-Video not only considers visual content but also incorporates textual conversations, allowing the model to effectively answer queries involving both visual and textual components.
- Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749] (2023-08-31)
Multimodal instruction-following models extend capabilities by integrating both text and images.
Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images.
We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions.
We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
- Visual Instruction Tuning [79.70923292053097] (2023-04-17)
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.