MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large
Language Models
- URL: http://arxiv.org/abs/2304.10592v2
- Date: Mon, 2 Oct 2023 16:38:35 GMT
- Authors: Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny
- Abstract summary: GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text.
We present MiniGPT-4, which aligns a frozen visual encoder with a frozen advanced large language model.
We also observe other emerging capabilities in MiniGPT-4, including writing stories and poems inspired by given images.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such
as directly generating websites from handwritten text and identifying humorous
elements within images. These features are rarely observed in previous
vision-language models. However, the technical details behind GPT-4 remain
undisclosed. We believe that GPT-4's enhanced multi-modal generation
capabilities stem from its use of a sophisticated large language model (LLM).
To examine this hypothesis, we present MiniGPT-4, which aligns a frozen visual
encoder with a frozen advanced LLM, Vicuna, using a single projection layer.
Our work uncovers, for the first time, that properly aligning visual features
with an advanced large language model can yield many of the advanced
multi-modal abilities demonstrated by GPT-4, such as detailed image description
generation and website creation from hand-drawn drafts. Furthermore, we also
observe other emerging capabilities in MiniGPT-4, including writing stories and
poems inspired by given images, teaching users how to cook based on food
photos, and so on. In our experiments, we found that a model trained only on
short image-caption pairs could produce unnatural language outputs (e.g.,
repetition and fragmentation). To address this problem, we curate a detailed
image-description dataset in a second stage to finetune the model, which
consequently improves the model's generation reliability and overall usability.
Our code, pre-trained model, and collected dataset are available at
https://minigpt-4.github.io/.
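The core idea of the abstract — a single trainable projection layer bridging a frozen visual encoder and a frozen LLM — can be sketched as follows. This is a minimal, dependency-free illustration with hypothetical toy dimensions, not the authors' implementation; the function and variable names are assumptions.

```python
# Minimal sketch of MiniGPT-4's alignment idea: one trainable linear
# projection maps frozen visual-encoder features into the frozen LLM's
# embedding space. All names and dimensions here are illustrative.

def project(visual_feats, weight, bias):
    """Apply one linear projection: (n_tokens x d_vis) -> (n_tokens x d_llm)."""
    d_llm = len(bias)
    out = []
    for feat in visual_feats:
        row = [sum(f * weight[i][j] for i, f in enumerate(feat)) + bias[j]
               for j in range(d_llm)]
        out.append(row)
    return out

# Toy dimensions for illustration; real models use far larger ones.
d_vis, d_llm = 4, 6
visual_feats = [[0.1] * d_vis, [0.2] * d_vis]   # two frozen visual tokens
weight = [[0.5] * d_llm for _ in range(d_vis)]  # the only trainable weights
bias = [0.0] * d_llm

llm_inputs = project(visual_feats, weight, bias)
# The projected tokens would then be prepended to the text-token embeddings
# and fed to the frozen LLM; only `weight`/`bias` receive gradients.
```

In training, both the visual encoder and the LLM stay frozen, so gradient updates touch only this projection — which is why the alignment stage is comparatively cheap.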
Related papers
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in a source language into an image containing the corresponding translation in a target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves performance competitive with cascaded models while using only 70.9% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we craft a new VEGA dataset, tailored to the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4 [0.0]
Vision-Language Models (VLMs) have recently seen significant advancements through integration with Large Language Models (LLMs).
In this paper, we extend and fine-tune MiniGPT-4 for the reverse designing task.
arXiv Detail & Related papers (2024-06-03T03:59:29Z)
- Exploring the Distinctiveness and Fidelity of the Descriptions Generated by Large Vision-Language Models [16.524244395901356]
We study how models like Open-Flamingo, IDEFICS, and MiniGPT-4 can distinguish between similar objects and accurately describe visual features.
We propose the Textual Retrieval-Augmented Classification (TRAC) framework, which enables a deeper analysis of fine-grained visual description generation.
arXiv Detail & Related papers (2024-04-26T16:59:26Z)
- MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens [36.02433030551474]
MiniGPT4-Video is a multimodal Large Language Model (LLM) designed specifically for video understanding.
MiniGPT4-Video considers not only visual content but also textual conversations, allowing the model to effectively answer queries involving both visual and textual components.
arXiv Detail & Related papers (2024-04-04T12:46:01Z)
- GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation [103.56612788682973]
GPT4Video is a unified multimodal framework that empowers Large Language Models with the capability of both video understanding and generation.
Specifically, we develop an instruction-following-based approach integrated with the Stable Diffusion generative model, which has been shown to handle video generation scenarios effectively and securely.
arXiv Detail & Related papers (2023-11-25T04:05:59Z)
- AltDiffusion: A Multilingual Text-to-Image Diffusion Model [4.534546889526814]
We present AltDiffusion, a novel multilingual T2I diffusion model that supports eighteen different languages.
Specifically, we first train a multilingual text encoder based on knowledge distillation.
Then we plug it into a pretrained English-only diffusion model and train the model with a two-stage scheme to enhance its multilingual capability.
arXiv Detail & Related papers (2023-08-19T11:52:12Z)
- LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding [85.39419609430453]
This work enhances the current visual instruction tuning pipeline with text-rich images.
We first use publicly available OCR tools to collect results on 422K text-rich images from the LAION dataset.
We prompt text-only GPT-4 with recognized texts and image captions to generate 16K conversations, each containing question-answer pairs for text-rich images.
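The LLaVAR data-generation step described above — prompting a text-only LLM with OCR results and a caption, never the image itself — might be sketched like this. The prompt template and all names below are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical sketch of LLaVAR-style data generation: a text-only LLM
# is prompted with OCR-recognized text and an image caption (never the
# pixels) to produce Q&A pairs about a text-rich image.

def build_prompt(caption, ocr_texts):
    """Compose the text-only prompt handed to the LLM (illustrative template)."""
    ocr_block = "\n".join(f"- {t}" for t in ocr_texts)
    return (
        "You are shown the caption and OCR-recognized text of an image.\n"
        f"Caption: {caption}\n"
        f"Recognized text:\n{ocr_block}\n"
        "Generate a question-answer pair about the image's textual content."
    )

prompt = build_prompt(
    caption="A storefront with a red awning",
    ocr_texts=["OPEN 24 HOURS", "Joe's Diner"],
)
# `prompt` would then be sent to a text-only model such as GPT-4, and the
# returned Q&A pairs collected as instruction-tuning conversations.
```

The key design point is that the generator model never sees the image: the OCR text and caption stand in for it, which is what makes a text-only LLM usable for building visual instruction data.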
arXiv Detail & Related papers (2023-06-29T17:08:16Z)
- Visual Instruction Tuning [79.70923292053097]
We present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data.
By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%.
arXiv Detail & Related papers (2023-04-17T17:59:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.