BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
- URL: http://arxiv.org/abs/2307.08581v1
- Date: Mon, 17 Jul 2023 15:51:47 GMT
- Title: BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs
- Authors: Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, Bingyi
Kang
- Abstract summary: BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language.
Our contributions are two-fold: 1) An off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image.
Our experiments show that BuboGPT achieves impressive multi-modality understanding and visual grounding abilities during interactions with humans.
- Score: 101.50522135049198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLMs have demonstrated remarkable abilities at interacting with humans
through language, especially with the usage of instruction-following data.
Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further
enlarge their abilities by incorporating multi-modal inputs, including image,
video, and speech. Despite their effectiveness at generating precise and
detailed language understanding of the given modality signal, these LLMs give
up the ability to ground specific parts of the inputs, and thus construct only a
coarse-grained mapping. However, an explicit and informative correspondence
between text and other modalities will not only improve the user experience but
also help to expand the application scenarios of multi-modal LLMs. Therefore, we
propose BuboGPT, a multi-modal LLM with visual grounding that can perform
cross-modal interaction between vision, audio and language, providing
fine-grained understanding of visual objects and other given modalities. As a
result, BuboGPT is able to point out the specific location of an object in the
image when it generates a response or description for that object. Our
contributions are two-fold: 1) An off-the-shelf visual grounding module based
on SAM that extracts entities in a sentence and finds the corresponding masks in
the image. 2) A two-stage training scheme and an instruction dataset that endow
the model with joint text-image-audio understanding. Our experiments show that
BuboGPT achieves impressive multi-modality understanding and visual grounding
abilities during interactions with humans. It performs consistently well when
provided with arbitrary modality combinations (either aligned or unaligned).
Our code, model and dataset are available at https://bubo-gpt.github.io.
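The abstract describes contribution 1) only at a high level. The sketch below shows one plausible way to wire such an entity-to-mask pipeline, assuming spaCy noun chunks as a stand-in for the entity extractor, a hypothetical detect_box() helper in place of an open-vocabulary detector, and the public segment_anything API for mask prediction; it is an illustration, not the authors' implementation.
```python
# Minimal sketch of an "entities -> boxes -> SAM masks" grounding pipeline in the
# spirit of BuboGPT's visual grounding module. Assumptions (not from the paper):
# spaCy noun chunks stand in for the entity extractor, and detect_box() is a
# hypothetical placeholder for an open-vocabulary detector returning one xyxy
# box per phrase. SAM usage follows the public segment-anything API.
import numpy as np
import spacy
from segment_anything import SamPredictor, sam_model_registry

nlp = spacy.load("en_core_web_sm")


def extract_entities(sentence: str) -> list[str]:
    """Pull candidate entity phrases out of the LLM's generated response."""
    return [chunk.text for chunk in nlp(sentence).noun_chunks]


def detect_box(image: np.ndarray, phrase: str) -> np.ndarray | None:
    """Hypothetical placeholder: return an xyxy box for `phrase`, or None."""
    raise NotImplementedError("plug in an open-vocabulary detector here")


def ground_sentence(image: np.ndarray, sentence: str, sam_ckpt: str) -> dict:
    """Map each entity mentioned in `sentence` to a segmentation mask."""
    sam = sam_model_registry["vit_h"](checkpoint=sam_ckpt)
    predictor = SamPredictor(sam)
    predictor.set_image(image)  # HxWx3 uint8 RGB image

    entity_masks = {}
    for phrase in extract_entities(sentence):
        box = detect_box(image, phrase)
        if box is None:
            continue  # entity has no visual counterpart in this image
        masks, _, _ = predictor.predict(box=box, multimask_output=False)
        entity_masks[phrase] = masks[0]  # boolean HxW mask
    return entity_masks
```
Because every component in such a layout is frozen and off-the-shelf, grounding can be attached to an LLM's responses without retraining the language model, which matches the abstract's framing of the module as off-the-shelf.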
Related papers
- ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models [73.34709921061928]
We propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs).
We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them.
We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map (see the sketch after this entry).
arXiv: 2024-07-31
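As a rough, generic illustration of the ControlMLLM entry above (optimizing a learnable visual token against an attention-based energy), here is a minimal sketch; the tensor shapes, the energy definition, and the single-head attention are simplifying assumptions, not the paper's actual procedure.
```python
# Generic sketch, not ControlMLLM's actual algorithm: optimize one learnable
# visual token so that text-to-visual attention mass concentrates on a referred
# region. Tensor shapes and the energy definition are illustrative assumptions.
import torch
import torch.nn.functional as F


def optimize_visual_token(
    text_tokens: torch.Tensor,    # (T, d) frozen text-prompt embeddings
    visual_tokens: torch.Tensor,  # (V, d) frozen visual embeddings
    region_mask: torch.Tensor,    # (V,) bool, True inside the referred region
    steps: int = 50,
    lr: float = 1e-2,
) -> torch.Tensor:
    """Return a learnable token that strengthens attention on the region."""
    d = visual_tokens.shape[1]
    learnable = torch.zeros(1, d, requires_grad=True)
    opt = torch.optim.Adam([learnable], lr=lr)

    for _ in range(steps):
        # Append the learnable token and form a single-head attention map
        # from text tokens to visual tokens (scaled dot-product attention).
        vis = torch.cat([visual_tokens, learnable], dim=0)        # (V+1, d)
        attn = F.softmax(text_tokens @ vis.T / d ** 0.5, dim=-1)  # (T, V+1)

        # Energy: negative attention mass on the referred region; minimizing
        # it pushes attention toward that region.
        energy = -attn[:, :-1][:, region_mask].sum()
        opt.zero_grad()
        energy.backward()
        opt.step()

    return learnable.detach()
```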
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast, high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv: 2024-07-05
- Bridging Vision and Language Spaces with Assignment Prediction [47.04855334955006]
VLAP is a novel approach that bridges pretrained vision models and large language models (LLMs).
We harness well-established word embeddings to bridge two modality embedding spaces.
VLAP achieves substantial improvements over the previous linear transformation-based approaches.
arXiv: 2024-04-15
- Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception [63.03288425612792]
We propose AnyRef, a general MLLM that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references.
Our model achieves state-of-the-art results across multiple benchmarks, including diverse modality referring segmentation and region-level referring expression generation.
arXiv: 2024-03-05
- Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing [56.71450690166821]
We propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM).
VSP-LLM is designed to perform visual speech recognition and translation in a multi-task manner.
We show that VSP-LLM trained on just 30 hours of labeled data can more effectively translate lip movements.
arXiv: 2024-02-23
- InfMLLM: A Unified Framework for Visual-Language Tasks [44.29407348046122]
Multimodal large language models (MLLMs) have attracted growing interest.
This work delves into enabling LLMs to tackle more vision-language-related tasks.
InfMLLM achieves either state-of-the-art (SOTA) performance or performance comparable to recent MLLMs.
arXiv: 2023-11-12
- MIMIC-IT: Multi-Modal In-Context Instruction Tuning [44.879418596312554]
We present a dataset comprising 2.8 million multimodal instruction-response pairs, with 2.2 million unique instructions derived from images and videos.
Using the MIMIC-IT dataset, the Otter model demonstrates remarkable proficiency in multi-modal perception, reasoning, and in-context learning.
We release the MIMIC-IT dataset, instruction-response collection pipeline, benchmarks, and the Otter model.
arXiv: 2023-06-08
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [95.76661165594884]
mPLUG-Owl is a training paradigm that equips large language models (LLMs) with multi-modal abilities.
The training paradigm involves a two-stage method for aligning image and text, which learns visual knowledge with the assistance of the LLM.
Experimental results show that our model outperforms existing multi-modal models.
arXiv: 2023-04-27