LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
- URL: http://arxiv.org/abs/2312.02949v1
- Date: Tue, 5 Dec 2023 18:29:31 GMT
- Title: LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
- Authors: Hao Zhang, Hongyang Li, Feng Li, Tianhe Ren, Xueyan Zou, Shilong Liu,
Shijia Huang, Jianfeng Gao, Lei Zhang, Chunyuan Li, Jianwei Yang
- Abstract summary: The importance of grounding capability in large multi-modal models (LMMs) is increasingly recognized.
A key obstacle is the lack of a dataset for grounded visual chat (GVC).
We have created GVC data that allows for the combination of grounding and chat capabilities.
Our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities.
- Score: 105.7362622712606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the recent significant advancements in large multi-modal models (LMMs),
the importance of their grounding capability in visual chat is increasingly
recognized. Despite recent efforts to enable LMMs to support grounding, their
capabilities for grounding and chat are usually separate, and their chat
performance drops dramatically when asked to ground. The problem is the lack of
a dataset for grounded visual chat (GVC). Existing grounding datasets only
contain short captions. To address this issue, we have created GVC data that
allows for the combination of grounding and chat capabilities. To better
evaluate the GVC capabilities, we have introduced a benchmark called
Grounding-Bench. Additionally, we have proposed a model design that can support
GVC and various types of visual prompts by connecting segmentation models with
language models. Experimental results demonstrate that our model outperforms
other LMMs on Grounding-Bench. Furthermore, our model achieves competitive
performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K
Entities. Our code will be released at
https://github.com/UX-Decoder/LLaVA-Grounding .
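The abstract describes a model design that connects a segmentation model with a language model so that phrases in a chat response can be grounded to image regions. The sketch below is a minimal illustration of what such a grounded response could look like on the data side: grounded phrases are wrapped in tags that point to mask indices produced by a separate segmentation module. The tag format, the GroundedSpan structure, and MockSegmentationHead are illustrative assumptions, not the exact interface used by LLaVA-Grounding.

```python
# Minimal sketch: keep chat text and segmentation masks aligned via grounding tags.
# The <g>...</g>[segK] format and MockSegmentationHead are assumptions for
# illustration only; they are not LLaVA-Grounding's actual interface.
import re
from dataclasses import dataclass


@dataclass
class GroundedSpan:
    phrase: str   # grounded noun phrase from the chat response
    start: int    # character offset of the phrase in the plain text
    mask_id: int  # index into the segmentation module's output


class MockSegmentationHead:
    """Stand-in for a separate segmentation model connected to the LMM."""

    def masks_for(self, mask_ids):
        # Return a dummy 4x4 binary mask per requested id.
        return {i: [[1 if (r + c + i) % 2 == 0 else 0 for c in range(4)]
                    for r in range(4)] for i in mask_ids}


GROUND_TAG = re.compile(r"<g>(.*?)</g>\[seg(\d+)\]")


def parse_grounded_response(tagged_text: str):
    """Strip grounding tags and collect (phrase, offset, mask_id) spans."""
    spans, plain = [], []
    cursor = offset = 0
    for m in GROUND_TAG.finditer(tagged_text):
        plain.append(tagged_text[cursor:m.start()])
        offset += m.start() - cursor
        phrase, mask_id = m.group(1), int(m.group(2))
        spans.append(GroundedSpan(phrase, offset, mask_id))
        plain.append(phrase)
        offset += len(phrase)
        cursor = m.end()
    plain.append(tagged_text[cursor:])
    return "".join(plain), spans


if __name__ == "__main__":
    response = ("The image shows <g>a man</g>[seg0] holding "
                "<g>a red umbrella</g>[seg1] near the curb.")
    text, spans = parse_grounded_response(response)
    masks = MockSegmentationHead().masks_for([s.mask_id for s in spans])
    print(text)
    for s in spans:
        print(f"  '{s.phrase}' @ char {s.start} -> mask {s.mask_id} "
              f"({sum(map(sum, masks[s.mask_id]))} foreground cells)")
```

In a full system the mask indices would presumably correspond to grounding tokens emitted by the LMM and decoded into masks by the segmentation model; the parser here only demonstrates how tagged text and masks can be kept aligned.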
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision [29.004844323516412]
Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities.
We find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision.
We propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision.
arXiv Detail & Related papers (2024-10-10T17:59:55Z)
- CoDi: Conversational Distillation for Grounded Question Answering [10.265241619616676]
We introduce a novel data distillation framework named CoDi.
CoDi allows us to synthesize large-scale, assistant-style datasets in a steerable and diverse manner.
We show that SLMs trained with CoDi-synthesized data achieve performance comparable to models trained on human-annotated data in standard metrics.
arXiv Detail & Related papers (2024-08-20T22:35:47Z)
- Learning Visual Grounding from Generative Vision and Language Model [29.2712567454021]
Visual grounding tasks aim to localize image regions based on natural language references.
We find that grounding knowledge already exists in generative VLM and can be elicited by proper prompting.
Our results demonstrate the promise of generative VLM to scale up visual grounding in the real world.
arXiv Detail & Related papers (2024-07-18T20:29:49Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed GCG task requires densely grounded concepts in natural scenes at a large-scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data [101.63682141248069]
Chat models, such as ChatGPT, have shown impressive capabilities and have been rapidly adopted across numerous domains.
We propose a pipeline that can automatically generate a high-quality multi-turn chat corpus by leveraging ChatGPT.
We employ parameter-efficient tuning to enhance LLaMA, an open-source large language model.
arXiv Detail & Related papers (2023-04-03T17:59:09Z)
- Localizing Moments in Long Video Via Multimodal Guidance [51.72829274071017]
We propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows.
Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% on MAD and 4.52% on Ego4D (NLQ).
arXiv Detail & Related papers (2023-02-26T18:19:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
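Both the main paper and several of the related papers above report results on referring-expression benchmarks such as RefCOCO/+/g, where a prediction is typically counted as correct when its box overlaps the ground-truth box with an IoU of at least 0.5. The sketch below illustrates that conventional metric; the (x1, y1, x2, y2) box format and the 0.5 threshold are standard conventions assumed here, not details drawn from any single paper above.

```python
# Illustrative sketch of the standard referring-expression grounding metric
# (accuracy at IoU >= 0.5) used by RefCOCO-style benchmarks. Boxes are assumed
# to be (x1, y1, x2, y2) in pixels; this is a common convention, not a detail
# taken from the papers above.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_accuracy(preds: List[Box], gts: List[Box], thr: float = 0.5) -> float:
    """Fraction of referring expressions localized with IoU >= thr."""
    assert len(preds) == len(gts)
    hits = sum(box_iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts) if gts else 0.0


if __name__ == "__main__":
    preds = [(10, 10, 60, 60), (100, 40, 160, 120)]
    gts = [(12, 8, 58, 62), (30, 30, 90, 90)]
    print(f"Acc@0.5 = {grounding_accuracy(preds, gts):.2f}")  # 0.50: first box hits, second misses
```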