Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
- URL: http://arxiv.org/abs/2410.08209v1
- Date: Thu, 10 Oct 2024 17:59:55 GMT
- Title: Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
- Authors: Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang
- Abstract summary: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities.
We find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision.
We propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision.
- Score: 29.004844323516412
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io.
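As a rough illustration of the "attend-and-segment" idea described above (repurposing an LMM's attention maps for pixel-level segmentation), the following minimal sketch converts one generated token's attention over visual patch tokens into a binary mask. The tensor shapes, min-max normalization, bilinear upsampling, fixed threshold, and the helper name `attention_to_mask` are illustrative assumptions, not the authors' exact procedure (which may refine the maps further).

```python
# Minimal sketch (assumptions noted above): turn an LMM's token-to-patch
# attention into a coarse pixel-level mask, as in "attend-and-segment".
import torch
import torch.nn.functional as F

def attention_to_mask(attn_patch: torch.Tensor,
                      image_hw: tuple[int, int],
                      threshold: float = 0.5) -> torch.Tensor:
    """attn_patch: (H_p, W_p) attention from one generated text token to the
    visual patch tokens; returns a binary mask at full image resolution."""
    # Normalize the attention map to [0, 1].
    attn = attn_patch - attn_patch.min()
    attn = attn / (attn.max() + 1e-6)
    # Upsample from the patch grid to the image resolution.
    attn = F.interpolate(attn[None, None], size=image_hw,
                         mode="bilinear", align_corners=False)[0, 0]
    # Threshold into a binary segmentation mask for this token/phrase.
    return (attn > threshold).float()

# Toy usage with a random 24x24 patch-grid attention map.
dummy_attn = torch.rand(24, 24)
mask = attention_to_mask(dummy_attn, image_hw=(336, 336))
print(mask.shape, mask.mean().item())
```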
Related papers
- PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? [3.707598923599952]
The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data.
We show that such MLLMs, when evaluated on recent vision-centric benchmarks, exhibit weak visual question answering ability.
We propose simple baselines, which we call PixFoundation, to extract grounding information that can be plugged into any MLLM.
arXiv Detail & Related papers (2025-02-06T16:29:50Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- F-LMM: Grounding Frozen Large Multimodal Models [53.8059045627934]
We present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations.
Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits.
Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data.
arXiv Detail & Related papers (2024-06-09T15:14:26Z)
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models [105.7362622712606]
The importance of grounding capability in large multimodal models (LMMs) is increasingly recognized.
However, there is a lack of datasets for grounded visual chat (GVC).
We have created GVC data that allows for the combination of grounding and chat capabilities.
Our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities.
arXiv Detail & Related papers (2023-12-05T18:29:31Z)
- Compositional Chain-of-Thought Prompting for Large Multimodal Models [46.721769077885966]
Compositional Chain-of-Thought (CCoT) is a novel zero-shot Chain-of-Thought prompting method.
We first prompt the LMM to generate a scene graph (SG) and then use that SG in the prompt to produce a response.
We find that the proposed CCoT approach improves LMM performance not only on compositional benchmarks but also on general multimodal benchmarks for several popular LMMs.
arXiv Detail & Related papers (2023-11-27T22:23:27Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
- Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection [86.24898024621008]
We present a novel large multimodal model that applies vision experts to industrial anomaly detection (IAD), abbreviated as Myriad.
We utilize the anomaly maps generated by the vision experts as guidance for the LMM, steering it to pay more attention to anomalous regions.
Our proposed method not only performs favorably against state-of-the-art methods, but also inherits the flexibility and instruction-following ability of LMMs in the field of IAD.
arXiv Detail & Related papers (2023-10-29T16:49:45Z)