Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
- URL: http://arxiv.org/abs/2410.08209v1
- Date: Thu, 10 Oct 2024 17:59:55 GMT
- Title: Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
- Authors: Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang
- Abstract summary: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities.
We find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision.
We propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision.
- Score: 29.004844323516412
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io.
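The abstract describes "attend-and-segment" only at a high level: attention that a standard LMM places on its visual tokens while generating a phrase is converted into a pixel-level mask. Below is a minimal, hypothetical sketch of that general idea, assuming a single attention vector over a square grid of visual tokens; the function name, min-max normalization, fixed threshold, and grid/image sizes are illustrative assumptions rather than the authors' exact procedure.

```python
# Hypothetical sketch: turn an LMM attention map over visual tokens into a
# pixel-level mask. Not the paper's exact recipe; shapes, normalization, and
# the 0.5 threshold are assumptions chosen for illustration.
import torch
import torch.nn.functional as F

def attention_to_mask(attn_over_visual_tokens: torch.Tensor,
                      grid_size: tuple[int, int],
                      image_size: tuple[int, int],
                      threshold: float = 0.5) -> torch.Tensor:
    """attn_over_visual_tokens: (num_visual_tokens,) attention weights that a
    generated phrase token places on the visual tokens (e.g., averaged over
    heads and layers). Returns a binary mask at image resolution."""
    h, w = grid_size
    attn = attn_over_visual_tokens.reshape(1, 1, h, w)              # token grid
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-6)   # min-max normalize
    attn = F.interpolate(attn, size=image_size, mode="bilinear",
                         align_corners=False)                       # upsample to pixels
    return (attn.squeeze() > threshold).to(torch.uint8)             # binarize

# Toy usage: a 24x24 visual-token grid upsampled to a 336x336 image.
dummy_attn = torch.rand(24 * 24)
mask = attention_to_mask(dummy_attn, grid_size=(24, 24), image_size=(336, 336))
print(mask.shape, int(mask.sum()))
```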
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training [49.407311947143825]
We present Mono-InternVL, a novel monolithic MLLM that seamlessly integrates a set of visual experts via a multimodal mixture-of-experts structure.
We also propose an innovative pre-training strategy to maximize the visual capability of Mono-InternVL, namely Endogenous Visual Pre-training (EViP).
arXiv Detail & Related papers (2024-10-10T17:59:22Z)
- Towards Open-World Grasping with Large Vision-Language Models [5.317624228510749]
An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning.
We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models.
We evaluate OWG on cluttered indoor scene datasets to showcase its robustness in grounding from open-ended language.
arXiv Detail & Related papers (2024-06-26T19:42:08Z)
- F-LMM: Grounding Frozen Large Multimodal Models [53.8059045627934]
We present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations.
Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits.
Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data.
arXiv Detail & Related papers (2024-06-09T15:14:26Z)
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models [105.7362622712606]
The importance of grounding capability in large multi-modal models (LMMs) is increasingly recognized.
The problem is the lack of a dataset for grounded visual chat (GVC).
We have created GVC data that allows for the combination of grounding and chat capabilities.
Our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities.
arXiv Detail & Related papers (2023-12-05T18:29:31Z)
- Compositional Chain-of-Thought Prompting for Large Multimodal Models [46.721769077885966]
Compositional Chain-of-Thought (CCoT) is a novel zero-shot Chain-of-Thought prompting method.
We first generate a scene graph (SG) using the Large Multimodal Model (LMM) and then use that SG in the prompt to produce a response.
We find that the proposed CCoT approach improves the performance of several popular LMMs not only on compositional benchmarks but also on general multimodal benchmarks.
arXiv Detail & Related papers (2023-11-27T22:23:27Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale (a simplified sketch of a GCG-style mask recall metric follows this entry).
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
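The 44.2 grounding mask recall quoted in the abstract above is reported on GLaMM's Grounded Conversation Generation (GCG) task. As a rough illustration of what such a recall measures, the sketch below counts a ground-truth mask as recalled when at least one predicted mask overlaps it with IoU of at least 0.5; the official GCG evaluation additionally matches predicted phrases to ground-truth phrases and may use different thresholds, so this is a simplification, not the benchmark's protocol.

```python
# Simplified, assumption-laden sketch of a mask recall metric: a ground-truth
# mask counts as recalled if some predicted mask overlaps it with IoU >= 0.5.
# The real GCG evaluation also involves phrase matching, omitted here.
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mask_recall(gt_masks: list[np.ndarray],
                pred_masks: list[np.ndarray],
                iou_thresh: float = 0.5) -> float:
    """Fraction of ground-truth masks recovered by at least one prediction."""
    if not gt_masks:
        return 0.0
    hits = sum(any(mask_iou(gt, pred) >= iou_thresh for pred in pred_masks)
               for gt in gt_masks)
    return hits / len(gt_masks)

# Toy usage: two 8x8 ground-truth masks, one matching prediction -> recall 0.5.
gt = [np.zeros((8, 8), bool), np.zeros((8, 8), bool)]
gt[0][:4, :4] = True
gt[1][4:, 4:] = True
pred = [gt[0].copy()]
print(mask_recall(gt, pred))
```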
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.