Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
- URL: http://arxiv.org/abs/2410.08209v1
- Date: Thu, 10 Oct 2024 17:59:55 GMT
- Title: Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision
- Authors: Shengcao Cao, Liang-Yan Gui, Yu-Xiong Wang
- Abstract summary: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities.
We find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision.
We propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision.
- Score: 29.004844323516412
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities. Contrary to the common practice that fine-tunes LMMs with additional grounding supervision, we find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision. To reveal this emerging grounding, we introduce an "attend-and-segment" method which leverages attention maps from standard LMMs to perform pixel-level segmentation. Furthermore, to enhance the grounding ability, we propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision. Without being constrained by the biases and limited scale of grounding-specific supervision data, our approach is more generalizable and scalable. We achieve competitive performance on both grounding-specific and general visual question answering benchmarks, compared with grounding LMMs and generalist LMMs, respectively. Notably, we achieve a 44.2 grounding mask recall on grounded conversation generation without any grounding supervision, outperforming the extensively supervised model GLaMM. Project page: https://groundLMM.github.io.
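As a rough illustration of the "attend-and-segment" idea described above (repurposing an LMM's attention maps for pixel-level segmentation), the following minimal sketch converts one generated token's attention over visual patch tokens into a binary mask. The tensor shapes, min-max normalization, bilinear upsampling, fixed threshold, and the helper name `attention_to_mask` are illustrative assumptions, not the authors' exact procedure (which may refine the maps further).

```python
# Minimal sketch (assumptions noted above): turn an LMM's token-to-patch
# attention into a coarse pixel-level mask, as in "attend-and-segment".
import torch
import torch.nn.functional as F

def attention_to_mask(attn_patch: torch.Tensor,
                      image_hw: tuple[int, int],
                      threshold: float = 0.5) -> torch.Tensor:
    """attn_patch: (H_p, W_p) attention from one generated text token to the
    visual patch tokens; returns a binary mask at full image resolution."""
    # Normalize the attention map to [0, 1].
    attn = attn_patch - attn_patch.min()
    attn = attn / (attn.max() + 1e-6)
    # Upsample from the patch grid to the image resolution.
    attn = F.interpolate(attn[None, None], size=image_hw,
                         mode="bilinear", align_corners=False)[0, 0]
    # Threshold into a binary segmentation mask for this token/phrase.
    return (attn > threshold).float()

# Toy usage with a random 24x24 patch-grid attention map.
dummy_attn = torch.rand(24, 24)
mask = attention_to_mask(dummy_attn, image_hw=(336, 336))
print(mask.shape, mask.mean().item())
```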
Related papers
- PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models? [3.707598923599952]
The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data.
We show that such MLLMs, when evaluated on recent vision-centric benchmarks, exhibit weak visual question answering ability.
We propose simple baselines, which we call PixFoundation, to extract grounding information that can be plugged into any MLLM.
arXiv Detail & Related papers (2025-02-06T16:29:50Z)
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- F-LMM: Grounding Frozen Large Multimodal Models [53.8059045627934]
We present F-LMM -- grounding frozen off-the-shelf LMMs in human-AI conversations.
Using only a few trainable CNN layers, we can translate word-pixel attention weights to mask logits.
Our F-LMM neither learns special segmentation tokens nor utilises high-quality grounded instruction-tuning data.
arXiv Detail & Related papers (2024-06-09T15:14:26Z)
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models [105.7362622712606]
The importance of grounding capability in large multimodal models (LMMs) is increasingly recognized.
However, there is a lack of datasets for grounded visual chat (GVC).
We have created GVC data that allows for the combination of grounding and chat capabilities.
Our model achieves competitive performance on classic grounding benchmarks like RefCOCO/+/g and Flickr30K Entities.
arXiv Detail & Related papers (2023-12-05T18:29:31Z)
- Compositional Chain-of-Thought Prompting for Large Multimodal Models [46.721769077885966]
Compositional Chain-of-Thought (CCoT) is a novel zero-shot Chain-of-Thought prompting method.
We first prompt the LMM to generate a scene graph (SG) and then use that SG in the prompt to produce a response.
We find that the proposed CCoT approach improves LMM performance not only on compositional benchmarks but also on general multimodal benchmarks for several popular LMMs.
arXiv Detail & Related papers (2023-11-27T22:23:27Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
- Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection [86.24898024621008]
We present a novel large multimodal model that applies vision experts to industrial anomaly detection (IAD), abbreviated as Myriad.
We utilize the anomaly maps generated by the vision experts as guidance for the LMM, steering it to pay more attention to anomalous regions.
Our proposed method not only performs favorably against state-of-the-art methods, but also inherits the flexibility and instruction-following ability of LMMs in the field of IAD.
arXiv Detail & Related papers (2023-10-29T16:49:45Z)