PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
- URL: http://arxiv.org/abs/2502.04192v1
- Date: Thu, 06 Feb 2025 16:29:50 GMT
- Title: PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
- Authors: Mennatullah Siam
- Abstract summary: The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data.
We show that such MLLMs, when evaluated on recent vision-centric benchmarks, exhibit a weak ability in visual question answering.
We propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call PixFoundation.
- Score: 3.707598923599952
- Abstract: Multiple works have emerged to push the boundaries on multi-modal large language models (MLLMs) towards pixel-level understanding. Such approaches have shown strong performance on benchmarks for referring expression segmentation and grounded conversation generation. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering. Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such supervision. In this work, we propose two novel challenging benchmarks and show that MLLMs without pixel-level grounding supervision can outperform the state of the art in such tasks when evaluating both the pixel-level grounding and visual question answering. We propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call PixFoundation. More importantly, we study the research question of ``When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?'' We show that grounding can coincide with object parts or location/appearance information. Code repository is at https://github.com/MSiam/PixFoundation/.
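The abstract says the grounding information can be extracted and plugged into any MLLM, but it does not spell out the extraction mechanism. The sketch below is one minimal, illustrative way such a post-hoc baseline could look, assuming the grounding signal is read out as attention weights over image patches for a phrase in the model's answer; the function name `attention_to_mask`, the patch-grid size, and the thresholding step are assumptions for illustration, not the paper's actual baselines (see the linked code repository for those).

```python
# Hedged sketch: turn an MLLM's per-patch attention weights for one answer
# phrase into a segmentation-style mask, with no pixel-level supervision.
# The attention source and all parameters here are illustrative assumptions.
import torch
import torch.nn.functional as F


def attention_to_mask(patch_attn: torch.Tensor,
                      grid_hw: tuple[int, int],
                      image_hw: tuple[int, int],
                      threshold: float = 0.6) -> torch.Tensor:
    """Convert attention weights over image patches into a binary mask.

    patch_attn: (num_patches,) attention weights for one answer token/phrase.
    grid_hw:    patch grid size, e.g. (24, 24) for a 336x336 image, 14px patches.
    image_hw:   target image size (H, W).
    """
    gh, gw = grid_hw
    attn = patch_attn.reshape(1, 1, gh, gw)                        # 2D heat map
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-6)  # normalise to [0, 1]
    heat = F.interpolate(attn, size=image_hw, mode="bilinear",
                         align_corners=False)                      # upsample to pixel grid
    return heat.squeeze() > threshold                              # binary mask


if __name__ == "__main__":
    # Dummy weights stand in for whatever attention the MLLM exposes;
    # a real pipeline would read them from the model's forward pass.
    fake_attn = torch.rand(24 * 24)
    mask = attention_to_mask(fake_attn, grid_hw=(24, 24), image_hw=(336, 336))
    print(mask.shape, mask.float().mean().item())
```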
Related papers
- MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision [29.004844323516412]
Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities.
We find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision.
We propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision.
arXiv Detail & Related papers (2024-10-10T17:59:55Z) - Can Large Language Models Understand Symbolic Graphics Programs? [136.5639211254501]
Symbolic graphics programs are popular in computer graphics.
We create a benchmark for the semantic visual understanding of symbolic graphics programs.
We find that LLMs considered stronger at reasoning generally perform better.
arXiv Detail & Related papers (2024-08-15T17:59:57Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - Probing Multimodal Large Language Models for Global and Local Semantic Representations [57.25949445963422]
We study which layers of Multimodal Large Language Models contribute most to encoding global image information.
In this study, we find that the intermediate layers of models can encode more global semantic information.
We find that the topmost layers may excessively focus on local information, leading to a diminished ability to encode global information.
arXiv Detail & Related papers (2024-02-27T08:27:15Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z) - Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding [46.042197741423365]
Large language models (LLMs) have made significant advancements in natural language understanding.
This work investigates whether LLMs can also understand images, represented as scalable vector graphics.
arXiv Detail & Related papers (2023-06-09T17:57:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.