PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
- URL: http://arxiv.org/abs/2502.04192v2
- Date: Sun, 23 Feb 2025 11:01:02 GMT
- Title: PixFoundation: Are We Heading in the Right Direction with Pixel-level Vision Foundation Models?
- Authors: Mennatullah Siam
- Abstract summary: The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data. We show that such MLLMs, when evaluated on recent vision-centric benchmarks, exhibit a weak ability in visual question answering. We propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call PixFoundation.
- Score: 3.707598923599952
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple works have emerged to push the boundaries of multi-modal large language models (MLLMs) towards pixel-level understanding. Such approaches have shown strong performance on benchmarks for referring expression segmentation and grounded conversation generation. The current trend in pixel-level MLLMs is to train with pixel-level grounding supervision on large-scale labelled data. However, we show that such MLLMs, when evaluated on recent challenging vision-centric benchmarks, exhibit a weak ability in visual question answering. Surprisingly, some of these methods even degrade the grounding ability of MLLMs that were never trained with such supervision. In this work, we propose two novel challenging benchmarks and show that MLLMs without pixel-level grounding supervision can outperform the state of the art on these tasks, when evaluated on both pixel-level grounding and visual question answering. We propose simple baselines to extract the grounding information that can be plugged into any MLLM, which we call PixFoundation. More importantly, we study the research question of "When does grounding emerge in MLLMs that are not trained with pixel-level grounding supervision?" We show that grounding can coincide with object parts or location/appearance information. The code repository is at https://github.com/MSiam/PixFoundation/.
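A minimal sketch of what such a plug-in grounding baseline could look like is given below: it thresholds a text-token-to-image-patch attention map into a coarse binary mask. Everything here is illustrative and assumed rather than taken from the paper; the function name coarse_grounding_mask, the argument phrase_token_indices, and the dummy attention tensor are hypothetical placeholders, not the actual PixFoundation implementation.

import torch
import torch.nn.functional as F

def coarse_grounding_mask(attn_maps: torch.Tensor,
                          phrase_token_indices: list[int],
                          image_size: tuple[int, int],
                          threshold: float = 0.5) -> torch.Tensor:
    # attn_maps: [num_text_tokens, num_patches] attention from answer tokens to image patches.
    # Returns a binary mask of shape image_size for the selected phrase tokens.
    num_patches = attn_maps.shape[-1]
    grid = int(num_patches ** 0.5)  # assume a square patch grid, e.g. 24x24 for a 336px ViT-L/14

    # Average the selected phrase tokens' attention over image patches.
    phrase_attn = attn_maps[phrase_token_indices].mean(dim=0)   # [num_patches]
    heat = phrase_attn.reshape(1, 1, grid, grid)                # [1, 1, H_p, W_p]

    # Upsample the patch-level heatmap to the full image resolution.
    heat = F.interpolate(heat, size=image_size, mode="bilinear", align_corners=False).squeeze()

    # Min-max normalise and threshold into a binary mask.
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    return (heat > threshold).float()

# Usage with dummy attention, purely for illustration:
attn = torch.rand(32, 576)  # 32 answer tokens, 24x24 = 576 image patches
mask = coarse_grounding_mask(attn, phrase_token_indices=[5, 6], image_size=(336, 336))

In practice the attention maps would come from the MLLM itself, and the phrase tokens from the part of the answer where grounding is observed to emerge (e.g. around object-part or location/appearance phrases, as the abstract notes).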
Related papers
- Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding [65.11838260342586]
We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks.
We propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs.
We also introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability.
arXiv Detail & Related papers (2025-04-14T17:52:22Z) - Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images [58.38037252899024]
We present a system using Multimodal LLMs to analyze a large database with tens of millions of images.
We aim to capture frequent co-occurring changes ("trends") across a city over a certain period.
We find it significantly outperforms baselines and is able to discover interesting trends from images captured in large cities.
arXiv Detail & Related papers (2025-04-11T17:55:45Z) - Evaluating Graphical Perception with Multimodal LLMs [2.090547583226381]
Multimodal Large Language Models (MLLMs) have made remarkable progress in analyzing and understanding images.
For visualization, how do MLLMs perform when applied to graphical perception tasks?
Our study primarily evaluates fine-tuned and pretrained models and zero-shot prompting to determine if they closely match human graphical perception.
arXiv Detail & Related papers (2025-04-05T16:14:08Z) - SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Modeling Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools.
HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens.
HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z) - MC-Bench: A Benchmark for Multi-Context Visual Grounding in the Era of MLLMs [61.56904387052982]
This paper proposes a new visual grounding task called multi-context visual grounding.
It aims to localize instances of interest across multiple images based on open-ended text prompts.
We benchmark over 20 state-of-the-art MLLMs and foundation models with potential multi-context visual grounding capabilities.
arXiv Detail & Related papers (2024-10-16T07:52:57Z) - Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision [29.004844323516412]
Current large multimodal models (LMMs) face challenges in grounding, which requires the model to relate language components to visual entities.
We find that the grounding ability can in fact emerge in LMMs trained without explicit grounding supervision.
We propose DIFFLMM, an LMM utilizing a diffusion-based visual encoder, as opposed to the standard CLIP visual encoder, and trained with the same weak supervision.
arXiv Detail & Related papers (2024-10-10T17:59:55Z) - Can Large Language Models Understand Symbolic Graphics Programs? [136.5639211254501]
Symbolic graphics programs are popular in computer graphics. We create a benchmark for the semantic visual understanding of symbolic graphics programs. We find that LLMs considered stronger at reasoning generally perform better.
arXiv Detail & Related papers (2024-08-15T17:59:57Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach brings substantial gains to GPT4-V/O consistently on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z) - Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding [46.042197741423365]
Large language models (LLMs) have made significant advancements in natural language understanding.
This work investigates if it is possible for the LLM to understand images as well.
arXiv Detail & Related papers (2023-06-09T17:57:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.