Related papers: OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

URL: http://arxiv.org/abs/2406.19389v2
Date: Tue, 01 Oct 2024 06:07:24 GMT
Title: OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Authors: Tao Zhang, Xiangtai Li, Hao Fei, Haobo Yuan, Shengqiong Wu, Shunping Ji, Chen Change Loy, Shuicheng Yan
Abstract summary: OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
Score: 112.87441334765693
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using an LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.

Related papers

UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture [46.88622050135756]
We present UniPercept-Bench, a unified framework for perceptual-level image understanding across three key domains: Aesthetics, Quality, Structure and Texture. We develop a strong baseline UniPercept trained via Domain-Aligned Pre-Training and Task-Aligned RL, enabling robust generalization across both Visual Rating (VR) and Visual Question Answering (VQA) tasks.
arXiv Detail & Related papers (2025-12-25T13:35:52Z)
PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images [58.73779101355669]
PixelCraft is a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. Building on this foundation, PixelCraft facilitates visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism.
arXiv Detail & Related papers (2025-09-29T17:59:49Z)
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning [83.68366772745689]
We propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos.
arXiv Detail & Related papers (2025-09-22T17:59:40Z)
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding [65.11838260342586]
We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. We propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs. We also introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability.
arXiv Detail & Related papers (2025-04-14T17:52:22Z)
HyperSeg: Towards Universal Visual Segmentation with Large Language Model [40.326267496697845]
This paper aims to address universal segmentation for image and video perception with the strong reasoning ability empowered by Visual Large Language Models (VLLMs). We propose HyperSeg, the first VLLM-based universal segmentation model for pixel-level image and video perception.
arXiv Detail & Related papers (2024-11-26T17:18:20Z)
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing [150.0380447353081]
We present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos. Building on top of an LLM, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its modules, while employing state-of-the-art visual specialists as its backend.
arXiv Detail & Related papers (2024-10-08T08:39:04Z)
Instruction Tuning-free Visual Token Complement for Multimodal LLMs [51.138806401996696]
Multimodal large language models (MLLMs) have promised an elegant bridge between vision and language. We propose a Visual Token Complement framework (VTC) that helps MLLMs regain the missing visual features. Our VTC integrates text-to-image generation as a guide to identifying the text-irrelevant features, and a visual selector is then developed to generate complementary visual tokens.
arXiv Detail & Related papers (2024-08-09T12:13:01Z)
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives (groups of visual tokens). Our method unifies the prompt and answer of visual referential tasks without using additional syntax. ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting. Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM. To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest. This technique allows LVLMs to access more detailed visual information without altering the original image resolution. Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
VCoder: Versatile Vision Encoders for Multimodal Large Language Models [46.95488342139727]
Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks. However, when prompted to identify or count (perceive) the entities in a given image, existing MLLM systems fail. We propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs.
arXiv Detail & Related papers (2023-12-21T18:49:47Z)
TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs. We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks. We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
VLMAE: Vision-Language Masked Autoencoder [21.97700040013084]
We propose a vision-language masked autoencoder framework (VLMAE) for vision-language pre-training. VLMAE employs visual generative learning, facilitating the model to acquire fine-grained and unbiased features.
arXiv Detail & Related papers (2022-08-19T14:39:18Z)
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.