PixelLM: Pixel Reasoning with Large Multimodal Model
- URL: http://arxiv.org/abs/2312.02228v3
- Date: Thu, 18 Jul 2024 07:18:36 GMT
- Title: PixelLM: Pixel Reasoning with Large Multimodal Model
- Authors: Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin
- Abstract summary: PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding.
It produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information.
PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks.
- Score: 110.500792765109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large multimodal models (LMMs) have achieved remarkable progress, generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap, we introduce PixelLM, an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM is a novel, lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information. With this design, PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore, we propose a target refinement loss to enhance the model's ability to differentiate between multiple targets, leading to substantially improved mask quality. To advance research in this area, we construct MUSE, a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks, including MUSE, single- and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code, models, and datasets will be publicly available.
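The abstract says the decoder produces masks from the hidden embeddings of the codebook tokens. As a rough illustration only, the sketch below uses a generic dot-product mask head: each token embedding is compared against per-pixel features and the score is thresholded. This is not PixelLM's actual decoder; the function name, the toy feature grid, and the threshold are all assumptions made for the example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a raw score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def masks_from_token_embeddings(token_embeddings, pixel_features, threshold=0.5):
    """Hypothetical mask head (an assumption, not the paper's decoder).

    token_embeddings: list of K vectors, each of length D.
    pixel_features:   H x W grid (list of rows) of D-dimensional vectors.
    Returns K binary masks, each an H x W grid of 0/1 values.
    """
    masks = []
    for token in token_embeddings:
        mask = []
        for row in pixel_features:
            mask_row = []
            for feature in row:
                # Dot product between the token embedding and the pixel feature.
                score = sum(t * f for t, f in zip(token, feature))
                mask_row.append(1 if sigmoid(score) > threshold else 0)
            mask.append(mask_row)
        masks.append(mask)
    return masks


# Toy 2x2 feature grid with D = 2: the left column looks like "target A",
# the right column like "target B".
pixel_features = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
# One embedding per target; in this toy setup each token picks out one column.
token_embeddings = [[2.0, -2.0], [-2.0, 2.0]]

masks = masks_from_token_embeddings(token_embeddings, pixel_features)
print(masks[0])  # [[1, 0], [1, 0]]
print(masks[1])  # [[0, 1], [0, 1]]
```

Producing one mask per token is what lets a single forward pass return several targets at once, which is the multi-target setting the abstract emphasizes.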
Related papers
- GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing [22.729750410621826]
GeoPix is an RS MLLM that extends image understanding capabilities to the pixel level.
To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor.
To address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset.
arXiv Detail & Related papers (2025-01-12T14:45:27Z)
- PiLaMIM: Toward Richer Visual Representations by Integrating Pixel and Latent Masked Image Modeling [7.630967411418269]
We propose PiLaMIM, a unified framework that combines Pixel MIM and Latent MIM to integrate their complementary strengths.
Our method uses a single encoder along with two distinct decoders: one for predicting pixel values and another for latent representations, ensuring the capture of both high-level and low-level visual features.
arXiv Detail & Related papers (2025-01-06T13:30:16Z)
- Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
The cost of processing high-resolution images and videos poses a barrier to the broader adoption of MLLMs.
Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs.
We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
- Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model [19.861556031795725]
We introduce a Multi-Granularity Large Multimodal Model (MGLMM).
MGLMM is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions.
It excels at tackling more than eight downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-20T11:13:31Z)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.38717274524681]
This study explores the design space for multimodal large language models (MLLMs) using a mixture of vision encoders and resolutions.
Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach.
The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
- Matryoshka Multimodal Models [92.41824727506751]
We propose M3: Matryoshka Multimodal Models, which learns to represent visual content as nested sets of visual tokens.
We find that COCO-style benchmarks only need around 9 visual tokens to obtain accuracy similar to that of using all 576 tokens.
arXiv Detail & Related papers (2024-05-27T17:59:56Z)
- PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges.
It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks.
The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.