Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
- URL: http://arxiv.org/abs/2504.10465v1
- Date: Mon, 14 Apr 2025 17:52:22 GMT
- Title: Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
- Authors: Tao Zhang, Xiangtai Li, Zilong Huang, Yanwei Li, Weixian Lei, Xueqing Deng, Shihao Chen, Shunping Ji, Jiashi Feng
- Abstract summary: We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. We propose a novel visual prompt injection strategy to enable the single transformer to understand visual prompt inputs. We also introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability.
- Score: 65.11838260342586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) achieve remarkable performance on fine-grained pixel-level understanding tasks. However, existing works rely heavily on extra components, such as a vision encoder (CLIP) and segmentation experts, leading to high system complexity and limiting model scaling. In this work, our goal is to explore a highly simplified MLLM without introducing extra components. Our work is motivated by recent work on the Single trAnsformer as a unified vIsion-Language Model (SAIL) design, which jointly learns vision tokens and text tokens in one transformer. We present Pixel-SAIL, a single transformer for pixel-wise MLLM tasks. In particular, we present three technical improvements over the plain baseline. First, we design a learnable upsampling module to refine visual token features. Second, we propose a novel visual prompt injection strategy that enables the single transformer to understand visual prompt inputs and benefit from the early fusion of visual prompt embeddings and vision tokens. Third, we introduce a vision expert distillation strategy to efficiently enhance the single transformer's fine-grained feature extraction capability. In addition, we have collected a comprehensive pixel understanding benchmark (PerBench), verified with manual checks. It includes three tasks: detailed object description, visual prompt-based question answering, and visual-text referring segmentation. Extensive experiments on four referring segmentation benchmarks, one visual prompt benchmark, and our PerBench show that Pixel-SAIL achieves comparable or even better results with a much simpler pipeline. Code and models will be released at https://github.com/magic-research/Sa2VA.
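The visual prompt injection described in the abstract (early fusion of visual prompt embeddings with vision tokens) can be pictured with a small sketch. The snippet below is an illustrative assumption, not the authors' implementation: it treats the visual prompt as a binary region mask, pools it onto the patch-token grid, and adds a learned prompt embedding to the tokens inside the region before the single transformer processes them. Names such as `VisualPromptInjector` are hypothetical.

```python
# Hypothetical sketch of early-fusion visual prompt injection (not Pixel-SAIL's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualPromptInjector(nn.Module):
    """Mark vision tokens that fall inside a prompted region with a learned embedding."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        # Learned embedding meaning "this patch lies inside the visual prompt region".
        self.prompt_embed = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, vision_tokens: torch.Tensor, prompt_mask: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (B, N, C) patch tokens, N = H_tok * W_tok (square grid assumed)
        # prompt_mask:   (B, 1, H_img, W_img) binary mask marking the prompted region
        B, N, C = vision_tokens.shape
        side = int(N ** 0.5)
        # Pool the mask down to the token grid, then flatten it to (B, N, 1).
        mask = F.interpolate(prompt_mask.float(), size=(side, side), mode="nearest")
        mask = mask.flatten(2).transpose(1, 2)
        # Early fusion: add the prompt embedding to every token inside the region,
        # before any transformer layer (and before concatenation with text tokens).
        return vision_tokens + mask * self.prompt_embed


# Example usage with dummy shapes:
injector = VisualPromptInjector(hidden_dim=1024)
tokens = torch.randn(1, 24 * 24, 1024)        # 24x24 patch grid
mask = torch.zeros(1, 1, 336, 336)
mask[:, :, 100:200, 120:220] = 1.0            # a rectangular prompt region
fused = injector(tokens, mask)                # same shape as `tokens`
```

In a single-transformer design such as SAIL/Pixel-SAIL, the fused vision tokens would then simply be concatenated with the text tokens; the paper's actual injection scheme may differ in detail.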
Related papers
- LLaVA-UHD v2: an MLLM Integrating High-Resolution Semantic Pyramid via Hierarchical Window Transformer [110.39467860530819]
Vision transformers (ViTs) are widely employed in multimodal large language models (MLLMs) for visual encoding.
We present LLaVA-UHD v2, an MLLM with advanced perception abilities, achieved by introducing a well-designed vision-language projector, the Hierarchical window (Hiwin) transformer.
The Hiwin transformer enhances the MLLM's ability to capture diverse multi-modal visual granularities by incorporating a constructed high-resolution semantic pyramid.
arXiv Detail & Related papers (2024-12-18T14:07:46Z) - Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing [150.0380447353081]
We present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos.
Building on top of an LLM, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its modules, while employing state-of-the-art visual specialists as its backend.
arXiv Detail & Related papers (2024-10-08T08:39:04Z) - OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding [112.87441334765693]
OMG-LLaVA is a new framework combining powerful pixel-level vision understanding with reasoning abilities.
It can accept various visual and text prompts for flexible user interaction.
OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model.
arXiv Detail & Related papers (2024-06-27T17:59:01Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We present the Draw-and-Understand framework, exploring how to integrate visual prompting understanding capabilities into Multimodal Large Language Models (MLLMs). Visual prompts allow users to interact through multi-modal instructions, enhancing the models' interactivity and fine-grained image comprehension. In this framework, we propose a general architecture adaptable to different pre-trained MLLMs, enabling it to recognize various types of visual prompts.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Osprey: Pixel Understanding with Visual Instruction Tuning [15.094943732551018]
Osprey is a mask-text instruction tuning approach to extend MLLMs by incorporating fine-grained mask regions into language instruction.
To achieve this goal, we first curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM.
Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input.
arXiv Detail & Related papers (2023-12-15T18:58:11Z) - PixelLM: Pixel Reasoning with Large Multimodal Model [110.500792765109]
PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding.
It produces masks from the hidden embeddings of the codebook tokens, which encode detailed target-relevant information (a minimal sketch of this token-to-mask decoding pattern appears after this list).
PixelLM excels across various pixel-level image reasoning and understanding tasks, outperforming well-established methods in multiple benchmarks.
arXiv Detail & Related papers (2023-12-04T03:05:59Z) - Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3$\times$3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which allows for the utilization of hierarchical features.
arXiv Detail & Related papers (2023-10-19T14:01:40Z) - SimPLR: A Simple and Plain Transformer for Efficient Object Detection and Segmentation [49.65221743520028]
We show that shifting the multiscale inductive bias into the attention mechanism can work well, resulting in a plain detector, SimPLR. We find through our experiments that SimPLR with scale-aware attention is a plain and simple architecture, yet competitive with multi-scale vision transformer alternatives.
arXiv Detail & Related papers (2023-10-09T17:59:26Z) - Long-Range Grouping Transformer for Multi-View 3D Reconstruction [9.2709012704338]
Long-range grouping attention (LGA) based on the divide-and-conquer principle is proposed.
An effective and efficient encoder can be established which connects inter-view features.
A novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution.
arXiv Detail & Related papers (2023-08-17T01:34:59Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long-range dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)
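As a companion to the PixelLM entry above, here is a minimal, hypothetical sketch of the token-to-mask decoding pattern used by pixel-level MLLMs, in which a segmentation mask is read out as the similarity between a token's hidden embedding and dense per-pixel features. The class and variable names are illustrative and do not come from the PixelLM codebase.

```python
# Hypothetical sketch of token-to-mask decoding (not PixelLM's released code):
# a mask is predicted as the dot product between a token's hidden embedding
# (projected into the pixel-feature space) and dense per-pixel features.
import torch
import torch.nn as nn


class TokenToMask(nn.Module):
    def __init__(self, llm_dim: int, pixel_dim: int):
        super().__init__()
        # Project the LLM token state into the same space as the pixel features.
        self.proj = nn.Linear(llm_dim, pixel_dim)

    def forward(self, token_state: torch.Tensor, pixel_feats: torch.Tensor) -> torch.Tensor:
        # token_state: (B, C_llm)       hidden embedding of a mask/codebook token
        # pixel_feats: (B, C_pix, H, W) dense image features (e.g. upsampled vision tokens)
        query = self.proj(token_state)                              # (B, C_pix)
        logits = torch.einsum("bc,bchw->bhw", query, pixel_feats)   # (B, H, W) mask logits
        return logits


# Example usage with random tensors:
decoder = TokenToMask(llm_dim=4096, pixel_dim=256)
mask_logits = decoder(torch.randn(2, 4096), torch.randn(2, 256, 64, 64))
binary_mask = mask_logits.sigmoid() > 0.5  # one binary mask per image
```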
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.