Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation
- URL: http://arxiv.org/abs/2509.13676v1
- Date: Wed, 17 Sep 2025 04:04:08 GMT
- Title: Re-purposing SAM into Efficient Visual Projectors for MLLM-Based Referring Image Segmentation
- Authors: Xiaobo Yang, Xiaojin Gong
- Abstract summary: We propose a novel semantic visual projector that uses semantic superpixels to identify "visual words" in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity. Experiments show that our method cuts visual tokens by 93% without compromising performance.
- Score: 9.120581644616488
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, Referring Image Segmentation (RIS) frameworks that pair the Multimodal Large Language Model (MLLM) with the Segment Anything Model (SAM) have achieved impressive results. However, adapting MLLM to segmentation is computationally intensive, primarily due to visual token redundancy. We observe that traditional patch-wise visual projectors struggle to strike a balance between reducing the number of visual tokens and preserving semantic clarity, often retaining overly long token sequences to avoid performance drops. Inspired by text tokenizers, we propose a novel semantic visual projector that leverages semantic superpixels generated by SAM to identify "visual words" in an image. By compressing and projecting semantic superpixels as visual tokens, our approach adaptively shortens the token sequence according to scene complexity while minimizing semantic loss in compression. To mitigate loss of information, we propose a semantic superpixel positional embedding to strengthen MLLM's awareness of superpixel geometry and position, alongside a semantic superpixel aggregator to preserve both fine-grained details inside superpixels and global context outside. Experiments show that our method cuts visual tokens by 93% without compromising performance, notably speeding up MLLM training and inference, and outperforming existing compressive visual projectors on RIS.
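To make the core idea concrete, here is a minimal PyTorch sketch of a superpixel-based visual projector in the spirit of the abstract: patch features from the vision encoder are average-pooled inside each SAM-generated superpixel mask to form one "visual word" token, which is then projected into the LLM embedding space. All module and variable names are illustrative assumptions; the paper's semantic superpixel positional embedding and aggregator are not reproduced here.

```python
# Minimal sketch (PyTorch) of the idea described in the abstract:
# pool patch features inside each SAM-generated superpixel mask to form one
# "visual word" token, then project into the LLM embedding space.
# All names here are illustrative, not the authors' implementation.
import torch
import torch.nn as nn


class SuperpixelProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # simple MLP projector; the paper's aggregator and positional
        # embedding are richer, this only shows the token-count reduction
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, patch_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # patch_feats: (N_patches, vision_dim) features from the vision encoder
        # masks:       (N_superpixels, N_patches) binary assignment of patches
        #              to SAM superpixels (resampled to the patch grid)
        masks = masks.float()
        area = masks.sum(dim=1, keepdim=True).clamp(min=1.0)   # avoid div-by-zero
        pooled = masks @ patch_feats / area                     # (N_superpixels, vision_dim)
        return self.proj(pooled)                                # (N_superpixels, llm_dim)


# Usage: a 24x24 grid of 576 ViT patches collapses to however many superpixels
# SAM finds, so the token sequence length tracks scene complexity.
projector = SuperpixelProjector()
patch_feats = torch.randn(576, 1024)
masks = torch.rand(40, 576) > 0.9        # 40 dummy superpixel masks
visual_tokens = projector(patch_feats, masks)
print(visual_tokens.shape)               # torch.Size([40, 4096])
```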
Related papers
- ARGenSeg: Image Segmentation with Autoregressive Image Generation Model [46.837184955843355]
We propose a novel AutoRegressive Generation-based paradigm for image segmentation (ARGenSeg). Our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed.
arXiv Detail & Related papers (2025-10-23T17:58:26Z) - Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem. The key innovation is semantic descriptors, a new textual representation of segmentation masks. Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z) - X-SAM: From Segment Anything to Any Segmentation [63.79182974315084]
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding. We present X-SAM, a streamlined Multimodal Large Language Model framework that extends the segmentation paradigm from "segment anything" to "any segmentation". We propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visually grounded, pixel-wise interpretative capabilities.
arXiv Detail & Related papers (2025-08-06T17:19:10Z) - SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Modeling Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z) - Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
Processing high-resolution images and videos poses a barrier to the broader adoption of MLLMs. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z) - Spatial-Aware Efficient Projector for MLLMs via Multi-Layer Feature Aggregation [10.468784974994465]
The projector plays a crucial role in multimodal large language models (MLLMs).
Current explorations on the projector focus on reducing the number of visual tokens to improve efficiency.
A Spatial-Aware Efficient Projector (SAEP) is proposed to address this issue.
arXiv Detail & Related papers (2024-10-14T09:25:09Z) - Text4Seg: Reimagining Image Segmentation as Text Generation [32.230379277018194]
We introduce Text4Seg, a novel text-as-mask paradigm that casts image segmentation as a text generation problem. The key innovation is semantic descriptors, a new textual representation of segmentation masks where each image patch is mapped to its corresponding text label. We show that Text4Seg achieves state-of-the-art performance on multiple datasets by fine-tuning different MLLM backbones.
arXiv Detail & Related papers (2024-10-13T14:28:16Z) - Towards Semantic Equivalence of Tokenization in Multimodal LLM [149.11720372278273]
Vision tokenization is essential for semantic alignment between vision and language. This paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok). SeTok groups visual features into semantic units via a dynamic clustering algorithm. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features.
arXiv Detail & Related papers (2024-06-07T17:55:43Z) - DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models [28.019592576500113]
This study examines the projector module by interpreting the vision-language semantic flow within MLLMs.
Our findings reveal that compressive projectors abstract visual patches into a limited set of semantic concepts, such as objects or attributes, resulting in a 'double abstraction' phenomenon.
We propose the key insight of 'Decouple Compression from Abstraction' (DeCo): compressing the number of visual tokens at the patch level with the projector rather than through semantic abstraction (see the sketch after this list).
arXiv Detail & Related papers (2024-05-31T16:31:38Z)
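As a companion to the DeCo entry above, here is a minimal sketch of patch-level token compression using 2D adaptive average pooling over the patch grid, keeping compression separate from semantic abstraction. The pooling choice, grid sizes, and function names are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of patch-level token compression: downsample the patch grid
# with 2D adaptive average pooling before the projector, so compression
# happens spatially rather than through semantic abstraction.
# Shapes and names are illustrative assumptions.
import torch
import torch.nn.functional as F


def compress_patch_tokens(patch_feats: torch.Tensor, out_grid: int = 12) -> torch.Tensor:
    # patch_feats: (N, D) tokens from a square patch grid, e.g. 24x24 = 576
    n, d = patch_feats.shape
    side = int(n ** 0.5)
    grid = patch_feats.view(side, side, d).permute(2, 0, 1).unsqueeze(0)  # (1, D, H, W)
    pooled = F.adaptive_avg_pool2d(grid, out_grid)                        # (1, D, 12, 12)
    return pooled.flatten(2).squeeze(0).transpose(0, 1)                   # (144, D)


tokens = compress_patch_tokens(torch.randn(576, 1024))
print(tokens.shape)  # torch.Size([144, 1024]) -- 4x fewer visual tokens
```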