Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
- URL: http://arxiv.org/abs/2512.00395v1
- Date: Sat, 29 Nov 2025 08:52:41 GMT
- Title: Better, Stronger, Faster: Tackling the Trilemma in MLLM-based Segmentation with Simultaneous Textual Mask Prediction
- Authors: Jiazhen Liu, Mingkuan Feng, Long Chen
- Abstract summary: We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed.
- Score: 10.813064862132379
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating segmentation into Multimodal Large Language Models (MLLMs) presents a core trilemma: simultaneously preserving dialogue ability, achieving high segmentation performance, and ensuring fast inference. Prevailing paradigms are forced into a compromise. Embedding prediction methods introduce a conflicting pixel-level objective that degrades the MLLM's general dialogue abilities. The alternative, next-token prediction, reframes segmentation as an autoregressive task, which preserves dialogue but forces a trade-off between poor segmentation performance with sparse outputs or prohibitive inference speeds with rich ones. We resolve this trilemma with all-mask prediction, a novel paradigm that decouples autoregressive dialogue generation from non-autoregressive mask prediction. We present STAMP: Simultaneous Textual All-Mask Prediction, an MLLM that embodies this paradigm. After generating a textual response, STAMP predicts an entire segmentation mask in a single forward pass by treating it as a parallel "fill-in-the-blank" task over image patches. This design maintains the MLLM's dialogue ability by avoiding conflicting objectives, enables high segmentation performance by leveraging rich, bidirectional spatial context for all mask tokens, and achieves exceptional speed. Extensive experiments show that STAMP significantly outperforms state-of-the-art methods across multiple segmentation benchmarks, providing a solution that excels in dialogue, segmentation, and speed without compromise.
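The abstract's core idea, predicting every mask token in one parallel forward pass rather than autoregressively, can be illustrated with a minimal toy sketch. This is not the paper's implementation; the per-patch scoring function and feature dimensions are hypothetical stand-ins, shown only to contrast one-pass "fill-in-the-blank" prediction with token-by-token generation.

```python
def patch_logits(patches, query):
    # Stand-in for the MLLM's single bidirectional forward pass:
    # one relevance score per image patch (hypothetical scoring).
    return [sum(p * q for p, q in zip(patch, query)) for patch in patches]

def all_mask_predict(patches, query, threshold=0.0):
    """Predict the entire binary mask in ONE pass: every patch is a
    parallel 'fill-in-the-blank' slot, so cost is one model call
    instead of one call per mask token (the autoregressive regime)."""
    return [1 if l > threshold else 0 for l in patch_logits(patches, query)]

# Toy example: four 2-dim patch features and a 2-dim query embedding.
patches = [[1.0, 0.2], [-0.5, 0.1], [0.8, -0.3], [-1.0, -0.9]]
query = [1.0, 0.5]
mask = all_mask_predict(patches, query)  # mask == [1, 0, 1, 0]
```

With an autoregressive decoder the same mask would cost as many forward passes as there are patch tokens, which is the speed leg of the trilemma the paper targets.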
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation. Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z) - IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation [44.89730606641666]
IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent consistently outperforms both closed-source and open-source SOTA methods. All datasets, code, and trained models will be released publicly.
arXiv Detail & Related papers (2026-01-06T14:37:50Z) - Seg-VAR: Image Segmentation with Visual Autoregressive Modeling [60.79579744943664]
We propose a novel framework that rethinks segmentation as a conditional autoregressive mask generation problem. This is achieved by replacing discriminative learning with a latent learning process. Our method incorporates three core components: (1) an image encoder generating latent priors from input images, (2) a spatial-aware seglat (a latent expression of the segmentation mask) encoder that maps segmentation masks into discrete latent tokens, and (3) a decoder reconstructing masks from these latents.
arXiv Detail & Related papers (2025-11-16T13:36:19Z) - ARGenSeg: Image Segmentation with Autoregressive Image Generation Model [46.837184955843355]
We propose a novel AutoRegressive Generation-based paradigm for image segmentation (ARGenSeg). Our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed.
arXiv Detail & Related papers (2025-10-23T17:58:26Z) - Text4Seg++: Advancing Image Segmentation via Generative Language Modeling [52.07442359419673]
We propose a novel text-as-mask paradigm that casts image segmentation as a text generation problem. The key innovation is semantic descriptors, a new textual representation of segmentation masks. Experiments on natural and remote sensing datasets show that Text4Seg++ consistently outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-09-08T04:07:14Z) - Plan for Speed: Dilated Scheduling for Masked Diffusion Language Models [13.575063025878208]
Masked diffusion language models promise fast, non-autoregressive text generation. Existing samplers, which pick tokens to unmask based on model confidence, ignore interactions when unmasking multiple positions in parallel. We propose the Dilated Unmasking Scheduler (DUS), an inference-only, planner-model-free method that partitions sequence positions into non-adjacent dilated groups and unmasks them in parallel.
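The dilated grouping described in the DUS summary can be sketched in a few lines. This is a simplified interpretation of "non-adjacent dilated groups" (positions taken at a fixed stride), not the paper's exact scheduler; group count and ordering are assumptions for illustration.

```python
def dilated_schedule(seq_len, num_groups):
    """Partition positions 0..seq_len-1 into `num_groups` strided
    groups. Each group is unmasked in one parallel step; with a
    stride of num_groups >= 2, no two positions unmasked together
    are adjacent, limiting intra-step token interactions."""
    return [list(range(g, seq_len, num_groups)) for g in range(num_groups)]

groups = dilated_schedule(8, 2)
# groups == [[0, 2, 4, 6], [1, 3, 5, 7]]: two parallel unmasking
# steps, each touching only non-neighboring positions.
```

Unmasking proceeds group by group, so the number of model calls drops from `seq_len` (one per token) to `num_groups`.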
arXiv Detail & Related papers (2025-06-23T18:49:23Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision [87.15580604023555]
Unpair-Seg is a novel weakly-supervised open-vocabulary segmentation framework.
It learns from unpaired image-mask and image-text pairs, which can be independently and efficiently collected.
It achieves 14.6% and 19.5% mIoU on the ADE-847 and PASCAL Context-459 datasets.
arXiv Detail & Related papers (2024-02-14T06:01:44Z) - Proposal-Free Volumetric Instance Segmentation from Latent Single-Instance Masks [16.217524435617744]
This work introduces a new proposal-free instance segmentation method.
It builds on single-instance segmentation masks predicted across the entire image in a sliding window style.
In contrast to related approaches, our method concurrently predicts all masks, one for each pixel, and thus resolves any conflict jointly across the entire image.
arXiv Detail & Related papers (2020-09-10T17:09:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.