Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs
- URL: http://arxiv.org/abs/2510.16785v1
- Date: Sun, 19 Oct 2025 10:21:01 GMT
- Title: Segmentation as A Plug-and-Play Capability for Frozen Multimodal LLMs
- Authors: Jiazhen Liu, Long Chen
- Abstract summary: We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. It achieves segmentation performance competitive with or superior to that of retraining-based methods.
- Score: 9.6979217203587
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Integrating diverse visual capabilities into a unified model is a significant trend in Multimodal Large Language Models (MLLMs). Among these, the inclusion of segmentation poses a distinct set of challenges. To equip MLLMs with pixel-level segmentation abilities, prevailing methods require finetuning the model to produce specific outputs compatible with a mask decoder. This process typically alters the model's output space and compromises its intrinsic generalization, which undermines the goal of building a unified model. We introduce LENS (Leveraging kEypoiNts for MLLMs' Segmentation), a novel plug-and-play solution. LENS attaches a lightweight, trainable head to a completely frozen MLLM. By refining the spatial cues embedded in attention maps, LENS extracts keypoints and describes them into point-wise features directly compatible with the mask decoder. Extensive experiments validate our approach: LENS achieves segmentation performance competitive with or superior to that of retraining-based methods. Crucially, it does so while fully preserving the MLLM's generalization capabilities, which are significantly degraded by finetuning approaches. As such, the attachable design of LENS establishes an efficient and powerful paradigm for extending MLLMs, paving the way for truly multi-talented, unified models.
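As a rough illustration of the pipeline described above, the following is a minimal, hypothetical PyTorch sketch of an attachable keypoint head in the spirit of LENS. It is not the authors' implementation: the module structure, dimensions, top-k keypoint selection, and the attention-map shape are all assumptions made for the sketch; only the overall flow (refine attention-map cues from the frozen MLLM, extract keypoints, and turn the corresponding patch features into point-wise inputs for a mask decoder) follows the abstract.
```python
import torch
import torch.nn as nn


class KeypointHead(nn.Module):
    """Hypothetical lightweight head in the spirit of LENS (a sketch, not the
    paper's code): refines an attention map from a frozen MLLM into keypoints
    and point-wise features that a SAM-style mask decoder could consume."""

    def __init__(self, vis_dim: int = 1024, prompt_dim: int = 256, num_points: int = 8):
        super().__init__()
        self.num_points = num_points
        # Small conv stack that sharpens the raw attention map into a keypoint heatmap.
        self.refine = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )
        # Projects patch features at the selected keypoints into prompt embeddings.
        self.point_proj = nn.Linear(vis_dim, prompt_dim)

    def forward(self, attn_map: torch.Tensor, patch_feats: torch.Tensor) -> torch.Tensor:
        # attn_map:    (B, H, W)   attention from the referring text to image patches (assumed)
        # patch_feats: (B, H*W, C) frozen vision-encoder patch features (assumed)
        # returns:     (B, K, prompt_dim) point-wise embeddings for a mask decoder
        heat = self.refine(attn_map.unsqueeze(1)).flatten(2).squeeze(1)   # (B, H*W)
        topk = heat.topk(self.num_points, dim=-1).indices                 # (B, K) keypoint indices
        idx = topk.unsqueeze(-1).expand(-1, -1, patch_feats.size(-1))     # (B, K, C)
        point_feats = patch_feats.gather(1, idx)                          # features at keypoints
        return self.point_proj(point_feats)


if __name__ == "__main__":
    # Dummy tensors standing in for a frozen MLLM's outputs (shapes are assumptions).
    head = KeypointHead()
    attn = torch.rand(2, 24, 24)           # assumed 24x24 patch grid
    feats = torch.rand(2, 24 * 24, 1024)   # assumed ViT patch features
    print(head(attn, feats).shape)         # torch.Size([2, 8, 256])
```
Only such a small head would be trained; how LENS actually pools attention across layers and heads, selects keypoints, and interfaces with the mask decoder is specified in the paper, not in this sketch.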
Related papers
- From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection. We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z) - FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z) - Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents [55.82787697101274]
Bifrost-1 is a unified framework that bridges pretrained multimodal LLMs (MLLMs) and diffusion models. By seamlessly integrating pretrained MLLMs and diffusion models with patch-level CLIP latents, our framework enables high-fidelity controllable image generation. Our experiments demonstrate that Bifrost-1 achieves comparable or better performance than previous methods in terms of visual fidelity and multimodal understanding.
arXiv Detail & Related papers (2025-08-08T02:38:47Z) - Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder [18.236863512276187]
We propose a novel framework that exploits the inherent visual detail features encoded in the MLLM vision encoder without introducing an extra visual encoder. In addition, we propose a detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully integrates the detail-related visual features with the semantic-related features output by the large language model (LLM) of the MLLM. Our method generally surpasses both SAM-based and SAM-free competitors, striking a better balance between performance and cost.
arXiv Detail & Related papers (2025-08-06T06:06:52Z) - SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories [52.57696897619189]
We introduce the Human-Like Mask Modeling Task (HLMAT), a new paradigm where MLLMs mimic human annotators using interactive segmentation tools. HLMAT enables MLLMs to iteratively generate text-based click points, achieving high-quality masks without architectural changes or implicit tokens. HLMAT provides a protocol for assessing fine-grained pixel understanding in MLLMs and introduces a vision-centric, multi-step decision-making task.
arXiv Detail & Related papers (2025-03-11T17:08:54Z) - LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models [9.660892239615364]
This work explores fusion strategies of visual tokens for hybrid MLLMs, leading to the design of LEO. LEO is a novel MLLM with a dual-branch vision encoder framework that incorporates a post-adaptation fusion strategy and adaptive tiling. We show that LEO can be adapted to the specialized domain of autonomous driving without altering the model architecture or training recipe.
arXiv Detail & Related papers (2025-01-13T00:29:55Z) - SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding [66.74446220401296]
We propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. We introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding. Our code and models shall be released.
arXiv Detail & Related papers (2024-12-12T18:59:26Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [72.68665884790002]
We propose a novel framework to transfer knowledge from large MLLMs (l-MLLMs) to small MLLMs (s-MLLMs). We introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities. We also propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy.
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation [6.181169909576527]
Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories under supervision from only the seen ones.
Existing methods adopt large-scale Vision-Language Models (VLMs), which obtain outstanding zero-shot performance.
We propose CLIP-ZSS (Zero-shot Semantic Segmentation), a training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks.
arXiv Detail & Related papers (2023-10-03T09:33:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.