Related papers: Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

URL: http://arxiv.org/abs/2409.13407v1
Date: Fri, 20 Sep 2024 11:13:31 GMT
Title: Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model
Authors: Li Zhou, Xu Yuan, Zenghui Sun, Zikun Zhou, Jingsong Lan,
Abstract summary: We introduce a Multi-Granularity Large Multimodal Model (MGLMM) MGLMM is capable of seamlessly adjusting the granularity of Captioning (SegCap) following user instructions. It excels at tackling more than eight downstream tasks and achieves state-of-the-art performance.
Score: 19.861556031795725
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Multimodal Models (LMMs) have achieved significant progress by extending large language models. Building on this progress, the latest developments in LMMs demonstrate the ability to generate dense pixel-wise segmentation through the integration of segmentation models.Despite the innovations, the textual responses and segmentation masks of existing works remain at the instance level, showing limited ability to perform fine-grained understanding and segmentation even provided with detailed textual cues.To overcome this limitation, we introduce a Multi-Granularity Large Multimodal Model (MGLMM), which is capable of seamlessly adjusting the granularity of Segmentation and Captioning (SegCap) following user instructions, from panoptic SegCap to fine-grained SegCap. We name such a new task Multi-Granularity Segmentation and Captioning (MGSC). Observing the lack of a benchmark for model training and evaluation over the MGSC task, we establish a benchmark with aligned masks and captions in multi-granularity using our customized automated annotation pipeline. This benchmark comprises 10K images and more than 30K image-question pairs. We will release our dataset along with the implementation of our automated dataset annotation pipeline for further research.Besides, we propose a novel unified SegCap data format to unify heterogeneous segmentation datasets; it effectively facilitates learning to associate object concepts with visual features during multi-task training. Extensive experiments demonstrate that our MGLMM excels at tackling more than eight downstream tasks and achieves state-of-the-art performance in MGSC, GCG, image captioning, referring segmentation, multiple and empty segmentation, and reasoning segmentation tasks. The great performance and versatility of MGLMM underscore its potential impact on advancing multimodal research.

Related papers

Segment and Matte Anything in a Unified Model [5.8874968768571625]
Segment Anything (SAM) has recently pushed the boundaries of segmentation by demonstrating zero-shot generalization and flexible prompting.<n>We introduce Segment And Matte Anything (SAMA), a lightweight extension of SAM that delivers high-quality interactive image segmentation and matting.
arXiv Detail & Related papers (2026-01-17T19:43:10Z)
ARGenSeg: Image Segmentation with Autoregressive Image Generation Model [46.837184955843355]
We propose a novel AutoRegressive Generation-based paradigm for image (ARGenSeg)<n>Our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed.
arXiv Detail & Related papers (2025-10-23T17:58:26Z)
X-SAM: From Segment Anything to Any Segmentation [63.79182974315084]
Large Language Models (LLMs) demonstrate strong capabilities in broad knowledge representation, yet they are inherently deficient in pixel-level perceptual understanding.<n>We present X-SAM, a streamlined Multimodal Large Language Model framework that extends the segmentation paradigm from textitsegment anything to textitany segmentation.<n>We propose a new segmentation task, termed Visual GrounDed (VGD) segmentation, which segments all instance objects with interactive visual prompts and empowers MLLMs with visual grounded, pixel-wise interpretative capabilities.
arXiv Detail & Related papers (2025-08-06T17:19:10Z)
HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model [6.641903410779405]
We propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens. HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the next-token-prediction paradigm. We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning.
arXiv Detail & Related papers (2025-03-17T10:29:08Z)
Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z)
Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption. We leverage a large multimodal model to extract semantic nouns, a class-a segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks. It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences. We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. We present LISA: large Language Instructed Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z)
Semantic-SAM: Segment and Recognize Anything at Any Granularity [83.64686655044765]
We introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity. We consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels.
arXiv Detail & Related papers (2023-07-10T17:59:40Z)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal. We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.