Related papers: GSVA: Generalized Segmentation via Multimodal Large Language Models

GSVA: Generalized Segmentation via Multimodal Large Language Models

URL: http://arxiv.org/abs/2312.10103v3
Date: Thu, 21 Mar 2024 09:20:49 GMT
Title: GSVA: Generalized Segmentation via Multimodal Large Language Models
Authors: Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, Gao Huang,
Abstract summary: Generalized Referring Expression (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. Current solutions to GRES remain unsatisfactory since segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt. We propose Generalized Vision Assistant (GSVA) to address this gap.
Score: 72.57095903188922
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models, MLLMs are proficient in understanding contexts with visual inputs. Among them, LISA, as a representative, adopts a special [SEG] token to prompt a segmentation mask decoder, e.g., SAM, to enable MLLMs in the RES task. However, existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper, we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically, GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the null targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue, marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks.

Related papers

GenRecal: Generation after Recalibration from Large to Small Vision-Language Models [63.27511432647797]
Vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V.<n>Recent advancements in vision-language models (VLMs) have leveraged large language models (LLMs) to achieve performance on par with closed-source systems like GPT-4V.
arXiv Detail & Related papers (2025-06-18T17:59:49Z)
Refer to Anything with Vision-Language Prompts [43.00233077605867]
"Refer to Any Mask Group" (RAS) augments segmentation models with complex multimodal interactions and comprehension.<n>We demonstrate superior performance of RAS on our new ORES task, as well as classic referring expression segmentation (RES) and generalized referring expression segmentation (GRES) tasks.
arXiv Detail & Related papers (2025-06-05T17:59:51Z)
GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing [33.19843463374473]
Vision-Language Models (VLMs) in remote sensing have demonstrated significant potential in traditional tasks. Current models, which excel in Referring Expression (REC), struggle with tasks involving complex instructions. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT) We propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition and a self-augmentation strategy based on cyclic referring.
arXiv Detail & Related papers (2025-03-16T12:48:17Z)
Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to their broader adoption. compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling [80.85164509232261]
We propose OneRef, a minimalist referring framework built on the modality-shared one-tower transformer. To modeling the referential relationship, we introduce a novel MVLM paradigm called Mask Referring Modeling (MRefM) Within MRefM, we propose a referring-aware dynamic image masking strategy that is aware of the referred region.
arXiv Detail & Related papers (2024-10-10T15:18:19Z)
EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Models (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training efforts than existing approaches. Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM. We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension with the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z)
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model [19.861556031795725]
We introduce a Multi-Granularity Large Multimodal Model (MGLMM) MGLMM is capable of seamlessly adjusting the granularity of Captioning (SegCap) following user instructions. It excels at tackling more than eight downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-20T11:13:31Z)
Bring Adaptive Binding Prototypes to Generalized Referring Expression Segmentation [18.806738617249426]
Generalized Referring Expression introduces new challenges by allowing expressions to describe multiple objects or lack specific object references. Existing RES methods, usually rely on sophisticated encoder-decoder and feature fusion modules. We propose a novel Model with Adaptive Binding Prototypes (MABP) that adaptively binds queries to object features in the corresponding region.
arXiv Detail & Related papers (2024-05-24T03:07:38Z)
UniRAG: Universal Retrieval Augmentation for Large Vision Language Models [76.30799731147589]
We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference. Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models and smaller open-source models significantly enhance their generation quality.
arXiv Detail & Related papers (2024-05-16T17:58:45Z)
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges. It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks. The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z)
LISA: Reasoning Segmentation via Large Language Model [68.24075852136761]
We propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. We present LISA: large Language Instructed Assistant, which inherits the language generation capabilities of multimodal Large Language Models.
arXiv Detail & Related papers (2023-08-01T17:50:17Z)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation. Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-RValModal. We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)

This list is automatically generated from the titles and abstracts of the papers in this site.