Generalizable Entity Grounding via Assistance of Large Language Model
- URL: http://arxiv.org/abs/2402.02555v1
- Date: Sun, 4 Feb 2024 16:06:05 GMT
- Title: Generalizable Entity Grounding via Assistance of Large Language Model
- Authors: Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong
Guo, Yu Xu, Ming-Hsuan Yang
- Abstract summary: We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
- Score: 77.07759442298666
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this work, we propose a novel approach to densely ground visual entities
from a long caption. We leverage a large multimodal model (LMM) to extract
semantic nouns, a class-agnostic segmentation model to generate entity-level
segmentation, and the proposed multi-modal feature fusion module to associate
each semantic noun with its corresponding segmentation mask. Additionally, we
introduce a strategy of encoding entity segmentation masks into a colormap,
enabling the preservation of fine-grained predictions from features of
high-resolution masks. This approach allows us to extract visual features from
low-resolution images using the CLIP vision encoder in the LMM, which is more
computationally efficient than existing approaches that use an additional
encoder for high-resolution images. Our comprehensive experiments demonstrate
the superiority of our method, outperforming state-of-the-art techniques on
three tasks, including panoptic narrative grounding, referring expression
segmentation, and panoptic segmentation.
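The association step described above can be illustrated with a minimal sketch. It assumes the LMM has already produced a text feature for each extracted noun and the class-agnostic segmentation model has produced a visual feature for each entity mask, both projected into a shared feature space; each noun is then assigned to its most similar mask. The function and tensor names are illustrative, and the paper learns this association with its multi-modal feature fusion module rather than a fixed cosine-similarity argmax.

```python
import torch
import torch.nn.functional as F

def associate_nouns_with_masks(noun_embeds: torch.Tensor,
                               mask_embeds: torch.Tensor) -> torch.Tensor:
    """Assign each semantic noun to the entity mask whose feature is most
    similar, using cosine similarity in a shared multi-modal feature space.

    noun_embeds: (N, D) text features, one per noun extracted by the LMM.
    mask_embeds: (M, D) visual features, one per class-agnostic entity mask.
    Returns an (N,) tensor mapping each noun index to a mask index.
    """
    noun_embeds = F.normalize(noun_embeds, dim=-1)
    mask_embeds = F.normalize(mask_embeds, dim=-1)
    similarity = noun_embeds @ mask_embeds.T  # (N, M) cosine similarities
    return similarity.argmax(dim=-1)

# Toy example: 3 nouns extracted from a caption, 5 entity masks from the image.
nouns = ["dog", "frisbee", "grass"]
noun_embeds = torch.randn(len(nouns), 256)   # stand-in text features
mask_embeds = torch.randn(5, 256)            # stand-in mask features
assignment = associate_nouns_with_masks(noun_embeds, mask_embeds)
for noun, mask_id in zip(nouns, assignment.tolist()):
    print(f"noun '{noun}' -> entity mask {mask_id}")
```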
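The colormap encoding of entity masks can likewise be illustrated with a small, self-contained example: each entity ID is mapped to a distinct palette color, and the ID map can be recovered by nearest-color lookup. The palette, image size, and decoding step below are assumptions for illustration only; the paper's actual encoding scheme, and how the resulting colormap interacts with the CLIP vision encoder in the LMM, may differ.

```python
import numpy as np

# A small fixed palette; in practice one distinct color per entity ID.
PALETTE = np.array([[0, 0, 0],        # 0: background
                    [255, 0, 0],      # 1: entity 1
                    [0, 255, 0],      # 2: entity 2
                    [0, 0, 255]],     # 3: entity 3
                   dtype=np.uint8)

def encode_masks_to_colormap(entity_ids: np.ndarray) -> np.ndarray:
    """Turn an (H, W) map of entity IDs into an (H, W, 3) colormap image."""
    return PALETTE[entity_ids]

def decode_colormap_to_masks(colormap: np.ndarray) -> np.ndarray:
    """Recover entity IDs from a colormap by nearest-palette-color lookup."""
    dists = np.linalg.norm(
        colormap[..., None, :].astype(np.int32) - PALETTE[None, None, :, :],
        axis=-1)
    return dists.argmin(axis=-1)

# Toy example: a 4x4 "image" containing three entities plus background.
entity_ids = np.array([[0, 1, 1, 0],
                       [0, 1, 2, 2],
                       [3, 3, 2, 2],
                       [3, 3, 0, 0]])
colormap = encode_masks_to_colormap(entity_ids)
recovered = decode_colormap_to_masks(colormap)
assert np.array_equal(recovered, entity_ids)  # lossless round trip
print(recovered)
```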
Related papers
- Towards Fine-grained Interactive Segmentation in Images and Videos [21.22536962888316]
We present a SAM2Refiner framework built upon the SAM2 backbone.
This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos.
In addition, a mask refinement module is devised by employing a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder.
arXiv Detail & Related papers (2025-02-12T06:38:18Z) - Freestyle Sketch-in-the-Loop Image Segmentation [116.1810651297801]
We introduce a "sketch-in-the-loop" image segmentation framework, enabling the segmentation of visual concepts partially, completely, or in groupings.
This framework capitalises on the synergy between sketch-based image retrieval models and large-scale pre-trained models.
Our purpose-made augmentation strategy enhances the versatility of our sketch-guided mask generation, allowing segmentation at multiple levels.
arXiv Detail & Related papers (2025-01-27T13:07:51Z) - CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models [2.331828779757202]
We introduce the new task of part-focused semantic co-segmentation, which seeks to identify and segment common and unique objects and parts across images.
We present CALICO, the first LVLM that can segment and reason over multiple masks across images, enabling object comparison based on their constituent parts.
arXiv Detail & Related papers (2024-12-26T18:59:37Z) - LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation [16.864086165056698]
Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets.
We propose to alleviate the limitations of such approaches by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features.
Our method achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2024-11-30T05:49:42Z) - Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model [19.861556031795725]
We introduce a Multi-Granularity Large Multimodal Model (MGLMM).
MGLMM is capable of seamlessly adjusting the granularity of segmentation and captioning (SegCap) following user instructions.
It excels at tackling more than eight downstream tasks and achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-09-20T11:13:31Z) - FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers.
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
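(A minimal sketch of such a cross-model embedding adapter appears after this related-papers list.)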
arXiv Detail & Related papers (2024-08-09T02:16:37Z) - Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged as a task targeting high-precision object segmentation in high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Motivated by this, we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z) - PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the challenges of segmentation tasks.
It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks.
The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z) - RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight cross-modal module.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z) - CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation [29.885991324519463]
We propose a novel cross-modality masked self-distillation framework named CM-MaskSD.
Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment.
Our framework can considerably boost model performance in a nearly parameter-free manner.
arXiv Detail & Related papers (2023-05-19T07:17:27Z)
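The FUSE entry above describes approximating an adapter layer that maps one model's textual embedding space into another's, even across different tokenizers. The snippet below is a minimal least-squares sketch of that general idea, assuming paired embeddings of the same texts from two models are available; it does not reproduce FUSE's actual approximation procedure, objectives, or tokenizer handling, and all names in it are illustrative.

```python
import numpy as np

def fit_linear_adapter(src_embeds: np.ndarray, tgt_embeds: np.ndarray) -> np.ndarray:
    """Fit a linear map W so that src_embeds @ W approximates tgt_embeds.

    src_embeds: (N, D_src) embeddings of N texts from model A.
    tgt_embeds: (N, D_tgt) embeddings of the same texts from model B.
    Returns W of shape (D_src, D_tgt), solved by least squares.
    """
    W, *_ = np.linalg.lstsq(src_embeds, tgt_embeds, rcond=None)
    return W

# Toy example with random stand-in embeddings for the same 100 texts,
# as if produced by two different models (and tokenizers).
rng = np.random.default_rng(0)
src = rng.normal(size=(100, 64))                           # model A features
true_map = rng.normal(size=(64, 48))
tgt = src @ true_map + 0.01 * rng.normal(size=(100, 48))   # model B features

W = fit_linear_adapter(src, tgt)
new_src = rng.normal(size=(5, 64))
mapped = new_src @ W        # new model-A embeddings mapped into model B's space
print(mapped.shape)         # (5, 48)
```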
This list is automatically generated from the titles and abstracts of the papers in this site.