MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2408.14776v1
- Date: Tue, 27 Aug 2024 04:45:53 GMT
- Title: MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation
- Authors: Yuanbing Zhu, Bingke Zhu, Zhen Chen, Huan Xu, Ming Tang, Jinqiao Wang
- Abstract summary: We propose a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone.
MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder.
We demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks.
- Score: 33.67313662538398
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Open-vocabulary semantic segmentation aims to segment and recognize semantically meaningful regions based on text-based descriptions during inference. A typical solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between open- and closed-vocabulary recognition. Because VLMs are usually pretrained on low-resolution images (e.g. $224\times224$), most previous methods operate only on downscaled images. We question this design, as low-resolution features often fail to preserve fine details. Although employing additional image backbones for high-resolution inputs can mitigate this issue, it may also introduce significant computation overhead. We therefore propose MROVSeg, a multi-resolution training framework for open-vocabulary semantic segmentation built on a single pretrained CLIP backbone, which uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder. Its key component is a Multi-Res Adapter, which restores the spatial geometry and captures local-global correspondences across patches through learnable convolutional and scale attention layers. To achieve accurate segmentation, we introduce a Multi-grained Masked Attention scheme that aggregates multi-grained semantics by performing cross-attention between object queries and multi-resolution CLIP features within the regions of interest. Through comprehensive experiments, we demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks, particularly for high-resolution inputs, establishing a new standard for open-vocabulary semantic segmentation.
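The two mechanisms the abstract names lend themselves to a compact sketch. Below is a minimal, illustrative PyTorch version: slicing a high-resolution image into encoder-sized windows, and a Mask2Former-style masked cross-attention used as a stand-in for the paper's Multi-grained Masked Attention. The function names, the 224-pixel window size, and the 0.5 masking threshold are assumptions for illustration, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def slice_into_windows(image: torch.Tensor, win: int = 224):
    """Slice a high-resolution image into uniform windows, each matching
    the pretrained CLIP encoder's input size (illustrative sketch).

    image: (B, C, H, W); H and W are zero-padded up to multiples of `win`.
    Returns (B * nH * nW, C, win, win) windows plus the grid shape (nH, nW),
    needed later to restore the spatial geometry of the window features.
    """
    B, C, H, W = image.shape
    image = F.pad(image, (0, (win - W % win) % win, 0, (win - H % win) % win))
    nH, nW = image.shape[-2] // win, image.shape[-1] // win
    windows = (image.unfold(2, win, win)        # (B, C, nH, W', win)
                    .unfold(3, win, win)        # (B, C, nH, nW, win, win)
                    .permute(0, 2, 3, 1, 4, 5)  # (B, nH, nW, C, win, win)
                    .reshape(B * nH * nW, C, win, win))
    return windows, (nH, nW)

def masked_cross_attention(queries, feats, mask_logits):
    """Single-head masked cross-attention in the style of Mask2Former,
    standing in for the paper's Multi-grained Masked Attention.

    queries:     (B, Q, D) object queries.
    feats:       (B, N, D) flattened (multi-resolution) CLIP features.
    mask_logits: (B, Q, N) mask predictions from the previous decoder layer.
    Each query attends only inside its own predicted region of interest.
    """
    attn = queries @ feats.transpose(1, 2) / feats.shape[-1] ** 0.5  # (B, Q, N)
    keep = mask_logits.sigmoid() >= 0.5
    # If a query's predicted mask is empty, fall back to full attention
    # so the softmax row is never all -inf.
    keep = keep | ~keep.any(dim=-1, keepdim=True)
    attn = attn.masked_fill(~keep, float("-inf"))
    return torch.softmax(attn, dim=-1) @ feats                       # (B, Q, D)
```

In this reading of the abstract, each window would pass through the frozen CLIP encoder, the per-window feature maps would be stitched back into one high-resolution map using the stored (nH, nW) grid, the Multi-Res Adapter's convolutional and scale-attention layers would refine the stitched features, and the decoder would alternate mask prediction with the masked attention above.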
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method for semantic segmentation.
We introduce Contrastive Soft Clustering to align masks with the image's structural information.
InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
- Towards Open-Vocabulary Semantic Segmentation Without Semantic Labels [53.8817160001038]
We propose a novel method, PixelCLIP, to adapt the CLIP image encoder for pixel-level understanding.
To address the challenges of leveraging masks without semantic labels, we devise an online clustering algorithm.
PixelCLIP shows significant performance improvements over CLIP and competitive results compared to caption-supervised methods.
arXiv Detail & Related papers (2024-09-30T01:13:03Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- SemPLeS: Semantic Prompt Learning for Weakly-Supervised Semantic Segmentation [36.41778553250247]
Weakly-Supervised Semantic Segmentation (WSSS) aims to train segmentation models using image data with only image-level supervision.
We propose a Semantic Prompt Learning for WSSS (SemPLeS) framework, which learns to effectively prompt the CLIP latent space.
SemPLeS can perform better semantic alignment between object regions and the associated class labels.
arXiv Detail & Related papers (2024-01-22T09:41:05Z)
- MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation [110.09800389100599]
We propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation.
Our approach generates fine-grained patch-text pairs by mixing image patches while preserving the correspondence between patches and text.
With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment.
arXiv Detail & Related papers (2023-08-09T09:35:16Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- Progressively Dual Prior Guided Few-shot Semantic Segmentation [57.37506990980975]
The few-shot semantic segmentation task aims to segment query images given only a few annotated support samples.
We propose a progressively dual prior guided few-shot semantic segmentation network.
arXiv Detail & Related papers (2022-11-20T16:19:47Z)
- AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation poses some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while running as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z)
- Unsupervised segmentation via semantic-apparent feature fusion [21.75371777263847]
This research proposes an unsupervised foreground segmentation method based on semantic-apparent feature fusion (SAFF).
Semantic features accurately highlight key regions of the foreground object, while apparent features provide richer detailed expression.
By fusing semantic and apparent features, and by cascading modules for intra-image adaptive feature weight learning and inter-image common feature learning, the method achieves performance that significantly exceeds baselines.
arXiv Detail & Related papers (2020-05-21T08:28:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.