Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
- URL: http://arxiv.org/abs/2508.11256v1
- Date: Fri, 15 Aug 2025 06:43:51 GMT
- Title: Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception
- Authors: Junjie Wang, Keyu Chen, Yulin Li, Bin Chen, Hengshuang Zhao, Xiaojuan Qi, Zhuotao Tian
- Abstract summary: DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
- Score: 71.26728044621458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense visual perception tasks have been constrained by their reliance on predefined categories, limiting their applicability in real-world scenarios where visual concepts are unbounded. While Vision-Language Models (VLMs) like CLIP have shown promise in open-vocabulary tasks, their direct application to dense perception often leads to suboptimal performance due to limitations in local feature representation. In this work, we present our observation that CLIP's image tokens struggle to effectively aggregate information from spatially or semantically related regions, resulting in features that lack local discriminability and spatial consistency. To address this issue, we propose DeCLIP, a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. The context features are enhanced by jointly distilling semantic correlations from Vision Foundation Models (VFMs) and object integrity cues from diffusion models, thereby enhancing spatial consistency. In parallel, the content features are aligned with image crop representations and constrained by region correlations from VFMs to improve local discriminability. Extensive experiments demonstrate that DeCLIP establishes a solid foundation for open-vocabulary dense perception, consistently achieving state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation. Code is available at https://github.com/xiaomoguhz/DeCLIP
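To make the decoupling idea concrete, below is a minimal PyTorch sketch of how a self-attention block could be split into a "context" path (token-to-token correlations, the part the abstract says is distilled from VFM semantic correlations and diffusion-model integrity cues) and a "content" path (per-token features that would be aligned with image-crop representations). This is an illustrative assumption about the mechanism, not DeCLIP's actual implementation; the module name and the exact split are hypothetical, and the official repository linked above is authoritative.

```python
# Hedged sketch: one possible way to expose "content" and "context" features
# from a self-attention block. The exact decoupling used by DeCLIP may differ.
import torch
import torch.nn as nn

class DecoupledSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, num_tokens, dim) image tokens from a CLIP-like encoder
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

        # "Context" path: token-to-token correlation maps, the kind of signal
        # that could be supervised with VFM semantic correlations and object
        # integrity cues to improve spatial consistency.
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        context = attn.softmax(dim=-1)                      # (b, heads, n, n)

        # "Content" path: per-token features, the kind of signal that could be
        # aligned with image-crop embeddings to improve local discriminability.
        content = (context @ v).transpose(1, 2).reshape(b, n, d)
        content = self.proj(content)                        # (b, n, dim)
        return content, context

# Example usage: 196 image tokens with 512-dim embeddings.
# x = torch.randn(2, 196, 512)
# content, context = DecoupledSelfAttention(512)(x)
```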
Related papers
- Structure-Aware Feature Rectification with Region Adjacency Graphs for Training-Free Open-Vocabulary Semantic Segmentation [22.409969687852506]
We propose a structure-aware feature rectification approach that incorporates instance-specific priors derived directly from the image. Our method effectively suppresses segmentation noise, improves region-level consistency, and achieves strong performance on multiple open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2025-12-08T10:00:36Z)
- Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework [16.961220047066792]
ViLaCo is a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models. ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.
arXiv Detail & Related papers (2025-08-02T12:14:29Z)
- LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z)
- FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [55.01077993490845]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z)
- DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception [21.87721909270275]
DeCLIP is a novel framework that enhances CLIP with "content" and "context" features respectively. It significantly outperforms existing methods across multiple open-vocabulary dense prediction tasks.
arXiv Detail & Related papers (2025-05-07T13:46:34Z)
- Refining CLIP's Spatial Awareness: A Visual-Centric Perspective [10.936397225984107]
Contrastive Language-Image Pre-training excels in global alignment with language but exhibits limited sensitivity to spatial information. Recent approaches have introduced Region-Language Alignment to enhance CLIP's performance in dense multimodal tasks. We propose the Spatial Correlation Distillation (SCD) framework, which preserves CLIP's inherent spatial structure and mitigates the above degradation.
arXiv Detail & Related papers (2025-04-03T07:04:56Z)
- AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Document Understanding [79.43306110124875]
AlignVLM is a vision-text alignment method that maps visual features to a weighted average of text embeddings. Our experiments show that AlignVLM achieves state-of-the-art performance compared to prior alignment methods.
arXiv Detail & Related papers (2025-02-03T13:34:51Z)
- ResCLIP: Residual Attention for Training-free Dense Vision-language Inference [27.551367463011008]
Cross-correlation of self-attention in CLIP's non-final layers also exhibits localization properties.
We propose the Residual Cross-correlation Self-attention (RCS) module, which leverages the cross-correlation self-attention from intermediate layers to remold the attention in the final block.
The RCS module effectively reorganizes spatial information, unleashing the localization potential within CLIP for dense vision-language inference.
arXiv Detail & Related papers (2024-11-24T14:14:14Z)
- ProxyCLIP: Proxy Attention Improves CLIP for Open-Vocabulary Segmentation [32.852004564832455]
Open-vocabulary semantic segmentation requires models to integrate visual representations with semantic labels.
This paper introduces ProxyCLIP, a framework designed to harmonize the strengths of Contrastive Language-Image Pre-training (CLIP) and Vision Foundation Models (VFMs).
As a training-free approach, ProxyCLIP significantly improves the average mean Intersection over Union (mIoU) across eight benchmarks from 40.3 to 44.4.
arXiv Detail & Related papers (2024-08-09T06:17:00Z)
- UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [90.74967596080982]
This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
arXiv Detail & Related papers (2024-01-12T06:35:09Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores, considering the discrepancy between the images and their neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)