UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World
Understanding
- URL: http://arxiv.org/abs/2401.06397v2
- Date: Thu, 18 Jan 2024 16:40:22 GMT
- Title: UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World
Understanding
- Authors: Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin
Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang
- Abstract summary: This paper extends Contrastive language-image pre-training (CLIP) with multi-granularity alignment.
We develop a unified multi-granularity learning framework, named UMG-CLIP, that simultaneously empowers the model with versatile perception abilities across different levels of detail.
- Score: 93.45067274442881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language foundation models, represented by Contrastive language-image
pre-training (CLIP), have gained increasing attention for jointly understanding
both vision and textual tasks. However, existing approaches primarily focus on
training models to match global image representations with textual
descriptions, thereby overlooking the critical alignment between local regions
and corresponding text tokens. This paper extends CLIP with multi-granularity
alignment. Notably, we deliberately construct a new dataset comprising pseudo
annotations at various levels of granularity, encompassing image-level,
region-level, and pixel-level captions/tags. Accordingly, we develop a unified
multi-granularity learning framework, named UMG-CLIP, that simultaneously
empowers the model with versatile perception abilities across different levels
of detail. Equipped with parameter-efficient tuning, UMG-CLIP surpasses current
widely used CLIP models and achieves state-of-the-art performance on diverse
image understanding benchmarks, including open-world recognition, retrieval,
semantic segmentation, and panoptic segmentation tasks. We hope UMG-CLIP can
serve as a valuable option for advancing vision-language foundation models.
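To make the multi-granularity alignment idea concrete, here is a minimal sketch of how image-level, region-level, and pixel-level embeddings could each be matched to their textual counterparts with a shared contrastive term. It assumes precomputed embeddings; the function names, shapes, and loss weights are illustrative assumptions, not the UMG-CLIP implementation.

```python
# Minimal sketch of multi-granularity image-text alignment (not the authors' code).
# Assumes precomputed embeddings; helper names and shapes are illustrative only.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of matched embeddings (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_granularity_loss(image_emb, caption_emb,        # (B, D): global image/caption pairs
                           region_emb, region_text_emb,   # (R, D): region/region-caption pairs
                           pixel_emb, tag_emb,            # (P, D): dense-feature/tag pairs
                           weights=(1.0, 1.0, 1.0)):
    """Combine image-, region-, and pixel-level contrastive terms."""
    w_img, w_reg, w_pix = weights
    return (w_img * info_nce(image_emb, caption_emb) +
            w_reg * info_nce(region_emb, region_text_emb) +
            w_pix * info_nce(pixel_emb, tag_emb))


if __name__ == "__main__":
    B, R, P, D = 8, 32, 64, 512
    loss = multi_granularity_loss(
        torch.randn(B, D), torch.randn(B, D),
        torch.randn(R, D), torch.randn(R, D),
        torch.randn(P, D), torch.randn(P, D),
    )
    print(loss.item())
```

In practice each level would draw its text embeddings from the corresponding captions or tags in the pseudo-annotated dataset; the relative weighting of the three levels is a tunable design choice.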
Related papers
- SmartCLIP: Modular Vision-language Alignment with Identification Guarantees [59.16312652369709]
Contrastive Language-Image Pre-training (CLIP) has emerged as a pivotal model in computer vision and multimodal learning.
CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations.
We introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner.
arXiv Detail & Related papers (2025-07-29T22:26:20Z)
- Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision.
We introduce a semantic decoupling module and a category-specific prompt optimization method in CLIP-based framework.
Our method effectively separates information from different categories and achieves better performance than the CLIP-based baseline method.
arXiv Detail & Related papers (2024-12-14T14:31:36Z)
- LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation [16.864086165056698]
Existing open-vocabulary approaches leverage vision-language models, such as CLIP, to align visual features with rich semantic features acquired through pre-training on large-scale vision-language datasets.
We propose to alleviate the issues by leveraging multiple large-scale models to enhance the alignment between fine-grained visual features and enriched linguistic features.
Our method achieves state-of-the-art performance across all major open-vocabulary segmentation benchmarks.
arXiv Detail & Related papers (2024-11-30T05:49:42Z)
- ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements [0.6990493129893112]
Recent advances in Vision Language Models have reshaped the evaluation paradigm in computer vision tasks.
These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks.
In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications.
Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks.
arXiv Detail & Related papers (2024-11-18T20:31:38Z)
- Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.
We propose Contrastive Localized Language-Image Pre-training (CLOC) by complementing CLIP with region-text contrastive loss and modules.
CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks.
arXiv Detail & Related papers (2024-10-03T17:56:09Z)
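As a rough illustration of the region-text contrastive idea in the CLOC entry above, the sketch below pools one embedding per annotated box from a dense feature map and contrasts it with a matching region caption embedding. The box format, coordinate scale, and single-scale roi_align pooling are simplifying assumptions, not the paper's actual design.

```python
# Illustrative sketch of a region-text contrastive term (not CLOC's implementation).
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align


def region_text_contrastive(feature_map: torch.Tensor,   # (B, D, H, W) dense image features
                            boxes: torch.Tensor,          # (R, 5) = [batch_idx, x1, y1, x2, y2]
                            region_text: torch.Tensor,    # (R, D) text embeddings, one per box
                            temperature: float = 0.07) -> torch.Tensor:
    # Pool one embedding per box from the dense feature map (boxes in feature-map coords).
    pooled = roi_align(feature_map, boxes, output_size=1, spatial_scale=1.0)  # (R, D, 1, 1)
    region_emb = F.normalize(pooled.flatten(1), dim=-1)                        # (R, D)
    text_emb = F.normalize(region_text, dim=-1)

    logits = region_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy example: 2 images, 6 annotated boxes with one caption embedding each.
feats = torch.randn(2, 512, 16, 16)
boxes = torch.tensor([[0, 1., 1., 5., 5.], [0, 2., 3., 9., 9.], [0, 0., 0., 15., 15.],
                      [1, 4., 4., 8., 8.], [1, 1., 6., 6., 12.], [1, 0., 0., 10., 10.]])
loss = region_text_contrastive(feats, boxes, torch.randn(6, 512))
```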
- Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
In order to refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z)
- Alpha-CLIP: A CLIP Model Focusing on Wherever You Want [77.17294130370921]
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks.
We introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel that suggests attentive regions, fine-tuned on millions of constructed RGBA region-text pairs.
It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation.
arXiv Detail & Related papers (2023-12-06T18:59:30Z)
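One generic way to give a CLIP-style ViT an extra region-hint channel, in the spirit of the Alpha-CLIP entry above, is to add a parallel zero-initialized projection for the alpha channel so the encoder initially behaves exactly like the RGB-only model. This is a hedged sketch of that general idea, not the Alpha-CLIP architecture itself; the class and parameter names are made up.

```python
# Generic sketch of feeding an extra alpha channel to a ViT patch embedding,
# zero-initialising the new weights so behaviour starts identical to the RGB model.
import torch
import torch.nn as nn


class RGBAPatchEmbed(nn.Module):
    def __init__(self, rgb_patch_conv: nn.Conv2d):
        super().__init__()
        self.rgb = rgb_patch_conv  # pretrained 3-channel patch projection
        # Parallel projection for the alpha (region-of-interest) channel.
        self.alpha = nn.Conv2d(1, rgb_patch_conv.out_channels,
                               kernel_size=rgb_patch_conv.kernel_size,
                               stride=rgb_patch_conv.stride, bias=False)
        nn.init.zeros_(self.alpha.weight)  # no-op until fine-tuned

    def forward(self, rgba: torch.Tensor) -> torch.Tensor:  # (B, 4, H, W)
        return self.rgb(rgba[:, :3]) + self.alpha(rgba[:, 3:])


# Example with a ViT-B/16-like patch projection.
patch_conv = nn.Conv2d(3, 768, kernel_size=16, stride=16)
embed = RGBAPatchEmbed(patch_conv)
tokens = embed(torch.rand(2, 4, 224, 224))  # (2, 768, 14, 14)
```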
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
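The pseudo-supervision recipe described above can be pictured as adding an auxiliary loss against a frozen expert's predictions on top of the usual contrastive objective. The sketch below is an assumption-laden illustration (the MSE loss, the small projection head, and the 0.5 weight are placeholders), not the paper's actual training setup.

```python
# Hedged sketch of pseudo-supervision from a frozen task-specific expert
# added on top of a contrastive objective (names and loss choice are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


def pseudo_supervised_step(clip_visual: nn.Module,   # trainable image encoder
                           expert: nn.Module,        # frozen model-zoo expert (e.g. depth/seg)
                           head: nn.Module,          # maps CLIP features into the expert's space
                           images: torch.Tensor,
                           contrastive_loss: torch.Tensor,
                           weight: float = 0.5) -> torch.Tensor:
    with torch.no_grad():
        pseudo_target = expert(images)                # expert prediction used as a pseudo-label
    pred = head(clip_visual(images))
    aux = F.mse_loss(pred, pseudo_target)             # regression-style pseudo-supervision
    return contrastive_loss + weight * aux


# Dummy example with toy modules standing in for real encoders.
visual = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
expert = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 16))
head = nn.Linear(64, 16)
loss = pseudo_supervised_step(visual, expert, head,
                              torch.rand(4, 3, 32, 32), torch.tensor(0.0))
```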
- StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
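To ground the text-to-pixel alignment mentioned in the CRIS entry above, the toy function below scores every spatial feature against a sentence embedding to produce a per-pixel logit map. It is a hedged sketch with made-up shapes; CRIS itself uses a dedicated vision-language decoder rather than this bare cosine similarity.

```python
# Minimal sketch of text-to-pixel alignment (illustrative only, not CRIS's decoder).
import torch
import torch.nn.functional as F


def text_to_pixel_logits(pixel_feats: torch.Tensor,  # (B, D, H, W) dense visual features
                         text_emb: torch.Tensor,     # (B, D) one referring expression per image
                         temperature: float = 0.07) -> torch.Tensor:
    pixel = F.normalize(pixel_feats, dim=1)
    text = F.normalize(text_emb, dim=-1)
    # Cosine similarity between the sentence embedding and every pixel feature.
    return torch.einsum("bdhw,bd->bhw", pixel, text) / temperature


masks = text_to_pixel_logits(torch.randn(2, 512, 30, 30), torch.randn(2, 512)).sigmoid()
```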
This list is automatically generated from the titles and abstracts of the papers in this site.