UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
- URL: http://arxiv.org/abs/2401.06397v3
- Date: Tue, 29 Oct 2024 07:05:36 GMT
- Title: UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
- Authors: Bowen Shi, Peisen Zhao, Zichen Wang, Yuhang Zhang, Yaoming Wang, Jin Li, Wenrui Dai, Junni Zou, Hongkai Xiong, Qi Tian, Xiaopeng Zhang
- Abstract summary: This paper extends Contrastive Language-Image Pre-training (CLIP) with multi-granularity alignment.
We develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities.
With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks.
- Score: 90.74967596080982
- License:
- Abstract: Vision-language foundation models, represented by Contrastive Language-Image Pre-training (CLIP), have gained increasing attention for jointly understanding vision and language tasks. However, existing approaches primarily focus on training models to match global image representations with textual descriptions, thereby overlooking the critical alignment between local regions and corresponding text tokens. This paper extends CLIP with multi-granularity alignment. Notably, we deliberately construct a new dataset comprising pseudo annotations at various levels of granularity, encompassing image-level, region-level, and pixel-level captions and tags. Accordingly, we develop a Unified Multi-Granularity learning framework, termed UMG-CLIP, which simultaneously empowers the model with versatile perception abilities across different levels of detail. With parameter-efficient tuning, UMG-CLIP surpasses current widely used CLIP variants and achieves state-of-the-art performance on diverse image understanding benchmarks, including open-world recognition, retrieval, semantic segmentation, and panoptic segmentation tasks. We believe that UMG-CLIP represents a valuable advancement in vision-language foundation models. The code is available at https://github.com/lygsbw/UMG-CLIP.
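As a rough illustration of the multi-granularity alignment described in the abstract, the sketch below combines image-level, region-level, and pixel-level contrastive terms into one objective. It is a minimal PyTorch-style sketch under assumed inputs (pre-computed per-level visual and text embeddings in a shared dimension, and hypothetical uniform loss weights), not UMG-CLIP's actual training code; see the linked repository for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(visual, text, temperature=0.07):
    # Symmetric InfoNCE over matched (visual, text) embedding pairs of shape (N, D).
    visual = F.normalize(visual, dim=-1)
    text = F.normalize(text, dim=-1)
    logits = visual @ text.t() / temperature            # (N, N) cosine-similarity logits
    targets = torch.arange(visual.size(0), device=visual.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def multi_granularity_loss(img_emb, img_txt,   # image-level pairs,  shape (B, D)
                           reg_emb, reg_txt,   # region-level pairs, shape (R, D)
                           pix_emb, pix_txt,   # pixel/mask-level pairs, shape (P, D)
                           weights=(1.0, 1.0, 1.0)):
    # Weighted sum of contrastive terms at three granularities.
    # Embedding sources and weights are placeholders, not the paper's actual heads.
    w_img, w_reg, w_pix = weights
    return (w_img * info_nce(img_emb, img_txt) +
            w_reg * info_nce(reg_emb, reg_txt) +
            w_pix * info_nce(pix_emb, pix_txt))
```

In practice, the region- and pixel-level embeddings would be pooled from the vision encoder's feature map and paired with the region- and pixel-level captions and tags from the pseudo-annotated dataset described in the abstract.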
Related papers
- ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements [0.6990493129893112]
Recent advances in Vision Language Models have reshaped the evaluation paradigm in computer vision tasks.
These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks.
In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications.
Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks.
arXiv Detail & Related papers (2024-11-18T20:31:38Z) - Contrastive Localized Language-Image Pre-Training [60.4967533101887]
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations.
We propose Contrastive Localized Language-Image Pre-training (CLOC), which complements CLIP with a region-text contrastive loss and accompanying modules.
CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks; a generic sketch of such a region-text contrastive term is given after this list.
arXiv Detail & Related papers (2024-10-03T17:56:09Z) - Concept-Guided Prompt Learning for Generalization in Vision-Language Models [33.361744437967126]
We propose Concept-Guided Prompt Learning for vision-language models.
We leverage the well-learned knowledge of Contrastive Language-Image Pretraining to create a visual concept cache.
To refine the text features, we develop a projector that transforms multi-level visual features into text features.
arXiv Detail & Related papers (2024-01-15T04:04:47Z) - Alpha-CLIP: A CLIP Model Focusing on Wherever You Want [77.17294130370921]
Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks.
We introduce Alpha-CLIP, an enhanced version of CLIP with an auxiliary alpha channel that indicates attentive regions, fine-tuned on millions of constructed RGBA region-text pairs.
It demonstrates effectiveness in various tasks, including but not limited to open-world recognition, multimodal large language models, and conditional 2D / 3D generation.
arXiv Detail & Related papers (2023-12-06T18:59:30Z) - CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z) - StyLIP: Multi-Scale Style-Conditioned Prompt Learning for CLIP-based Domain Generalization [26.08922351077744]
StyLIP is a novel approach for Domain Generalization that enhances CLIP's classification performance across domains.
Our method focuses on a domain-agnostic prompt learning strategy, aiming to disentangle the visual style and content information embedded in CLIP's pre-trained vision encoder.
arXiv Detail & Related papers (2023-02-18T07:36:16Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose CRIS, an end-to-end CLIP-Driven Referring Image Segmentation framework.
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
The proposed framework significantly outperforms prior state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
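For the region-level alignment mentioned for CLOC above (and used at the region granularity in UMG-CLIP), the following is a minimal sketch of what a region-text contrastive term could look like. It assumes RoI-aligned features from a dense vision-encoder feature map, boxes given in feature-map coordinates, and visual and text embeddings already projected to a shared dimension; the function name and shapes are illustrative, not any paper's actual API.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

def region_text_contrastive(feature_map, boxes, region_text_emb, temperature=0.07):
    # feature_map:     (B, C, H, W) dense features from the vision encoder.
    # boxes:           (K, 5) rows of (batch_index, x1, y1, x2, y2) in feature-map coords.
    # region_text_emb: (K, C) text embeddings of the matching region captions.
    # Pool one C-dimensional feature vector per region.
    region_feats = roi_align(feature_map, boxes, output_size=(1, 1)).flatten(1)  # (K, C)

    region_feats = F.normalize(region_feats, dim=-1)
    region_text_emb = F.normalize(region_text_emb, dim=-1)

    logits = region_feats @ region_text_emb.t() / temperature
    targets = torch.arange(boxes.size(0), device=logits.device)
    # Symmetric region-to-text and text-to-region cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

A pixel- or mask-level term can be built the same way by replacing RoI pooling with mask-average pooling over predicted or pseudo-labeled masks.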
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.