CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
- URL: http://arxiv.org/abs/2305.08685v5
- Date: Tue, 19 Nov 2024 14:52:04 GMT
- Title: CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
- Authors: Linhui Xiao, Xiaoshan Yang, Fang Peng, Ming Yan, Yaowei Wang, Changsheng Xu
- Abstract summary: Unsupervised visual grounding has been developed to locate regions using pseudo-labels.
We propose CLIP-VG, a novel method that can conduct self-paced curriculum adapting of CLIP with pseudo-language labels.
Our method outperforms the current state-of-the-art unsupervised method by a significant margin on RefCOCO/+/g datasets.
- Score: 86.79903269137971
- Abstract: Visual Grounding (VG) is a crucial topic in the field of vision and language, which involves locating a specific region described by an expression within an image. To reduce the reliance on manually labeled data, unsupervised visual grounding has been developed to locate regions using pseudo-labels. However, the performance of existing unsupervised methods depends heavily on the quality of the pseudo-labels, and these methods often suffer from limited diversity. To leverage vision-and-language pre-trained models for the grounding problem while making reasonable use of pseudo-labels, we propose CLIP-VG, a novel method that conducts self-paced curriculum adapting of CLIP with pseudo-language labels. We propose a simple yet efficient end-to-end network architecture to transfer CLIP to visual grounding. On top of this CLIP-based architecture, we further propose single-source and multi-source curriculum adapting algorithms, which progressively select more reliable pseudo-labels to learn an optimal model, thereby balancing reliability and diversity of the pseudo-language labels. Our method outperforms the current state-of-the-art unsupervised method by a significant margin on the RefCOCO/+/g datasets in both single-source and multi-source scenarios, with improvements ranging from 6.78% to 10.67% and from 11.39% to 14.87%, respectively. The results even surpass existing weakly supervised visual grounding methods. Furthermore, our method is also competitive in the fully supervised setting. The code and models are available at https://github.com/linhuixiao/CLIP-VG.
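The self-paced curriculum idea described in the abstract, iteratively keeping only the pseudo-labels that the current model rates as reliable and re-adapting on that subset, can be sketched as follows. This is a minimal, hedged illustration: the helper names (`self_paced_adapt`, `predict`, `adapt`), the IoU-based reliability measure, and the threshold schedule are assumptions for exposition, not the CLIP-VG implementation (see https://github.com/linhuixiao/CLIP-VG for the authors' code).

```python
# Illustrative sketch of self-paced curriculum selection over pseudo-labels.
# The data layout and helper names are assumptions for exposition; they do not
# reproduce the CLIP-VG codebase.
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Sample = Tuple[object, str, Box]          # (image, expression, pseudo box)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union between two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


def self_paced_adapt(
    predict: Callable[[object, str], Box],   # current model: (image, text) -> box
    adapt: Callable[[List[Sample]], None],   # one fine-tuning pass over a subset
    pseudo_samples: List[Sample],
    rounds: int = 3,
    start_thresh: float = 0.7,
    step: float = 0.1,
) -> None:
    """Progressively admit pseudo-labels from most to least reliable."""
    thresh = start_thresh
    for _ in range(rounds):
        # Keep only pseudo-labels the current model already agrees with.
        subset = [(img, expr, box) for img, expr, box in pseudo_samples
                  if iou(predict(img, expr), box) >= thresh]
        adapt(subset)                        # re-adapt the CLIP-based grounder
        thresh = max(0.0, thresh - step)     # later rounds admit more diversity
```

Relaxing the threshold across rounds is one simple way to trade reliability for diversity; the paper's single-source and multi-source curriculum algorithms realize this balance with their own selection criteria.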
Related papers
- SiamSeg: Self-Training with Contrastive Learning for Unsupervised Domain Adaptation Semantic Segmentation in Remote Sensing [14.007392647145448]
Unsupervised domain adaptation (UDA) enables models to learn from unlabeled target-domain data while training on labeled source-domain data.
We propose integrating contrastive learning into UDA, enhancing the model's capacity to capture semantic information.
Our SiamSeg method outperforms existing approaches, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-10-17T11:59:39Z)
- CLIP-Guided Source-Free Object Detection in Aerial Images [17.26407623526735]
High-resolution aerial images often require substantial storage space and may not be readily accessible to the public.
We propose a novel Source-Free Object Detection (SFOD) method to address these challenges.
To alleviate noisy labels in self-training, we utilize Contrastive Language-Image Pre-training (CLIP) to guide the generation of pseudo-labels.
By leveraging CLIP's zero-shot classification capability, we aggregate its scores with the originally predicted bounding boxes to obtain refined scores for the pseudo-labels (a hedged sketch of this scoring idea follows the list below).
arXiv Detail & Related papers (2024-01-10T14:03:05Z)
- TagCLIP: A Local-to-Global Framework to Enhance Open-Vocabulary Multi-Label Classification of CLIP Without Training [29.431698321195814]
Contrastive Language-Image Pre-training (CLIP) has demonstrated impressive capabilities in open-vocabulary classification.
However, CLIP shows poor performance on multi-label datasets because the global image feature tends to be dominated by the most prominent class.
We propose a local-to-global framework to obtain image tags.
arXiv Detail & Related papers (2023-12-20T08:15:40Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Realistic Unsupervised CLIP Fine-tuning with Universal Entropy Optimization [101.08992036691673]
This paper explores a realistic unsupervised fine-tuning scenario, considering the presence of out-of-distribution samples from unknown classes.
In particular, we focus on simultaneously enhancing out-of-distribution detection and the recognition of instances associated with known classes.
We present a simple, efficient, and effective approach called Universal Entropy Optimization (UEO).
arXiv Detail & Related papers (2023-08-24T16:47:17Z)
- ReCLIP: Refine Contrastive Language Image Pre-Training with Source Free Domain Adaptation [20.57370550156505]
ReCLIP is a source-free domain adaptation method for vision-language models.
We demonstrate ReCLIP reduces the average error rate of CLIP from 30.17% to 25.06% on 22 image classification benchmarks.
arXiv Detail & Related papers (2023-08-04T18:11:40Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
- SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption [72.35532598131176]
We propose SCARF, a technique for contrastive learning, where views are formed by corrupting a random subset of features.
We show that SCARF complements existing strategies and outperforms alternatives like autoencoders.
arXiv Detail & Related papers (2021-06-29T08:08:33Z)
- Joint Visual and Temporal Consistency for Unsupervised Domain Adaptive Person Re-Identification [64.37745443119942]
This paper jointly enforces visual and temporal consistency by combining a local one-hot classification and a global multi-class classification.
Experimental results on three large-scale ReID datasets demonstrate the superiority of the proposed method in both purely unsupervised and unsupervised domain adaptive ReID tasks.
arXiv Detail & Related papers (2020-07-21T14:31:27Z)
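As a small illustration of the CLIP-guided pseudo-label scoring described in the CLIP-Guided Source-Free Object Detection entry above, the sketch below mixes a detector's confidence with CLIP's zero-shot probability for each cropped box. Only the CLIP API calls (clip.load, clip.tokenize, encode_image, encode_text) are real; the crop-and-classify loop, the prompt template, the alpha weighting, and the refine_pseudo_scores helper are illustrative assumptions, not that paper's recipe.

```python
# Hedged sketch: refine pseudo-label scores by mixing detector confidence with
# CLIP's zero-shot class probability for the cropped box. The weighting and the
# crop-and-classify loop are illustrative assumptions, not the cited paper's recipe.
import torch
import clip                     # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)


def refine_pseudo_scores(image: Image.Image, boxes, det_scores, det_labels,
                         class_names, alpha: float = 0.5):
    """Return alpha * detector score + (1 - alpha) * CLIP zero-shot probability."""
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        refined = []
        for (x1, y1, x2, y2), score, label in zip(boxes, det_scores, det_labels):
            # Classify each predicted box crop with CLIP and read off the
            # probability of the box's pseudo-label class.
            crop = preprocess(image.crop((x1, y1, x2, y2))).unsqueeze(0).to(device)
            img_feat = model.encode_image(crop)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            probs = (100.0 * img_feat @ text_feat.T).softmax(dim=-1)
            refined.append(alpha * score + (1 - alpha) * probs[0, label].item())
    return refined
```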