S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
- URL: http://arxiv.org/abs/2305.14095v2
- Date: Wed, 25 Oct 2023 14:49:23 GMT
- Title: S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
- Authors: Sangwoo Mo, Minkyu Kim, Kyungmin Lee, Jinwoo Shin
- Abstract summary: Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
- Score: 69.01985134519244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models, such as contrastive language-image pre-training
(CLIP), have demonstrated impressive results in natural image domains. However,
these models often struggle when applied to specialized domains like remote
sensing, and adapting to such domains is challenging due to the limited number
of image-text pairs available for training. To address this, we propose S-CLIP,
a semi-supervised learning method for training CLIP that utilizes additional
unpaired images. S-CLIP employs two pseudo-labeling strategies specifically
designed for contrastive learning and the language modality. The caption-level
pseudo-label is given by a combination of captions of paired images, obtained
by solving an optimal transport problem between unpaired and paired images. The
keyword-level pseudo-label is given by a keyword in the caption of the nearest
paired image, trained through partial label learning that assumes a candidate
set of labels for supervision instead of the exact one. By combining these
objectives, S-CLIP significantly enhances the training of CLIP using only a few
image-text pairs, as demonstrated in various specialist domains, including
remote sensing, fashion, scientific figures, and comics. For instance, S-CLIP
improves CLIP by 10% for zero-shot classification and 4% for image-text
retrieval on the remote sensing benchmark, matching the performance of
supervised CLIP while using three times fewer image-text pairs.
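The abstract describes the two pseudo-labeling objectives only at a high level. The sketch below shows one minimal way they could be instantiated in PyTorch on precomputed embeddings; the uniform-marginal Sinkhorn solver, the candidate-probability form of the partial-label loss, the temperature value, and every function name here are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch (assumptions, not the authors' code) of the two
# pseudo-labeling objectives described in the abstract, using PyTorch
# on precomputed image/text embeddings.
import torch
import torch.nn.functional as F


def sinkhorn(cost, eps=0.1, n_iters=100):
    """Entropic optimal transport with uniform marginals (assumed here)."""
    K = torch.exp(-cost / eps)                       # (U, P) kernel
    u = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)
    v = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    b = torch.ones_like(v)
    for _ in range(n_iters):
        a = u / (K @ b + 1e-16)
        b = v / (K.t() @ a + 1e-16)
    return a.unsqueeze(1) * K * b.unsqueeze(0)       # transport plan (U, P)


def caption_level_pseudo_labels(img_u, img_p):
    """Soft distribution over paired captions for each unpaired image."""
    img_u = F.normalize(img_u, dim=-1)               # (U, d) unpaired images
    img_p = F.normalize(img_p, dim=-1)               # (P, d) paired images
    plan = sinkhorn(1.0 - img_u @ img_p.t())         # cosine-distance cost
    return plan / plan.sum(dim=1, keepdim=True)      # row-normalized targets


def caption_level_loss(img_u, txt_p, soft_targets, temperature=0.07):
    """Contrastive loss of unpaired images vs. paired captions,
    supervised by the soft caption-level pseudo-labels."""
    logits = F.normalize(img_u, dim=-1) @ F.normalize(txt_p, dim=-1).t()
    log_probs = F.log_softmax(logits / temperature, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()


def keyword_level_loss(img_u, kw_txt, candidate_mask, temperature=0.07):
    """Partial-label-style loss: push each unpaired image toward the
    candidate keyword set of its nearest paired image (any keyword in
    the set may be the true one, so only their total probability is used)."""
    logits = F.normalize(img_u, dim=-1) @ F.normalize(kw_txt, dim=-1).t()
    probs = F.softmax(logits / temperature, dim=1)
    cand_prob = (probs * candidate_mask).sum(dim=1).clamp_min(1e-8)
    return -torch.log(cand_prob).mean()


if __name__ == "__main__":
    # Random features stand in for CLIP encoder outputs.
    U, P, Kw, d = 8, 16, 32, 512
    img_unpaired, img_paired = torch.randn(U, d), torch.randn(P, d)
    txt_paired, kw_txt = torch.randn(P, d), torch.randn(Kw, d)

    targets = caption_level_pseudo_labels(img_unpaired, img_paired)
    loss_cap = caption_level_loss(img_unpaired, txt_paired, targets)

    nearest = (F.normalize(img_unpaired, dim=-1)
               @ F.normalize(img_paired, dim=-1).t()).argmax(dim=1)
    caption_keywords = (torch.rand(P, Kw) < 0.1).float()   # toy keyword map
    loss_kw = keyword_level_loss(img_unpaired, kw_txt, caption_keywords[nearest])
    print(loss_cap.item(), loss_kw.item())
```

In the paper these terms are combined with the standard CLIP contrastive loss on the paired batch; the sketch omits that term, the encoders, and the training loop.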
Related papers
- Selective Vision-Language Subspace Projection for Few-shot CLIP [55.361337202198925]
We introduce a method called Selective Vision-Language Subspace Projection (SSP).
SSP incorporates local image features and utilizes them as a bridge to enhance the alignment between image-text pairs.
Our approach entails only training-free matrix calculations and can be seamlessly integrated into advanced CLIP-based few-shot learning frameworks.
arXiv Detail & Related papers (2024-07-24T03:45:35Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning [14.532939492926406]
We propose a prompt learning-based model called GOPro to overcome challenges of CLIP's contrastive loss and SSL's loss.
GOPro is trained end-to-end on all three loss objectives, combining the strengths of CLIP and SSL in a principled manner.
arXiv Detail & Related papers (2023-08-22T17:53:26Z)
- Multi-Label Self-Supervised Learning with Scene Images [21.549234013998255]
This paper shows that quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem.
The proposed method is named Multi-Label Self-supervised learning (MLS).
arXiv Detail & Related papers (2023-08-07T04:04:22Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning [55.77244064907146]
One-stage detector GridCLIP learns grid-level representations to adapt to the intrinsic principle of one-stage detection learning.
Experiments show that the learned CLIP-based grid-level representations boost the performance of undersampled (infrequent and novel) categories.
arXiv Detail & Related papers (2023-03-16T12:06:02Z)
- CLIPPO: Image-and-Language Understanding from Pixels Only [36.433133689137875]
We propose a pure pixel-based model to perform image, text, and multimodal tasks.
Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO).
When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks.
arXiv Detail & Related papers (2022-12-15T18:52:08Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)