Sigmoid Loss for Language Image Pre-Training
- URL: http://arxiv.org/abs/2303.15343v4
- Date: Wed, 27 Sep 2023 12:05:41 GMT
- Title: Sigmoid Loss for Language Image Pre-Training
- Authors: Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer
- Abstract summary: We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP)
The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization.
Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days.
- Score: 93.91385557929604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a simple pairwise Sigmoid loss for Language-Image Pre-training
(SigLIP). Unlike standard contrastive learning with softmax normalization, the
sigmoid loss operates solely on image-text pairs and does not require a global
view of the pairwise similarities for normalization. The sigmoid loss
simultaneously allows further scaling up the batch size, while also performing
better at smaller batch sizes. Combined with Locked-image Tuning, with only
four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet
zero-shot accuracy in two days. The disentanglement of the batch size from the
loss further allows us to study the impact of examples vs pairs and negative to
positive ratio. Finally, we push the batch size to the extreme, up to one
million, and find that the benefits of growing batch size quickly diminish,
with a more reasonable batch size of 32k being sufficient. We release our
models at https://github.com/google-research/big_vision and hope our research
motivates further explorations in improving the quality and efficiency of
language-image pre-training.
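To make the loss concrete, here is a minimal JAX sketch of the pairwise sigmoid loss described above (the function and argument names are illustrative assumptions; the authors' actual implementation is in the released big_vision code):

```python
# Minimal sketch of a pairwise sigmoid image-text loss (illustrative, not the
# authors' exact code). Embeddings are assumed to be L2-normalized.
import jax.numpy as jnp
from jax.nn import log_sigmoid

def sigmoid_pairwise_loss(img_emb, txt_emb, temperature, bias):
    """img_emb, txt_emb: arrays of shape (n, d); temperature, bias: learnable scalars."""
    n = img_emb.shape[0]
    logits = temperature * img_emb @ txt_emb.T + bias   # (n, n) pairwise similarities
    labels = 2.0 * jnp.eye(n) - 1.0                     # +1 for matching pairs, -1 otherwise
    # Each image-text pair is an independent binary classification,
    # so no softmax normalization over the whole batch is required.
    return -jnp.sum(log_sigmoid(labels * logits)) / n
```

Because every pair is scored independently, the loss needs no global view of the similarity matrix for normalization, which is what allows the batch size to be scaled up or down freely.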
Related papers
- ELIP: Efficient Language-Image Pre-training with Fewer Vision Tokens [75.09406436851445]
We propose ELIP, a vision token pruning and merging method that removes less influential tokens based on the supervision of language outputs.
Our experiments demonstrate that, with 30% of vision tokens removed across 12 ViT layers, ELIP maintains performance comparable to the baseline.
arXiv Detail & Related papers (2023-09-28T05:31:07Z) - A Recipe for Efficient SBIR Models: Combining Relative Triplet Loss with
Batch Normalization and Knowledge Distillation [3.364554138758565]
Sketch-Based Image Retrieval (SBIR) is a crucial task in multimedia retrieval, where the goal is to retrieve a set of images that match a given sketch query.
We introduce a Relative Triplet Loss (RTL), an adapted triplet loss that overcomes limitations of the standard formulation by weighting the loss according to anchor similarity.
We propose a straightforward approach to train small models efficiently through knowledge distillation, with only a marginal loss of accuracy.
arXiv Detail & Related papers (2023-05-30T12:41:04Z) - Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR [103.51937218213774]
This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that surpasses the prior state of the art by 11%.
We propose a simple modification to the standard triplet loss that explicitly enforces separation amongst photo/sketch instances.
We employ an intra-modal triplet loss amongst sketches to pull sketches of the same instance closer while pushing other instances away, and another amongst photos to push apart different photo instances.
arXiv Detail & Related papers (2023-03-24T03:34:33Z) - Scaling Language-Image Pre-training via Masking [63.36988191660858]
Fast Language-Image Pre-training (FLIP) is a simple and more efficient method for training CLIP.
Masking allows us to learn from more image-text pairs given the same wall-clock time.
FLIP consistently outperforms its CLIP counterparts trained on the same data (a toy sketch of the masking idea is included after this list).
arXiv Detail & Related papers (2022-12-01T18:59:57Z) - Combined Scaling for Zero-shot Transfer Learning [146.0851484769142]
We present a combined scaling method, named BASIC, that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set.
This accuracy surpasses the best published similar models, CLIP and ALIGN, by 9.3%.
Our model also shows significant improvements in robustness benchmarks.
arXiv Detail & Related papers (2021-11-19T05:25:46Z) - End-to-End Supermask Pruning: Learning to Prune Image Captioning Models [17.00974730372399]
We show that an 80% to 95% sparse network can either match or outperform its dense counterpart.
The code and pre-trained models for Up-Down and Object Relation Transformer achieve CIDEr scores above 120 on the MS-COCO dataset.
arXiv Detail & Related papers (2021-10-07T09:34:00Z) - EqCo: Equivalent Rules for Self-supervised Contrastive Learning [81.45848885547754]
We propose a method that makes self-supervised learning insensitive to the number of negative samples in InfoNCE-based contrastive learning frameworks.
Inspired by the InfoMax principle, we point out that the margin term in the contrastive loss needs to be adaptively scaled according to the number of negative pairs.
arXiv Detail & Related papers (2020-10-05T11:39:04Z) - Scalable and Practical Natural Gradient for Large-Scale Deep Learning [19.220930193896404]
SP-NGD scales to large mini-batch sizes with negligible computational overhead compared to first-order methods.
We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
arXiv Detail & Related papers (2020-02-13T11:55:37Z)