Combined Scaling for Zero-shot Transfer Learning
- URL: http://arxiv.org/abs/2111.10050v3
- Date: Wed, 12 Apr 2023 08:26:28 GMT
- Title: Combined Scaling for Zero-shot Transfer Learning
- Authors: Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu,
Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing
Tan, Quoc V. Le
- Abstract summary: We present a combined scaling method - named BASIC - that achieves 85.7% top-1 accuracy on the ImageNet ILSVRC-2012 validation set.
This accuracy surpasses the best published similar models - CLIP and ALIGN - by 9.3%.
Our model also shows significant improvements in robustness benchmarks.
- Score: 146.0851484769142
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a combined scaling method - named BASIC - that achieves 85.7%
top-1 accuracy on the ImageNet ILSVRC-2012 validation set without learning from
any labeled ImageNet example. This accuracy surpasses the best published similar
models - CLIP and ALIGN - by 9.3%. Our BASIC model also shows significant
improvements in robustness benchmarks. For instance, on 5 test sets with
natural distribution shifts such as ImageNet-{A,R,V2,Sketch} and ObjectNet, our
model achieves 84.3% top-1 average accuracy, only a small drop from its
original ImageNet accuracy. To achieve these results, we scale up the
contrastive learning framework of CLIP and ALIGN in three dimensions: data
size, model size, and batch size. Our dataset has 6.6B noisy image-text pairs,
which is 4x larger than ALIGN, and 16x larger than CLIP. Our largest model has
3B weights, which is 3.75x larger in parameters and 8x larger in FLOPs than
ALIGN and CLIP. Finally, our batch size is 65536, which is 2x more than CLIP
and 4x more than ALIGN. We encountered two main challenges with the scaling
rules of BASIC. First, implementing the combined scaling rules of BASIC runs up
against the limited memory of accelerators such as GPUs and TPUs. To overcome
this memory limit, we propose two simple methods that make use of gradient
checkpointing and model parallelism. Second, while increasing the dataset size
and the model size has been the de facto method to improve the
performance of deep learning models like BASIC, the effect of a large
contrastive batch size on such contrastive-trained image-text models is not
well-understood. To shed light on the benefits of large contrastive batch
sizes, we develop a theoretical framework which shows that larger contrastive
batch sizes lead to smaller generalization gaps for image-text models such as
BASIC.
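To make the training setup concrete, below is a minimal JAX sketch of a CLIP/ALIGN-style contrastive objective of the kind BASIC scales up, together with gradient checkpointing as one way to fit large encoders and batches in accelerator memory. The encoder functions, parameter names, shapes, and temperature value are illustrative assumptions, not BASIC's actual implementation.

```python
import jax
import jax.numpy as jnp

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of image/text embedding pairs
    (the larger the batch, the more in-batch negatives each pair sees)."""
    # L2-normalize so the dot products are cosine similarities.
    img_emb = img_emb / jnp.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb = txt_emb / jnp.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img_emb @ txt_emb.T / temperature            # [batch, batch]
    idx = jnp.arange(logits.shape[0])                     # matching pairs lie on the diagonal
    loss_i2t = -jnp.mean(jax.nn.log_softmax(logits, axis=1)[idx, idx])
    loss_t2i = -jnp.mean(jax.nn.log_softmax(logits, axis=0)[idx, idx])
    return 0.5 * (loss_i2t + loss_t2i)

# Gradient checkpointing (rematerialization) trades extra compute for memory:
# activations inside the wrapped function are recomputed during the backward
# pass instead of being stored.
@jax.checkpoint
def image_encoder(params, images):
    # Placeholder: a single linear projection stands in for a large vision tower.
    return images @ params["w_img"]

@jax.checkpoint
def text_encoder(params, tokens):
    # Placeholder: a single linear projection stands in for a large text tower.
    return tokens @ params["w_txt"]

def batch_loss(params, images, tokens):
    return contrastive_loss(image_encoder(params, images),
                            text_encoder(params, tokens))

# Example usage with random placeholder features.
key = jax.random.PRNGKey(0)
params = {"w_img": jax.random.normal(key, (2048, 512)) * 0.02,
          "w_txt": jax.random.normal(key, (1024, 512)) * 0.02}
images = jax.random.normal(key, (256, 2048))   # stand-in image features
tokens = jax.random.normal(key, (256, 1024))   # stand-in text features
grads = jax.grad(batch_loss)(params, images, tokens)  # activations rematerialized
```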
Related papers
- Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models [111.97026994761254]
Mixture-of-Transformers (MoT) is a sparse multi-modal transformer architecture.
MoT decouples non-embedding parameters of the model by modality.
We evaluate MoT across multiple settings and model scales.
arXiv Detail & Related papers (2024-11-07T18:59:06Z)
- Speeding Up Image Classifiers with Little Companions [5.9999780224657195]
Scaling up neural networks has been a key recipe for the success of large language and vision models.
We develop a simple two-pass Little-Big model that first uses a lightweight "little" model to make predictions for all samples (an illustrative sketch of such a two-pass cascade appears after this list).
Little-Big also speeds up the InternImage-G-512 while achieving 90% ImageNet-1K top-1 accuracy.
arXiv Detail & Related papers (2024-06-24T20:11:46Z)
- FinGPT-HPC: Efficient Pretraining and Finetuning Large Language Models for Financial Applications with High-Performance Computing [10.47214968497857]
We present high-performance methods that exploit low-rank structures to pretrain and finetune large language models.
Our methods achieve a speedup of 1.3X and a model compression ratio of 2.64X for pretraining without an accuracy drop.
For finetuning, our methods achieve an average accuracy increase of 6.3% on general tasks and 24.0% on financial tasks.
arXiv Detail & Related papers (2024-02-21T05:03:17Z)
- Joint Adaptive Representations for Image-Language Learning [59.40890927221377]
We propose a recipe for image-language learning that produces effective models which outperform bigger and more expensive ones, often ones trained on orders of magnitude larger datasets.
Our key finding is the joint learning of a compact vision and language representation, which adaptively and iteratively fuses the multi-modal features.
With only 40M training examples and 39 GFLOPs, our lightweight model outperforms state-of-the-art models that are many times larger, use 2-20x more FLOPs, and are trained on bigger datasets, some with close to 1B training examples.
arXiv Detail & Related papers (2023-05-31T15:02:02Z)
- Sigmoid Loss for Language Image Pre-Training [93.91385557929604]
We propose a simple pairwise Sigmoid loss for Language-Image Pre-training (SigLIP).
The sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization; a minimal sketch of such a pairwise loss appears after this list.
Combined with Locked-image Tuning, with only four TPUv4 chips, we train a SigLiT model that achieves 84.5% ImageNet zero-shot accuracy in two days.
arXiv Detail & Related papers (2023-03-27T15:53:01Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- Improving Zero-shot Generalization and Robustness of Multi-modal Models [70.14692320804178]
Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks.
We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts.
We propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy.
arXiv Detail & Related papers (2022-12-04T07:26:24Z)
- SimMIM: A Simple Framework for Masked Image Modeling [29.015777125540613]
This paper presents SimMIM, a simple framework for masked image modeling.
We study the major components of our framework and find that simple designs for each component yield very strong representation learning performance.
We also leverage this approach to facilitate the training of a 3B model; using $40\times$ less data than in previous practice, we achieve state-of-the-art results on four representative vision benchmarks.
arXiv Detail & Related papers (2021-11-18T18:59:45Z)
- Scalable and Practical Natural Gradient for Large-Scale Deep Learning [19.220930193896404]
SP-NGD scales to large mini-batch sizes with a negligible computational overhead as compared to first-order methods.
We demonstrate convergence to a top-1 validation accuracy of 75.4% in 5.5 minutes using a mini-batch size of 32,768 with 1,024 GPUs, as well as an accuracy of 74.9% with an extremely large mini-batch size of 131,072 in 873 steps of SP-NGD.
arXiv Detail & Related papers (2020-02-13T11:55:37Z)
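The SigLIP entry above describes a pairwise sigmoid loss that scores each image-text pair independently, so no batch-wide softmax normalization is needed. Below is a minimal JAX sketch of such a loss; the normalization, the treatment of the learnable temperature and bias, and all names are illustrative assumptions rather than the paper's code.

```python
import jax
import jax.numpy as jnp

def sigmoid_pair_loss(img_emb, txt_emb, temperature, bias):
    """Pairwise sigmoid loss: every image-text pair gets an independent
    binary log-sigmoid term (+1 for matching pairs, -1 otherwise)."""
    img_emb = img_emb / jnp.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt_emb = txt_emb / jnp.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img_emb @ txt_emb.T * temperature + bias      # [batch, batch]
    labels = 2.0 * jnp.eye(logits.shape[0]) - 1.0          # +1 diagonal, -1 off-diagonal
    # Sum the per-pair terms for each image, then average over the batch.
    return -jnp.mean(jnp.sum(jax.nn.log_sigmoid(labels * logits), axis=1))
```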
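The "Speeding Up Image Classifiers with Little Companions" entry above describes a two-pass little/big scheme. The sketch below shows one common way such a cascade is built, routing on the little model's confidence; the threshold rule and all names are assumptions for illustration, not necessarily the paper's exact mechanism.

```python
import jax
import jax.numpy as jnp

def cascade_predict(little_fn, big_fn, images, threshold=0.9):
    """Two-pass cascade: keep the little model's confident predictions and
    fall back to the big model for the rest."""
    little_logits = little_fn(images)                          # [batch, classes]
    confidence = jnp.max(jax.nn.softmax(little_logits, axis=-1), axis=-1)
    confident = confidence >= threshold                        # [batch] bool mask
    # For simplicity this sketch runs the big model on the whole batch; a real
    # implementation would run it only on the uncertain subset to save compute.
    big_logits = big_fn(images)
    logits = jnp.where(confident[:, None], little_logits, big_logits)
    return jnp.argmax(logits, axis=-1), confident
```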