When Do We Not Need Larger Vision Models?
- URL: http://arxiv.org/abs/2403.13043v2
- Date: Thu, 18 Jul 2024 02:54:35 GMT
- Title: When Do We Not Need Larger Vision Models?
- Authors: Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, Trevor Darrell
- Abstract summary: Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations.
We demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model can outperform larger models.
We release a Python package that can apply S$^2$ on any vision model with one line of code.
- Score: 55.957626371697785
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations. In this work, we discuss the point beyond which larger vision models are not necessary. First, we demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model (e.g., ViT-B or ViT-L), run over multiple image scales, can outperform larger models (e.g., ViT-H or ViT-G) on classification, segmentation, depth estimation, Multimodal LLM (MLLM) benchmarks, and robotic manipulation. Notably, S$^2$ achieves state-of-the-art performance in detailed understanding of MLLM on the V* benchmark, surpassing models such as GPT-4V. We examine the conditions under which S$^2$ is a preferred scaling approach compared to scaling on model size. While larger models have the advantage of better generalization on hard examples, we show that features of larger vision models can be well approximated by those of multi-scale smaller models. This suggests most, if not all, of the representations learned by current large pre-trained models can also be obtained from multi-scale smaller models. Our results show that a multi-scale smaller model has comparable learning capacity to a larger model, and pre-training smaller models with S$^2$ can match or even exceed the advantage of larger models. We release a Python package that can apply S$^2$ on any vision model with one line of code: https://github.com/bfshi/scaling_on_scales.
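The released Python package (https://github.com/bfshi/scaling_on_scales) wraps S$^2$ behind a single call. As a rough illustration of the underlying idea only (not the package's API), the following PyTorch sketch assumes a frozen `backbone` that maps a batch of base-resolution images to a spatial feature map; the helper name `s2_features` and the pooling choices are hypothetical.

```python
import torch
import torch.nn.functional as F

def s2_features(backbone, images, scales=(1, 2), base_size=224):
    """Minimal sketch of Scaling on Scales (S^2): run a frozen smaller model
    over several image scales and concatenate the per-scale features."""
    outputs = []
    for s in scales:
        size = base_size * s
        x = F.interpolate(images, size=(size, size), mode="bilinear", align_corners=False)
        # Split the up-scaled image into an s x s grid of base-size crops.
        crops = x.unfold(2, base_size, base_size).unfold(3, base_size, base_size)
        crops = crops.permute(0, 2, 3, 1, 4, 5).reshape(-1, x.shape[1], base_size, base_size)
        with torch.no_grad():                      # the smaller vision model stays frozen
            feats = backbone(crops)                # assumed shape: [B * s * s, C, h, w]
        b = images.shape[0]
        c, h, w = feats.shape[1:]
        # Stitch crop features back into one large map, then pool to the base grid.
        feats = feats.reshape(b, s, s, c, h, w).permute(0, 3, 1, 4, 2, 5).reshape(b, c, s * h, s * w)
        outputs.append(F.adaptive_avg_pool2d(feats, (h, w)))
    return torch.cat(outputs, dim=1)               # channel-wise concatenation across scales
```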
Related papers
- What Matters for Model Merging at Scale? [94.26607564817786]
Model merging aims to combine multiple expert models into a more capable single model.
Previous studies have primarily focused on merging a few small models.
This study systematically evaluates the utility of model merging at scale.
arXiv Detail & Related papers (2024-10-04T17:17:19Z)
- Large Language Model Pruning [0.0]
We suggest a model pruning technique specifically focused on LLMs.
The proposed methodology emphasizes the explainability of deep learning models.
We also explore the difference between pruning on large-scale models vs. pruning on small-scale models.
arXiv Detail & Related papers (2024-05-24T18:22:15Z)
- STU-Net: Scalable and Transferable Medical Image Segmentation Models Empowered by Large-Scale Supervised Pre-training [43.04882328763337]
We design a series of scalable U-Net (STU-Net) models, with parameter sizes ranging from 14 million to 1.4 billion.
We train our scalable STU-Net models on a large-scale TotalSegmentator dataset and find that increasing model size brings a stronger performance gain.
We observe good performance of our pre-trained model in both direct inference and fine-tuning.
arXiv Detail & Related papers (2023-04-13T17:59:13Z)
- Specializing Smaller Language Models towards Multi-Step Reasoning [56.78474185485288]
We show that multi-step reasoning abilities can be distilled from GPT-3.5 ($\ge$ 175B) to T5 variants ($\le$ 11B).
We propose model specialization to specialize the model's ability towards a target task.
arXiv Detail & Related papers (2023-01-30T08:51:19Z)
- Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z)
- ScaleNet: Searching for the Model to Scale [44.05380012545087]
We propose ScaleNet to jointly search for the base model and the scaling strategy.
We show that our scaled networks achieve significantly better performance across a range of FLOP budgets.
arXiv Detail & Related papers (2022-07-15T03:16:43Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an extremely large number of parameters at constant computational cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models (a rough routing sketch appears after this list).
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, reaching higher accuracy while requiring fewer total FLOPs to compute.
This suggests that exploiting output diversity through ensembling can often be more efficient than training larger models; a rough cost comparison is sketched below the list.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
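As a purely illustrative cost comparison for the ensembling entry above: the FLOP figures below are approximate, commonly quoted values for ViT variants at 224x224 input, not numbers from that paper, and real comparisons should use measured costs.

```python
# Approximate, commonly quoted inference costs at 224x224 input (illustrative only).
FLOPS = {"ViT-B/16": 17.6e9, "ViT-L/16": 61.6e9, "ViT-H/14": 167e9}

def ensemble_flops(members):
    """Total inference cost of an ensemble that averages its members' outputs."""
    return sum(FLOPS[m] for m in members)

ensemble = ["ViT-B/16"] * 3                        # three independently trained ViT-B models
print(f"3x ViT-B ensemble: {ensemble_flops(ensemble) / 1e9:.1f} GFLOPs")   # ~52.8 GFLOPs
print(f"single ViT-H/14:   {FLOPS['ViT-H/14'] / 1e9:.1f} GFLOPs")          # ~167.0 GFLOPs
```

The ensemble runs at roughly a third of the single large model's cost; the paper's observation is that such ensembles can also be more accurate.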
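For the expert-prototyping entry above, the following is a minimal sketch of what $k$ top-$1$ routing over expert groups could look like; the layer sizes, gating networks, and weighting scheme are illustrative assumptions rather than that paper's implementation.

```python
import torch
import torch.nn as nn

class ExpertPrototypingMoE(nn.Module):
    """Illustrative sketch: experts are split into k groups ("prototypes"),
    a top-1 router picks one expert per group, and the k selected expert
    outputs are summed."""

    def __init__(self, d_model=256, d_hidden=512, num_experts=8, k=2):
        super().__init__()
        assert num_experts % k == 0
        self.k, self.group_size = k, num_experts // k
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])
        self.gates = nn.ModuleList([nn.Linear(d_model, self.group_size) for _ in range(k)])

    def forward(self, x):                          # x: [num_tokens, d_model]
        out = torch.zeros_like(x)
        for g, gate in enumerate(self.gates):      # one top-1 router per prototype
            scores = gate(x).softmax(dim=-1)       # [num_tokens, group_size]
            top1 = scores.argmax(dim=-1)           # chosen expert index within group g
            for e in range(self.group_size):
                sel = top1 == e
                if sel.any():
                    expert = self.experts[g * self.group_size + e]
                    out[sel] = out[sel] + scores[sel, e].unsqueeze(-1) * expert(x[sel])
        return out

tokens = torch.randn(32, 256)                      # 32 tokens, d_model = 256
print(ExpertPrototypingMoE()(tokens).shape)        # torch.Size([32, 256])
```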
This list is automatically generated from the titles and abstracts of the papers on this site.