Bootstrapping SparseFormers from Vision Foundation Models
- URL: http://arxiv.org/abs/2312.01987v2
- Date: Thu, 4 Apr 2024 14:40:21 GMT
- Title: Bootstrapping SparseFormers from Vision Foundation Models
- Authors: Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou
- Abstract summary: We propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.
The bootstrapped unimodal SparseFormer reaches 84.9% accuracy on IN-1K with only 49 tokens.
CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models.
- Score: 24.029898310518046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by using significantly fewer visual tokens with adjustable RoIs, greatly reducing computational costs while still achieving promising performance. However, training SparseFormers from scratch is still expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are standard transformer blocks, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer, which adjusts token RoIs, and fine-tune a few early pre-trained blocks to align the final token representation. In this way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a relatively small number of training samples (e.g., IN-1K), without labels or captions, within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) reaches 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer bootstrapped from CLIP also demonstrates notable zero-shot performance at a greatly reduced computational cost, without seeing any caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align their output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer
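To make the bootstrapping recipe above concrete, the following is a minimal PyTorch-style sketch of the freeze-and-tune setup it describes: the standard transformer blocks are inherited from a pre-trained ViT and kept frozen, while only the lightweight focusing transformer (which adjusts token RoIs) and a few early blocks are trained. The class and function names, the number of tunable blocks, and the cosine alignment loss are illustrative assumptions, not the released implementation (see the linked repository for the actual code).

```python
# Minimal sketch (not the released code): inherit standard transformer blocks
# from a pre-trained ViT, freeze them, and train only the SparseFormer-specific
# focusing transformer plus a few early blocks, as described in the abstract.
import torch
import torch.nn as nn


class BootstrappedSparseFormer(nn.Module):
    def __init__(self, vit_blocks: nn.ModuleList, focusing_transformer: nn.Module,
                 num_tunable_early_blocks: int = 2):
        super().__init__()
        self.focusing = focusing_transformer   # lightweight, adjusts token RoIs; trained
        self.blocks = vit_blocks               # standard blocks inherited from the ViT

        # Freeze all inherited weights, then unfreeze only the few early blocks
        # that are fine-tuned to align the final token representation.
        for p in self.blocks.parameters():
            p.requires_grad_(False)
        for blk in self.blocks[:num_tunable_early_blocks]:
            for p in blk.parameters():
                p.requires_grad_(True)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # The focusing transformer reduces the image to a small set of latent
        # tokens (e.g., 49) by sampling features inside adjustable RoIs.
        tokens = self.focusing(images)         # (B, num_tokens, C)
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens


def bootstrap_loss(student_tokens: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    # Label- and caption-free alignment of the final token representation with
    # the frozen teacher ViT's output; cosine distance is one plausible choice,
    # and the exact objective used in the paper may differ.
    s = nn.functional.normalize(student_tokens.mean(dim=1), dim=-1)
    t = nn.functional.normalize(teacher_feats, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()
```

Under this setup, only the parameters with requires_grad=True (the focusing transformer and the first few blocks) would be handed to the optimizer, which is what keeps the bootstrapping procedure cheap enough to run on IN-1K-scale data within a few hours.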
Related papers
- Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z)
- ReALLM: A general framework for LLM compression and fine-tuning [11.738510106847414]
ReALLM is a novel approach for compression and memory-efficient adaptation of pre-trained language models.
Its weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of 3 bits without any training.
arXiv Detail & Related papers (2024-05-21T18:50:51Z)
- SparseFormer: Sparse Visual Recognition via Limited Latent Tokens [30.494412497158237]
We present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner.
SparseFormer circumvents most dense operations on the image space and has much lower computational costs.
Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models.
arXiv Detail & Related papers (2023-04-07T17:59:58Z)
- FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
arXiv Detail & Related papers (2022-12-15T18:18:38Z)
- Nonparametric Masked Language Modeling [113.71921977520864]
Existing language models (LMs) predict tokens with a softmax over a finite vocabulary.
We introduce NPM, the first nonparametric masked language model that replaces this softmax with a nonparametric distribution over every phrase in a reference corpus.
NPM can be efficiently trained with a contrastive objective and an in-batch approximation to full corpus retrieval.
arXiv Detail & Related papers (2022-12-02T18:10:42Z)
- SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs pre-training, a vanilla ViT-Base model achieves an 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z)
- PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers.
By contrast, the discrete tokens in the NLP field are naturally highly semantic.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z)
- Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
- How fine can fine-tuning be? Learning efficient language models [8.25186900320093]
Given a language model pre-trained on massive unlabeled text corpora, only very light supervised fine-tuning is needed to learn a task.
We show that it suffices to fine-tune only the most critical layers.
As a result, fine-tuning of huge language models can be achieved by simply setting a certain number of entries in certain layers of the pre-trained parameters to zero.
arXiv Detail & Related papers (2020-04-24T20:31:28Z)