Bootstrapping SparseFormers from Vision Foundation Models
- URL: http://arxiv.org/abs/2312.01987v2
- Date: Thu, 4 Apr 2024 14:40:21 GMT
- Title: Bootstrapping SparseFormers from Vision Foundation Models
- Authors: Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou
- Abstract summary: We propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way.
The bootstrapped unimodal SparseFormer can reach 84.9% accuracy on IN-1K with only 49 tokens.
CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models.
- Score: 24.029898310518046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed SparseFormer architecture offers an alternative approach to visual understanding: by adjusting token RoIs, it uses a significantly smaller number of visual tokens, greatly reducing computational cost while still achieving promising performance. However, training SparseFormers from scratch remains expensive, and scaling up the number of parameters can be challenging. In this paper, we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are standard transformer blocks, we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore, we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In this way, we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g., IN-21K pre-trained AugRegs or CLIPs) using a rather small set of training samples (e.g., IN-1K), without labels or captions, within just a few hours. As a result, the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) reaches 84.9% accuracy on IN-1K with only 49 tokens, and the multimodal SparseFormer bootstrapped from CLIP also demonstrates notable zero-shot performance at greatly reduced computational cost, despite never seeing a caption during the bootstrapping procedure. In addition, CLIP-bootstrapped SparseFormers, which align the output space with language without seeing a word, can serve as efficient vision encoders in multimodal large language models. Code and models are available at https://github.com/showlab/sparseformer
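The abstract outlines a concrete recipe: inherit and freeze most pre-trained ViT blocks, train only the lightweight focusing transformer (plus a few early blocks), and align the sparse-token output with the frozen teacher using no labels or captions. Below is a minimal PyTorch-style sketch of that idea, not the authors' implementation: the module names, the cross-attention stand-in for the focusing transformer, the timm model name, and the cosine alignment loss are all illustrative assumptions; see the linked repository for the real code.

```python
# Minimal sketch of bootstrapping a SparseFormer-like student from a frozen ViT teacher.
# Everything named here (SparseFormerStudent, bootstrap_step, the timm model name) is an
# assumption for illustration, not the official SparseFormer API.
import torch
import torch.nn as nn
import timm  # assumed source of the pre-trained ViT weights


class SparseFormerStudent(nn.Module):
    """Hypothetical student: inherited ViT blocks plus a lightweight focusing stage."""

    def __init__(self, vit, num_tokens=49, tuned_early_blocks=2):
        super().__init__()
        self.patch_embed, self.blocks, self.norm = vit.patch_embed, vit.blocks, vit.norm
        self.latent = nn.Parameter(torch.randn(1, num_tokens, vit.embed_dim) * 0.02)
        # Stand-in for the SparseFormer focusing transformer that adjusts token RoIs:
        # the sparse latent tokens cross-attend to coarse image features.
        self.focusing = nn.TransformerDecoderLayer(vit.embed_dim, nhead=8, batch_first=True)
        # Freeze everything inherited from the pre-trained ViT ...
        for module in (self.patch_embed, self.blocks, self.norm):
            for p in module.parameters():
                p.requires_grad = False
        # ... except a few early blocks, fine-tuned to align the final representation.
        for blk in self.blocks[:tuned_early_blocks]:
            for p in blk.parameters():
                p.requires_grad = True

    def forward(self, images):                        # images: (B, 3, 384, 384)
        feats = self.patch_embed(images)              # coarse image features (B, N, C)
        x = self.latent.expand(images.size(0), -1, -1)
        x = self.focusing(x, feats)                   # sparse tokens gather image content
        for blk in self.blocks:                       # reuse the inherited ViT blocks
            x = blk(x)
        return self.norm(x).mean(dim=1)               # pooled sparse-token representation


# Frozen teacher: the same kind of pre-trained ViT, kept intact to provide targets.
teacher = timm.create_model("vit_large_patch16_384", pretrained=True).eval()
for p in teacher.parameters():
    p.requires_grad = False

student = SparseFormerStudent(timm.create_model("vit_large_patch16_384", pretrained=True))
optimizer = torch.optim.AdamW(
    [p for p in student.parameters() if p.requires_grad], lr=1e-4)


def bootstrap_step(images):
    """One label-free bootstrapping step: match the teacher's pooled representation."""
    with torch.no_grad():
        target = teacher.forward_features(images).mean(dim=1)
    loss = 1 - nn.functional.cosine_similarity(student(images), target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the focusing stage, the latent tokens, and a couple of early blocks receive gradients, the trainable parameter count stays small, which is what makes the bootstrapping finish in a few hours on IN-1K-scale data in the paper's setting.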
Related papers
- SparseLGS: Sparse View Language Embedded Gaussian Splatting [49.187761358726675]
We propose SparseLGS to address the challenge of 3D scene understanding with pose-free and sparse view input images.
Our method leverages a learning-based dense stereo model to handle pose-free and sparse inputs.
Experimental results show that SparseLGS achieves comparable quality when reconstructing semantic fields with fewer inputs.
arXiv Detail & Related papers (2024-12-03T08:18:56Z) - Patch-Level Training for Large Language Models [69.67438563485887]
This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
arXiv Detail & Related papers (2024-07-17T15:48:39Z) - ReALLM: A general framework for LLM compression and fine-tuning [11.738510106847414]
ReALLM is a novel approach for compression and memory-efficient adaptation of pre-trained language models.
A weight-only quantization algorithm yields the best results on language generation tasks (C4 and WikiText-2) for a budget of 3 bits without any training.
arXiv Detail & Related papers (2024-05-21T18:50:51Z) - SparseFormer: Sparse Visual Recognition via Limited Latent Tokens [30.494412497158237]
We present a new method, coined SparseFormer, to imitate humans' sparse visual recognition in an end-to-end manner.
SparseFormer circumvents most of the dense operations on the image space and has much lower computational cost.
Experiments on the ImageNet classification benchmark dataset show that SparseFormer achieves performance on par with canonical or well-established models.
arXiv Detail & Related papers (2023-04-07T17:59:58Z) - FlexiViT: One Model for All Patch Sizes [100.52574011880571]
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost.
We show that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes.
arXiv Detail & Related papers (2022-12-15T18:18:38Z) - SdAE: Self-distillated Masked Autoencoder [95.3684955370897]
This paper proposes SdAE, a self-distillated masked autoencoder network.
With only 300 epochs of pre-training, a vanilla ViT-Base model achieves 84.1% fine-tuning accuracy on ImageNet-1k classification.
arXiv Detail & Related papers (2022-07-31T15:07:25Z) - PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers [102.7922200135147]
This paper explores a better codebook for BERT pre-training of vision transformers.
By contrast, the discrete tokens in the NLP field are naturally highly semantic.
We demonstrate that the visual tokens generated by the proposed perceptual codebook do exhibit better semantic meanings.
arXiv Detail & Related papers (2021-11-24T18:59:58Z) - Sentence Bottleneck Autoencoders from Transformer Language Models [53.350633961266375]
We build a sentence-level autoencoder from a pretrained, frozen transformer language model.
We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder.
We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer, and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.
arXiv Detail & Related papers (2021-08-31T19:39:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.