iBOT: Image BERT Pre-Training with Online Tokenizer
- URL: http://arxiv.org/abs/2111.07832v1
- Date: Mon, 15 Nov 2021 15:18:05 GMT
- Title: iBOT: Image BERT Pre-Training with Online Tokenizer
- Authors: Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille,
Tao Kong
- Abstract summary: We study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer.
We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer.
We show the prominence of iBOT by achieving an 81.6% linear probing accuracy and an 86.3% fine-tuning accuracy evaluated on ImageNet-1K.
- Score: 23.997853010642046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The success of language Transformers is primarily attributed to the pretext
task of masked language modeling (MLM), where texts are first tokenized into
semantically meaningful pieces. In this work, we study masked image modeling
(MIM) and indicate the advantages and challenges of using a semantically
meaningful visual tokenizer. We present a self-supervised framework iBOT that
can perform masked prediction with an online tokenizer. Specifically, we
perform self-distillation on masked patch tokens and take the teacher network
as the online tokenizer, along with self-distillation on the class token to
acquire visual semantics. The online tokenizer is jointly learnable with the
MIM objective and dispenses with a multi-stage training pipeline where the
tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by
achieving an 81.6% linear probing accuracy and an 86.3% fine-tuning accuracy
evaluated on ImageNet-1K. Beyond the state-of-the-art image classification
results, we underline emerging local semantic patterns, which help the models
obtain strong robustness against common corruptions and achieve leading
results on dense downstream tasks, e.g., object detection, instance
segmentation, and semantic segmentation.
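
To make the training objective concrete, below is a minimal PyTorch-style sketch of the two self-distillation terms the abstract describes: one on the [CLS] token across augmented views and one on masked patch tokens, with an EMA teacher acting as the online tokenizer. Function names, tensor shapes, temperatures, and the EMA details are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between the sharpened teacher distribution and the
    student prediction over K prototype dimensions (last dim)."""
    teacher_probs = F.softmax(teacher_logits / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_logits / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1)

def ibot_style_loss(student_cls, teacher_cls, student_patch, teacher_patch, mask):
    """student_cls / teacher_cls:   (B, K) projections of the [CLS] token,
                                    taken from two different augmented views.
    student_patch / teacher_patch:  (B, N, K) projections of the N patch tokens
                                    of the same view; the student sees a masked
                                    image, the teacher sees the intact one.
    mask:                           (B, N) bool, True where a patch was masked."""
    # [CLS] self-distillation across views acquires image-level visual semantics.
    loss_cls = distill_loss(student_cls, teacher_cls).mean()
    # MIM term: for each masked patch, the student predicts the soft "token"
    # that the EMA teacher (the online tokenizer) assigns to the unmasked patch.
    loss_mim = distill_loss(student_patch, teacher_patch)        # (B, N)
    loss_mim = (loss_mim * mask).sum() / mask.sum().clamp(min=1)
    return loss_cls + loss_mim

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """The teacher is an exponential moving average of the student, which is
    what makes the tokenizer 'online': it is learned jointly with the MIM
    objective, with no separate tokenizer pre-training stage."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1 - momentum)
```

As in DINO-style self-distillation, the teacher outputs are typically also centered to prevent collapse, and multi-crop augmentation is commonly used; those details are omitted here for brevity.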
Related papers
- Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages an existing CLIP model to further enhance its vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z)
- Tokenize Anything via Prompting [65.93061853439512]
We present a unified, promptable model capable of simultaneously segmenting, recognizing, and captioning anything.
We train a generalizable model with massive segmentation masks, e.g., SA-1B masks, and semantic priors from a pre-trained CLIP model with 5 billion parameters.
We believe this model can be a versatile region-level image tokenizer, capable of encoding general-purpose region context.
arXiv Detail & Related papers (2023-12-14T17:01:02Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- Learning Hierarchical Image Segmentation For Recognition and By Recognition [39.712584686731574]
We propose to integrate a hierarchical segmenter into the recognition process, training and adapting the entire model solely on image-level recognition objectives.
We learn hierarchical segmentation for free alongside recognition, automatically uncovering part-to-whole relationships that not only underpin but also enhance recognition.
Notably, our model (trained on 1M unlabeled ImageNet images) outperforms SAM (trained on 11M images with masks) by an absolute 8% in mIoU on PartImageNet object segmentation.
arXiv Detail & Related papers (2022-10-01T16:31:44Z)
- MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining [138.86293836634323]
MaskCLIP incorporates a newly proposed masked self-distillation into contrastive language-image pretraining.
MaskCLIP achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder.
arXiv Detail & Related papers (2022-08-25T17:59:58Z)
- BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers [117.79456335844439]
We propose to use a semantic-rich visual tokenizer as the reconstruction target for masked prediction.
We then pretrain vision Transformers by predicting the original visual tokens for the masked image patches (see the generic sketch after this list).
Experiments on image classification and semantic segmentation show that our approach outperforms all compared MIM methods.
arXiv Detail & Related papers (2022-08-12T16:48:10Z)
- mc-BEiT: Multi-choice Discretization for Image BERT Pre-training [52.04866462439979]
Image BERT pre-training with masked image modeling (MIM) is a popular approach to self-supervised representation learning.
We introduce an improved BERT-style image pre-training method, mc-BEiT, which performs the MIM proxy task with eased and refined multi-choice training objectives.
arXiv Detail & Related papers (2022-03-29T09:08:18Z)
- Learning Representations by Predicting Bags of Visual Words [55.332200948110895]
Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
arXiv Detail & Related papers (2020-02-27T16:45:25Z)
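
In contrast to iBOT's jointly learned online tokenizer, the BEiT-style entries above (BEiT v2, mc-BEiT) pre-train a visual tokenizer beforehand and keep it fixed during masked image modeling, so the objective becomes a cross-entropy over discrete code indices at the masked positions. The sketch below is a generic illustration of that offline-tokenizer objective under assumed `tokenizer` and `student` modules; it is not the code of any of the listed papers.

```python
import torch
import torch.nn.functional as F

def offline_tokenizer_mim_loss(student, tokenizer, images, masked_images, mask):
    """images:        (B, C, H, W) clean inputs for the frozen, pre-trained tokenizer.
    masked_images:    the same batch with the masked patches blanked out or replaced,
                      fed to the student vision Transformer.
    mask:             (B, N) bool, True where a patch was masked.
    The student predicts, for every masked patch, the discrete code that the
    offline tokenizer assigned to the corresponding original patch."""
    with torch.no_grad():
        target_codes = tokenizer(images)      # (B, N) long: codebook index per patch
    logits = student(masked_images)           # (B, N, V) scores over the V codes
    return F.cross_entropy(logits[mask], target_codes[mask])
```

iBOT removes the separate tokenizer-training stage by replacing these fixed codebook targets with the soft distributions produced by the EMA teacher, as in the sketch after the abstract above.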