Designing BERT for Convolutional Networks: Sparse and Hierarchical
Masked Modeling
- URL: http://arxiv.org/abs/2301.03580v2
- Date: Tue, 10 Jan 2023 08:02:09 GMT
- Title: Designing BERT for Convolutional Networks: Sparse and Hierarchical
Masked Modeling
- Authors: Keyu Tian, Yi Jiang, Qishuai Diao, Chen Lin, Liwei Wang and Zehuan
Yuan
- Abstract summary: We extend the success of BERT-style pre-training, or masked image modeling, to convolutional networks (convnets).
We treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode them.
This is the first use of sparse convolution for 2D masked modeling.
- Score: 23.164631160130092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We identify and overcome two key obstacles in extending the success of
BERT-style pre-training, or the masked image modeling, to convolutional
networks (convnets): (i) the convolution operation cannot handle irregular,
randomly masked input images; (ii) the single-scale nature of BERT pre-training
is inconsistent with a convnet's hierarchical structure. For (i), we treat
unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution
to encode. This is the first use of sparse convolution for 2D masked modeling.
For (ii), we develop a hierarchical decoder to reconstruct images from
multi-scale encoded features. Our method, called Sparse masKed modeling (SparK),
is general: it can be used directly on any convolutional model without backbone
modifications. We validate it on both classical (ResNet) and modern (ConvNeXt)
models: on three downstream tasks, it surpasses both state-of-the-art
contrastive learning and transformer-based masked modeling by similarly large
margins (around +1.0%). Improvements on object detection and instance
segmentation are more substantial (up to +3.5%), verifying the strong
transferability of the learned features. We also observe favorable scaling
behavior, with larger models gaining more. All this evidence reveals a
promising future of generative pre-training on convnets. Code and models are
released at https://github.com/keyu-tian/SparK.
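To make the two ideas in the abstract concrete, below is a minimal, self-contained PyTorch sketch of BERT-style masked image modeling on a small hierarchical convnet. It is an illustration only, not the authors' implementation: the real method encodes only the unmasked pixels with sparse convolution, whereas this sketch approximates that by zeroing masked pixels before dense convolutions, and the patch size, mask ratio, and tiny two-stage encoder/decoder are assumptions chosen for readability.

```python
# Minimal sketch (assumptions noted above): random patch masking, a two-stage
# conv encoder, and a hierarchical decoder that fuses multi-scale features.
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_patch_mask(batch, height, width, patch=32, mask_ratio=0.6):
    """Return a (B, 1, H, W) binary map where 1 marks kept (unmasked) pixels."""
    gh, gw = height // patch, width // patch
    keep = (torch.rand(batch, 1, gh, gw) > mask_ratio).float()
    return F.interpolate(keep, scale_factor=patch, mode="nearest")

class TinyHierarchicalMIM(nn.Module):
    """Two-stage conv encoder plus a hierarchical decoder with a skip connection."""
    def __init__(self, dim=32):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, dim, 3, 2, 1), nn.ReLU())        # H/2
        self.stage2 = nn.Sequential(nn.Conv2d(dim, 2 * dim, 3, 2, 1), nn.ReLU())  # H/4
        self.up2 = nn.ConvTranspose2d(2 * dim, dim, 2, 2)  # coarse -> fine scale
        self.up1 = nn.ConvTranspose2d(dim, dim, 2, 2)      # fine -> input scale
        self.head = nn.Conv2d(dim, 3, 1)                   # predict RGB pixels

    def forward(self, img, keep):
        x = img * keep                 # zero masked pixels (sparse conv would skip them)
        f1 = self.stage1(x)            # fine-scale features
        f2 = self.stage2(f1)           # coarse-scale features
        d = self.up2(f2) + f1          # fuse multi-scale encoded features
        return self.head(self.up1(d))  # reconstruct the full image

img = torch.randn(2, 3, 224, 224)
keep = random_patch_mask(2, 224, 224)
model = TinyHierarchicalMIM()
recon = model(img, keep)
# As in masked image modeling, the loss is computed only on masked regions.
loss = (F.mse_loss(recon, img, reduction="none") * (1 - keep)).mean()
loss.backward()
```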
Related papers
- HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training [21.444098313697044]
We propose a generative pre-training strategy based on masked image modeling and apply it to large-scale pre-training on medical images.
We employ a simple hierarchical decoder with skip-connections to achieve dense multi-scale feature reconstruction.
arXiv Detail & Related papers (2024-08-11T16:31:39Z) - SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage several different, relatively small, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Not All Image Regions Matter: Masked Vector Quantization for
Autoregressive Image Generation [78.13793505707952]
Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook.
We propose a novel two-stage framework, consisting of a Masked Quantization VAE (MQ-VAE) and a Stackformer, to relieve the model from modeling redundancy.
arXiv Detail & Related papers (2023-05-23T02:15:53Z) - Learning 3D Representations from 2D Pre-trained Models via
Image-to-Point Masked Autoencoders [52.91248611338202]
We propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE.
By self-supervised pre-training, we leverage the well learned 2D knowledge to guide 3D masked autoencoding.
I2P-MAE attains the state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transferable capacity.
arXiv Detail & Related papers (2022-12-13T17:59:20Z) - Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling an asymmetric encoder-decoder architecture with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z) - Locally Masked Convolution for Autoregressive Models [107.4635841204146]
LMConv is a simple modification to the standard 2D convolution that allows arbitrary masks to be applied to the weights at each location in the image.
We learn an ensemble of distribution estimators that share parameters but differ in generation order, achieving improved performance on whole-image density estimation.
arXiv Detail & Related papers (2020-06-22T17:59:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.