Improving Pixel-based MIM by Reducing Wasted Modeling Capability
- URL: http://arxiv.org/abs/2308.00261v1
- Date: Tue, 1 Aug 2023 03:44:56 GMT
- Title: Improving Pixel-based MIM by Reducing Wasted Modeling Capability
- Authors: Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin
- Abstract summary: We propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction.
To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures.
Our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
- Score: 77.99468514275185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There has been significant progress in Masked Image Modeling (MIM). Existing
MIM methods can be broadly categorized into two groups based on the
reconstruction target: pixel-based and tokenizer-based approaches. The former
offers a simpler pipeline and lower computational cost, but it is known to be
biased toward high-frequency details. In this paper, we provide a set of
empirical studies to confirm this limitation of pixel-based MIM and propose a
new method that explicitly utilizes low-level features from shallow layers to
aid pixel reconstruction. By incorporating this design into our base method,
MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its
convergence and achieving non-trivial improvements across various downstream
tasks. To the best of our knowledge, we are the first to systematically
investigate multi-level feature fusion for isotropic architectures like the
standard Vision Transformer (ViT). Notably, when applied to a smaller model
(e.g., ViT-S), our method yields significant performance gains, such as 1.2%
on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
Code and models are available at https://github.com/open-mmlab/mmpretrain.
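As a concrete illustration of the core idea, here is a minimal PyTorch sketch of multi-level feature fusion for an isotropic encoder: hidden states tapped from several (including shallow) ViT layers are projected and mixed with learned softmax weights before pixel decoding. The tapped layers, per-layer projections, and module names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Fuse hidden states from several encoder layers with learned weights.

    Sketch only: shallow-layer features carry the low-level detail that
    pixel reconstruction needs, so we mix them into the representation
    instead of decoding from the last layer alone.
    """

    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        # One projection per tapped layer keeps feature spaces comparable.
        self.projs = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_layers))
        # Learnable scalar weight per layer, normalized with softmax.
        self.weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: list of (B, N, dim) hidden states from the tapped layers.
        projected = torch.stack([p(f) for p, f in zip(self.projs, feats)])
        w = torch.softmax(self.weights, dim=0).view(-1, 1, 1, 1)
        return (w * projected).sum(dim=0)  # (B, N, dim), fed to the decoder
```

In an MAE-style pipeline, `feats` would be collected from a handful of encoder layers (shallow ones included) of, e.g., a ViT-S, and the fused output would replace the last-layer hidden state as the decoder input.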
Related papers
- Multi-Head Attention Residual Unfolded Network for Model-Based Pansharpening [2.874893537471256] (arXiv, 2024-09-04)
Unfolding fusion methods integrate the powerful representation capabilities of deep learning with the robustness of model-based approaches.
In this paper, we propose a model-based deep unfolded method for satellite image fusion.
Experimental results on PRISMA, Quickbird, and WorldView2 datasets demonstrate the superior performance of our method.
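The summary gives no equations, so the following is a generic deep-unfolding skeleton under stated assumptions: each stage pairs a data-fidelity gradient step (with average-pool downsampling standing in for the sensing model) with a small learned prior network conditioned on the panchromatic band. All layer sizes and operator choices are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnfoldedFusion(nn.Module):
    """Sketch of K unfolded iterations for pansharpening-style fusion.

    Illustrative model assumption: the low-res multispectral image is a
    downsampled version of the target, lr ~ D(x), and the learned prior
    network plays the role of a proximal/denoising operator.
    """

    def __init__(self, channels: int = 4, stages: int = 5, scale: int = 4):
        super().__init__()
        self.scale = scale
        self.step = nn.Parameter(torch.full((stages,), 0.5))  # learned step sizes
        self.priors = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels + 1, 32, 3, padding=1), nn.ReLU(),
                nn.Conv2d(32, channels, 3, padding=1),
            )
            for _ in range(stages)
        )

    def forward(self, lr_ms: torch.Tensor, pan: torch.Tensor) -> torch.Tensor:
        # lr_ms: (B, C, H, W) multispectral; pan: (B, 1, sH, sW) panchromatic.
        x = F.interpolate(lr_ms, scale_factor=self.scale, mode="bicubic")
        for k, prior in enumerate(self.priors):
            # Gradient step on ||D(x) - lr_ms||^2, D = average-pool downsampling.
            residual = F.avg_pool2d(x, self.scale) - lr_ms
            grad = F.interpolate(residual, scale_factor=self.scale, mode="nearest")
            x = x - self.step[k] * grad
            # Learned prior step, conditioned on the panchromatic image.
            x = x + prior(torch.cat([x, pan], dim=1))
        return x
```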
- Parameter-Inverted Image Pyramid Networks [49.35689698870247] (arXiv, 2024-06-06)
We propose a novel network architecture known as Parameter-Inverted Image Pyramid Networks (PIIP).
Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid.
PIIP achieves superior performance in tasks such as object detection, segmentation, and image classification.
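The stated idea, larger models on lower-resolution pyramid levels, can be sketched as follows; the convolutional branches stand in for differently sized backbones (e.g., ViT variants in the real system), and the widths, scales, and sum-fusion are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParameterInvertedPyramid(nn.Module):
    """Sketch: pair each image-pyramid level with an inversely sized branch.

    High-resolution input -> small/cheap branch; low-resolution input ->
    wide branch. Branch widths here stand in for "models with different
    parameter sizes".
    """

    def __init__(self, out_dim: int = 256):
        super().__init__()
        # Widths inverted relative to resolution: full res gets the
        # narrowest branch, quarter res the widest.
        specs = [(1.0, 32), (0.5, 64), (0.25, 128)]
        self.scales = [s for s, _ in specs]
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(3, w, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(w, out_dim, 3, stride=2, padding=1),
            )
            for _, w in specs
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = []
        for scale, branch in zip(self.scales, self.branches):
            x = image if scale == 1.0 else F.interpolate(
                image, scale_factor=scale, mode="bilinear")
            f = branch(x)
            # Resize every level to a common grid and fuse by summation.
            feats.append(F.interpolate(f, size=feats[0].shape[-2:],
                                       mode="bilinear") if feats else f)
        return torch.stack(feats).sum(dim=0)
```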
- Dual-Scale Transformer for Large-Scale Single-Pixel Imaging [11.064806978728457] (arXiv, 2024-04-07)
We propose HATNet, a deep unfolding network with a hybrid-attention Transformer built on the Kronecker SPI model, to improve the imaging quality of real SPI cameras.
Its gradient descent module avoids the high computational overhead of previous gradient descent modules based on vectorized SPI.
The denoising module is an encoder-decoder architecture powered by dual-scale spatial attention for high- and low-frequency aggregation and channel attention for global information recalibration.
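The efficiency claim about vectorized SPI plausibly follows from a Kronecker-factorized sensing operator, under which the forward model reduces to two small matrix products; below is a hedged sketch of one such gradient step (function name, factor shapes, and step size `rho` are illustrative).

```python
import torch

def kron_spi_gradient_step(x, y, phi_l, phi_r, rho=0.1):
    """One data-fidelity gradient step for Kronecker-structured SPI.

    Sketch of the efficiency argument: with a Kronecker-factorized sensing
    matrix, the forward model on an image X acts as phi_l @ X @ phi_r.T,
    so the huge vectorized operator is never materialized.
    """
    # Forward model: Y = phi_l X phi_r^T (all small matmuls).
    residual = phi_l @ x @ phi_r.T - y
    # Adjoint: phi_l^T R phi_r.
    grad = phi_l.T @ residual @ phi_r
    return x - rho * grad

# Tiny usage example with random operators (illustrative sizes).
x = torch.randn(128, 128)            # current image estimate
phi_l = torch.randn(32, 128) / 128   # left measurement factor
phi_r = torch.randn(32, 128) / 128   # right measurement factor
y = phi_l @ torch.randn(128, 128) @ phi_r.T  # simulated measurements
x_next = kron_spi_gradient_step(x, y, phi_l, phi_r)
```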
- Deep Neural Networks Fused with Textures for Image Classification [20.58839604333332] (arXiv, 2023-08-03)
Fine-grained image classification (FGIC) is a challenging task in computer vision.
We propose a fusion approach to address FGIC by combining global texture with local patch-based information.
Our method attains better classification accuracy than existing methods by notable margins.
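A hedged sketch of one plausible fusion of the two streams follows, with a Gram-matrix head (a common texture representation) for the global texture stream and a pooled deep branch standing in for the patch-based stream; both choices are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TexturePatchFusion(nn.Module):
    """Sketch: fuse a global texture descriptor with local content features.

    Texture stream: Gram matrix of shallow CNN features. Content stream:
    average-pooled deep features. Both are illustrative stand-ins.
    """

    def __init__(self, num_classes: int):
        super().__init__()
        self.shallow = nn.Conv2d(3, 16, 3, padding=1)           # texture stream
        self.deep = nn.Sequential(                              # content stream
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16 * 16 + 64, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.shallow(x).flatten(2)                          # (B, 16, HW)
        gram = (f @ f.transpose(1, 2)) / f.size(-1)             # (B, 16, 16)
        texture = gram.flatten(1)                               # global texture
        content = self.deep(x).flatten(1)                       # local content
        return self.head(torch.cat([texture, content], dim=1))
```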
- PixMIM: Rethinking Pixel Reconstruction in Masked Image Modeling [83.67628239775878] (arXiv, 2023-03-04)
Masked Image Modeling (MIM) has achieved promising progress with the advent of Masked Autoencoders (MAE) and BEiT.
This paper undertakes a fundamental analysis of MIM from the perspective of pixel reconstruction.
We propose a remarkably simple and effective method, PixMIM, that entails two strategies.
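The two strategies are left unnamed in this summary; one component commonly described for PixMIM is a low-frequency reconstruction target, sketched here with a crude resize-based low-pass filter (the exact filter and the `keep` fraction are assumptions).

```python
import torch
import torch.nn.functional as F

def low_frequency_target(images: torch.Tensor, keep: float = 0.25) -> torch.Tensor:
    """Build a low-pass reconstruction target for pixel-based MIM.

    Sketch only: down- then up-sampling acts as a crude low-pass filter,
    discarding the high-frequency detail that pixel reconstruction would
    otherwise over-fit to.
    """
    b, c, h, w = images.shape
    lo = F.interpolate(images, size=(int(h * keep), int(w * keep)), mode="bicubic")
    return F.interpolate(lo, size=(h, w), mode="bicubic")

# Usage: replace raw pixels with the filtered target in the MIM loss.
imgs = torch.rand(8, 3, 224, 224)
target = low_frequency_target(imgs)   # reconstruct this instead of imgs
```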
- Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method [51.30748775681917] (arXiv, 2022-12-22)
We consider the task of low-light image enhancement (LLIE) and introduce a large-scale database consisting of images at 4K and 8K resolution.
We conduct systematic benchmarking studies and provide a comparison of current LLIE algorithms.
As a second contribution, we introduce LLFormer, a transformer-based low-light enhancement method.
- Highly Efficient Natural Image Matting [15.977598189574659] (arXiv, 2021-10-25)
We propose a trimap-free natural image matting method with a lightweight model.
We construct an extremely lightweight model that achieves performance comparable to large models with only 1% of their parameters (344k) on popular natural image benchmarks.
- FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning [64.32306537419498] (arXiv, 2020-07-16)
We propose a novel learned feature-based refinement and augmentation method that produces a varied set of complex transformations.
These transformations also use information from both within-class and across-class representations that we extract through clustering.
We demonstrate that our method is comparable to the current state of the art on smaller datasets while being able to scale up to larger datasets.
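A minimal sketch of feature-level augmentation against cluster prototypes: attention lets each feature borrow within-class and across-class information from the prototypes. The clustering step itself is omitted, and the attention-plus-mix design and all names are assumptions.

```python
import torch
import torch.nn as nn

class PrototypeAugment(nn.Module):
    """Sketch: refine/augment a feature by attending to cluster prototypes.

    Prototypes would come from clustering (e.g., k-means) over within-class
    and across-class features; attention produces a learned, feature-level
    transformation of each input.
    """

    def __init__(self, dim: int):
        super().__init__()
        # dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.mix = nn.Linear(2 * dim, dim)

    def forward(self, feats: torch.Tensor, prototypes: torch.Tensor) -> torch.Tensor:
        # feats: (B, dim) batch features; prototypes: (K, dim) cluster centers.
        q = feats.unsqueeze(1)                                   # (B, 1, dim)
        kv = prototypes.unsqueeze(0).expand(feats.size(0), -1, -1)
        refined, _ = self.attn(q, kv, kv)                        # (B, 1, dim)
        return self.mix(torch.cat([feats, refined.squeeze(1)], dim=1))
```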
- Normalizing Flows with Multi-Scale Autoregressive Priors [131.895570212956] (arXiv, 2020-04-08)
We introduce channel-wise dependencies in the latent space of normalizing flows through multi-scale autoregressive priors (mAR).
Our mAR prior for models with split coupling flow layers (mAR-SCF) can better capture dependencies in complex multimodal data.
We show that mAR-SCF allows for improved image generation quality, with gains in FID and Inception scores compared to state-of-the-art flow-based models.
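A sketch of the ingredient this summary highlights, a channel-wise autoregressive prior over a flow latent: an LSTM predicts each channel slice's Gaussian parameters from the preceding channels. This stands in for the multi-scale mAR prior and is not its exact architecture.

```python
import torch
import torch.nn as nn

class ChannelARPrior(nn.Module):
    """Sketch: autoregressive prior across the channels of a flow latent.

    p(z) = prod_c p(z_c | z_<c), each conditional a diagonal Gaussian whose
    parameters an LSTM predicts from the preceding channel slices.
    """

    def __init__(self, spatial: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(spatial * spatial, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 2 * spatial * spatial)  # mean and log-scale

    def log_prob(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, H, W) latent with H * W == spatial ** 2.
        slices = z.flatten(2)                        # (B, C, H*W), one step/channel
        # Condition channel c on channels < c by shifting the input right.
        shifted = torch.cat([torch.zeros_like(slices[:, :1]), slices[:, :-1]], dim=1)
        hidden, _ = self.lstm(shifted)
        mean, log_scale = self.out(hidden).chunk(2, dim=-1)
        dist = torch.distributions.Normal(mean, log_scale.exp())
        return dist.log_prob(slices).sum(dim=(1, 2))  # (B,) log p(z)
```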
This list is automatically generated from the titles and abstracts of the papers on this site.