MULLER: Multilayer Laplacian Resizer for Vision
- URL: http://arxiv.org/abs/2304.02859v1
- Date: Thu, 6 Apr 2023 04:39:21 GMT
- Title: MULLER: Multilayer Laplacian Resizer for Vision
- Authors: Zhengzhong Tu, Peyman Milanfar, Hossein Talebi
- Abstract summary: We present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer.
We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost.
- Score: 16.67232499096539
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Image resizing operation is a fundamental preprocessing module in modern
computer vision. Throughout the deep learning revolution, researchers have
overlooked the potential of alternative resizing methods beyond the commonly
used resizers that are readily available, such as nearest-neighbors, bilinear,
and bicubic. The key question of our interest is whether the front-end resizer
affects the performance of deep vision models. In this paper, we present an
extremely lightweight multilayer Laplacian resizer with only a handful of
trainable parameters, dubbed MULLER resizer. MULLER has a bandpass nature in
that it learns to boost details in certain frequency subbands that benefit the
downstream recognition models. We show that MULLER can be easily plugged into
various training pipelines, and it effectively boosts the performance of the
underlying vision task with little to no extra cost. Specifically, we select a
state-of-the-art vision Transformer, MaxViT, as the baseline, and show that,
when trained with MULLER, MaxViT gains up to 0.6% top-1 accuracy on
ImageNet-1k, and can reach similar top-1 accuracy with a 36% saving in
inference cost, as compared to the standard training scheme. Notably, MULLER's
performance also scales with model size and training data size such as
ImageNet-21k and JFT, and it is widely applicable to multiple vision tasks,
including image classification, object detection and segmentation, as well as
image quality assessment.
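To make the mechanism concrete, below is a minimal sketch of a multilayer Laplacian resizer in the spirit of the abstract: a standard base resize plus a few band-pass detail layers, each with a learnable scale and offset. The bilinear base resize, Gaussian band separation, and the (scale, offset) parameterization are assumptions for illustration, not the paper's exact formulation.

```python
# Minimal sketch of a multilayer Laplacian resizer in the spirit of MULLER.
# Assumptions (illustrative, not the paper's exact design): bilinear base
# resize, Gaussian blurs to separate frequency subbands, and one learnable
# (scale, offset) pair per layer.
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def muller_like_resize(img, out_hw, scales=(1.0, 0.5), offsets=(0.0, 0.0), sigma=1.0):
    """img: HxWxC float array in [0, 1]; out_hw: target (height, width)."""
    h, w = img.shape[:2]
    base = zoom(img, (out_hw[0] / h, out_hw[1] / w, 1), order=1)  # bilinear base resize
    out, level = base, base
    for s, b in zip(scales, offsets):                  # one band-pass layer per pair
        blurred = gaussian_filter(level, sigma=(sigma, sigma, 0))
        detail = level - blurred                       # Laplacian (band-pass) residual
        out = out + s * detail + b                     # boost this frequency subband
        level = blurred                                # next layer sees a coarser band
    return np.clip(out, 0.0, 1.0)

# Example: resize a random 256x256 RGB image to 224x224.
x = np.random.rand(256, 256, 3)
y = muller_like_resize(x, (224, 224))
```

In training, `scales` and `offsets` would be the handful of trainable parameters learned jointly with the downstream model; everything else is fixed signal processing.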
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
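Under the assumption that the mutual attention amounts to standard cross-attention between support and query patch embeddings (the paper's exact formulation may differ), a minimal sketch:

```python
# Hypothetical sketch: mutual (cross) attention between support and query
# patch embeddings for few-shot matching. Shapes and scoring are assumptions.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mutual_attention(support, query):
    """support: (Ns, D) patch embeddings; query: (Nq, D) patch embeddings."""
    scale = support.shape[-1] ** -0.5
    q2s = softmax(query @ support.T * scale) @ support   # query attends to support
    s2q = softmax(support @ query.T * scale) @ query     # support attends to query
    return s2q, q2s  # enriched embeddings used for the few-shot comparison

support = np.random.randn(16, 64)   # 16 support patches, dim 64
query = np.random.randn(16, 64)
s_enriched, q_enriched = mutual_attention(support, query)
```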
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training [103.72844619581811]
We build performant Multimodal Large Language Models (MLLMs).
In particular, we study the importance of various architecture components and data choices.
We demonstrate that a careful mix of image-caption, interleaved image-text, and text-only data is crucial for large-scale multimodal pre-training.
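As a concrete (and purely illustrative) reading of the data-mix finding, a mixture sampler over the three data types might look like the following; the weights are hypothetical, not MM1's reported ratios.

```python
# Hypothetical mixture sampler over the three pre-training data types.
# The 45/45/10 weights are illustrative only, not MM1's reported ratios.
import random

MIX = {"image_caption": 0.45, "interleaved_image_text": 0.45, "text_only": 0.10}

def sample_batch_source(rng=random):
    r, acc = rng.random(), 0.0
    for source, weight in MIX.items():
        acc += weight
        if r < acc:
            return source
    return source  # fallback for floating-point edge cases

print([sample_batch_source() for _ in range(5)])
```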
arXiv Detail & Related papers (2024-03-14T17:51:32Z)
- Improve Supervised Representation Learning with Masked Image Modeling [30.30649867772395]
We propose a simple yet effective setup that can easily integrate masked image modeling into existing supervised training paradigms.
We show that, with minimal changes in architecture and no inference overhead, this setup improves the quality of the learned representations.
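A minimal sketch of such a setup, assuming a model with a classification head plus a lightweight reconstruction head; the mask ratio and loss weight are illustrative.

```python
# Sketch: combining supervised classification with a masked image modeling
# (MIM) objective. The 0.5 loss weight and 0.6 mask ratio are assumptions,
# as is the two-headed model returning (logits, reconstruction).
import torch
import torch.nn.functional as F

def combined_loss(model, images, labels, mask_ratio=0.6, mim_weight=0.5):
    b, c, h, w = images.shape
    mask = (torch.rand(b, 1, h, w, device=images.device) < mask_ratio).float()
    masked = images * (1 - mask)                   # zero out masked pixels
    logits, recon = model(masked)                  # assumed two-headed model
    cls_loss = F.cross_entropy(logits, labels)     # supervised objective
    mim_loss = ((recon - images) ** 2 * mask).sum() / mask.sum().clamp(min=1) / c
    return cls_loss + mim_weight * mim_loss        # reconstruct only masked pixels
```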
arXiv Detail & Related papers (2023-12-01T22:03:25Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
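As a generic illustration of the dual formulation, the sketch below implements a retention mechanism with mathematically equivalent parallel and recurrent forms; it follows the retentive-network recurrence and is not ViR's exact architecture.

```python
# Sketch of a retention mechanism with equivalent parallel and recurrent
# forms (generic, not ViR's exact design). gamma is a fixed decay factor.
import numpy as np

def retention_parallel(q, k, v, gamma=0.9):
    n = q.shape[0]
    idx = np.arange(n)
    decay = np.tril(gamma ** (idx[:, None] - idx[None, :]))  # causal decay mask
    return (q @ k.T * decay) @ v                             # parallel (training) form

def retention_recurrent(q, k, v, gamma=0.9):
    state = np.zeros((q.shape[1], v.shape[1]))
    out = []
    for t in range(q.shape[0]):
        state = gamma * state + np.outer(k[t], v[t])  # O(1) state update (inference)
        out.append(q[t] @ state)
    return np.stack(out)

q, k, v = (np.random.randn(8, 16) for _ in range(3))
assert np.allclose(retention_parallel(q, k, v), retention_recurrent(q, k, v))
```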
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- FastMIM: Expediting Masked Image Modeling Pre-training for Vision [65.47756720190155]
FastMIM is a framework for pre-training vision backbones with low-resolution input images.
It reconstructs Histogram of Oriented Gradients (HOG) features instead of the original RGB values of the input images.
It can achieve 83.8%/84.1% top-1 accuracy on ImageNet-1K with ViT-B/Swin-B as backbones.
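To make the HOG-target idea concrete, a minimal sketch follows, with scikit-image's HOG implementation standing in for the paper's extractor (an assumption):

```python
# Sketch: computing a HOG feature target for masked image modeling, with
# scikit-image standing in for the paper's HOG extractor (an assumption).
import numpy as np
from skimage.feature import hog

def hog_target(image_gray):
    """image_gray: HxW float array; returns the HOG feature vector the
    decoder would be trained to reconstruct instead of raw RGB values."""
    return hog(image_gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(1, 1), feature_vector=True)

target = hog_target(np.random.rand(96, 96))  # low-resolution input, per FastMIM
print(target.shape)  # (1296,) = 12 x 12 cells x 9 orientation bins
```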
arXiv Detail & Related papers (2022-12-13T14:09:32Z)
- EVA: Exploring the Limits of Masked Visual Representation Learning at Scale [46.952339726872374]
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale.
EVA is a vanilla ViT pre-trained to reconstruct the masked-out, image-text aligned vision features, conditioned on visible image patches.
We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform the from-scratch counterpart with far fewer samples and less compute.
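In sketch form, the objective regresses the CLIP vision features at masked patch positions; the cosine regression loss below is an assumed, common choice:

```python
# Sketch of an EVA-style pre-training loss: regress image-text aligned (CLIP)
# vision features at masked patch positions. Cosine distance is an assumed
# choice of regression loss, for illustration only.
import torch
import torch.nn.functional as F

def feature_reconstruction_loss(pred, clip_feats, mask):
    """pred, clip_feats: (B, N, D) per-patch features; mask: (B, N) bool,
    True where the patch was masked out of the student's input."""
    pred = F.normalize(pred[mask], dim=-1)
    target = F.normalize(clip_feats[mask], dim=-1)
    return (1 - (pred * target).sum(dim=-1)).mean()   # 1 - cosine similarity
```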
arXiv Detail & Related papers (2022-11-14T18:59:52Z)
- Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model [97.9548609175831]
We resort to plain vision transformers with about 100 million parameters and make the first attempt to propose large vision models customized for remote sensing tasks.
Specifically, to handle the large image size and objects of various orientations in RS images, we propose a new rotated varied-size window attention.
Experiments on detection tasks demonstrate the superiority of our model over all state-of-the-art models, achieving 81.16% mAP on the DOTA-V1.0 dataset.
arXiv Detail & Related papers (2022-08-08T09:08:40Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring the intrinsic Inductive Bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet Real validation set.
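The multi-scale tokenization can be illustrated roughly as below; the parallel dilated convolutions, channel counts, and stride are illustrative assumptions, not ViTAE's exact module.

```python
# Sketch of a spatial-pyramid-reduction-style tokenizer: parallel dilated
# convolutions capture multi-scale context before downsampling to tokens.
# Channel counts, dilation rates, and stride are illustrative assumptions.
import torch
import torch.nn as nn

class PyramidReduction(nn.Module):
    def __init__(self, in_ch=3, out_ch=96, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch // len(dilations), kernel_size=3,
                      stride=4, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x):                        # x: (B, 3, H, W)
        feats = torch.cat([b(x) for b in self.branches], dim=1)
        return feats.flatten(2).transpose(1, 2)  # (B, H/4 * W/4, out_ch) tokens

tokens = PyramidReduction()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96]) — a 56x56 grid of tokens
```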
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Learning to Resize Images for Computer Vision Tasks [15.381549764216134]
We show that the typical linear resizer can be replaced with learned resizers that can substantially improve performance.
Our learned image resizer is jointly trained with a baseline vision model.
We show that the proposed resizer can also be useful for fine-tuning the classification baselines for other vision tasks.
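A minimal sketch of the joint training recipe: the task loss backpropagates through both the resizer and the baseline model, so the resizer is learned for the task rather than for pixel fidelity. Module names here are placeholders.

```python
# Sketch: jointly training a learnable resizer with a baseline vision model.
# `resizer` and `classifier` are placeholders for any differentiable modules.
import torch
import torch.nn.functional as F

def train_step(resizer, classifier, optimizer, images, labels):
    optimizer.zero_grad()
    resized = resizer(images)                  # learned resize, e.g. 480 -> 224
    loss = F.cross_entropy(classifier(resized), labels)
    loss.backward()                            # gradients flow into the resizer
    optimizer.step()
    return loss.item()

# The optimizer covers both modules, so the resizer adapts to the task:
# optimizer = torch.optim.AdamW(
#     list(resizer.parameters()) + list(classifier.parameters()), lr=1e-4)
```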
arXiv Detail & Related papers (2021-03-17T23:43:44Z)
- Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely, the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multiple heads and tails, one per restoration task.
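A sketch of the corrupted-pair generation; the specific corruptions (x2 cubic downsampling, Gaussian noise) are standard examples assumed here for illustration, not necessarily IPT's full set.

```python
# Sketch: generating (corrupted, clean) training pairs from ImageNet images,
# one pair per restoration task. The corruption choices are assumptions:
# x2 cubic downsampling for super-resolution, Gaussian noise for denoising.
import numpy as np
from scipy.ndimage import zoom

def make_pairs(clean, noise_sigma=0.1):
    """clean: HxWxC float image in [0, 1] -> dict of task -> (input, target)."""
    low_res = zoom(clean, (0.5, 0.5, 1), order=3)          # super-resolution input
    noisy = np.clip(clean + np.random.randn(*clean.shape) * noise_sigma, 0, 1)
    return {"super_resolution": (low_res, clean), "denoising": (noisy, clean)}

pairs = make_pairs(np.random.rand(224, 224, 3))
```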
arXiv Detail & Related papers (2020-12-01T09:42:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.