Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling
- URL: http://arxiv.org/abs/2210.01370v1
- Date: Tue, 4 Oct 2022 04:20:20 GMT
- Title: Towards Flexible Inductive Bias via Progressive Reparameterization Scheduling
- Authors: Yunsung Lee, Gyuseong Lee, Kwangrok Ryoo, Hyojun Go, Jihye Park, and Seungryong Kim
- Abstract summary: There are two de facto standard architectures in computer vision: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs).
We show that these approaches overlook that the optimal inductive bias also changes as the target data scale changes.
The more convolution-like inductive bias a model includes, the smaller the data scale at which the ViT-like model outperforms ResNet.
- Score: 25.76814731638375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: There are two de facto standard architectures in recent computer vision:
Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Strong
inductive biases of convolutions help the model learn sample-efficiently, but
such strong biases also limit the upper bound of CNNs when sufficient data are
available. In contrast, ViTs are inferior to CNNs on small data but superior
for sufficient data. Recent approaches attempt to combine the strengths of
these two architectures. However, by comparing the accuracy of various models on
ImageNet subsets sampled at different ratios, we show that these approaches
overlook that the optimal inductive bias also changes with the target data
scale. In addition, through Fourier analysis of feature maps, which reveals how
a model's response pattern changes with signal frequency, we observe which
inductive bias is advantageous at each data scale. The more convolution-like
inductive bias a model includes, the smaller the data scale at which the
ViT-like model outperforms ResNet.
To obtain a model with flexible inductive bias on the data scale, we show
reparameterization can interpolate inductive bias between convolution and
self-attention. By adjusting the number of epochs the model stays in the
convolution stage, we show that reparameterizing from convolution to
self-attention interpolates the Fourier analysis pattern between those of CNNs
and ViTs. Building on these
findings, we propose Progressive Reparameterization Scheduling (PRS), in which
reparameterization adjusts the required amount of convolution-like or
self-attention-like inductive bias per layer. For small-scale datasets, our PRS
performs the reparameterization from convolution to self-attention linearly
faster at later-stage layers. PRS outperforms previous studies on small-scale
datasets such as CIFAR-100.
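To make the scheduling idea concrete, here is a minimal sketch of one plausible linear per-layer schedule in the spirit of PRS: later layers are reparameterized from convolution to self-attention earlier in training, so early layers keep the convolution-like inductive bias the longest. The function names, the exact switch rule, and the layer/epoch counts are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a linear per-layer reparameterization schedule in the
# spirit of PRS.  The exact schedule used in the paper may differ.

def prs_switch_epoch(layer_idx: int, num_layers: int, total_epochs: int) -> int:
    """Epoch at which a layer is reparameterized from convolution to
    self-attention.  Later (late-stage) layers switch earlier, so early
    layers retain the convolution-like inductive bias the longest."""
    frac = 1.0 - layer_idx / max(num_layers - 1, 1)
    return int(round(frac * total_epochs))

def layer_uses_attention(layer_idx: int, num_layers: int,
                         epoch: int, total_epochs: int) -> bool:
    """True if the given layer should run as self-attention at this epoch."""
    return epoch >= prs_switch_epoch(layer_idx, num_layers, total_epochs)

num_layers, total_epochs = 12, 100
for epoch in (0, 25, 50, 75, 100):
    active = sum(layer_uses_attention(i, num_layers, epoch, total_epochs)
                 for i in range(num_layers))
    print(f"epoch {epoch:3d}: {active}/{num_layers} layers run as self-attention")
```

The Fourier analysis referred to above compares how strongly feature maps respond at different spatial frequencies. The following is a hedged, minimal version of that kind of measurement; the exact metric used in the paper may differ.

```python
import numpy as np

def high_freq_log_amplitude(feature_map: np.ndarray) -> float:
    """Log-amplitude of the highest-frequency band relative to the
    zero-frequency component, averaged over channels.
    feature_map: array of shape (channels, height, width)."""
    c, h, w = feature_map.shape
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(feature_map), axes=(-2, -1)))
    yy, xx = np.meshgrid(np.arange(h) - h // 2, np.arange(w) - w // 2, indexing="ij")
    radius = np.sqrt(yy ** 2 + xx ** 2)
    centre = amplitude[:, h // 2, w // 2]                  # zero-frequency component
    outer = amplitude[:, radius >= 0.9 * radius.max()]     # highest-frequency band
    return float(np.mean(np.log(outer.mean(axis=1) + 1e-8) - np.log(centre + 1e-8)))

rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0.0, 1.0, 32), (8, 32, 1))    # low-frequency feature map
noisy = rng.standard_normal((8, 32, 32))                   # high-frequency feature map
print(high_freq_log_amplitude(smooth) < high_freq_log_amplitude(noisy))  # True
```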
Related papers
- Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2024-11-02T18:18:35Z)
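As a rough illustration of mixing frequency-domain information into prompt tuning, the sketch below applies a 2-D FFT to learnable prompt embeddings before prepending them to the visual tokens. The module name, shapes, and the choice of keeping the real part are assumptions for illustration, not the VFPT implementation.

```python
import torch
import torch.nn as nn

class FourierPrompt(nn.Module):
    """Hedged sketch: learnable prompts passed through an FFT before being
    prepended to the visual tokens."""
    def __init__(self, num_prompts: int = 10, dim: int = 768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_tokens, dim)
        # 2-D FFT over the (prompt, dim) grid mixes in frequency-domain
        # information; keep the real part so shapes stay unchanged (assumption).
        freq_prompts = torch.fft.fft2(self.prompts).real
        batch = visual_tokens.shape[0]
        prompts = freq_prompts.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompts, visual_tokens], dim=1)

tokens = torch.randn(2, 196, 768)
print(FourierPrompt()(tokens).shape)   # torch.Size([2, 206, 768])
```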
- Convolutional Initialization for Data-Efficient Vision Transformers [38.63299194992718]
Training vision transformer networks on small datasets poses challenges.
CNNs can achieve state-of-the-art performance by leveraging their architectural inductive bias.
Our approach is motivated by the finding that random impulse filters can achieve almost comparable performance to learned filters in CNNs.
arXiv Detail & Related papers (2024-01-23T06:03:16Z)
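A hedged sketch of what a "random impulse" convolution filter looks like in practice: each k x k kernel is zero except for a single randomly placed entry. The impulse value and placement rule here are illustrative assumptions, not necessarily the initialization used in the paper.

```python
import torch
import torch.nn as nn

def random_impulse_init_(conv: nn.Conv2d) -> None:
    """Set every kernel to zero except one randomly placed unit impulse."""
    out_ch, in_ch, kh, kw = conv.weight.shape
    with torch.no_grad():
        conv.weight.zero_()
        for o in range(out_ch):
            for i in range(in_ch):
                y = torch.randint(kh, (1,)).item()
                x = torch.randint(kw, (1,)).item()
                conv.weight[o, i, y, x] = 1.0   # impulse value is an assumption

conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
random_impulse_init_(conv)
print(conv.weight[0, 0])   # a 3x3 kernel with a single nonzero entry
```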
- Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy [55.014926694758195]
Entropy and mutual information in neural networks provide rich information on the learning process.
We leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures.
We show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data.
arXiv Detail & Related papers (2023-12-04T01:32:42Z)
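A minimal sketch of a diffusion-based spectral entropy, assuming the usual recipe: build a row-stochastic diffusion matrix from pairwise Gaussian affinities of the representations, then take the Shannon entropy of its (powered, normalized) eigenvalue spectrum. The kernel bandwidth and diffusion time are illustrative choices, and the paper's exact estimator may differ.

```python
import numpy as np

def diffusion_spectral_entropy(Z: np.ndarray, sigma: float = 10.0, t: int = 1) -> float:
    """Z: (n_samples, dim) hidden representations."""
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    K = np.exp(-d2 / (2 * sigma ** 2))                    # Gaussian affinities
    P = K / K.sum(axis=1, keepdims=True)                  # row-stochastic diffusion matrix
    eig = np.abs(np.linalg.eigvals(P)) ** t               # eigenvalue magnitudes at time t
    p = eig / eig.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())                  # Shannon entropy of the spectrum

Z = np.random.default_rng(0).standard_normal((200, 64))
print(diffusion_spectral_entropy(Z))
```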
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
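For context, below is a minimal sketch of the classic unbiased column-row (CR) sampling estimator of a matrix product, the building block that WTA-CRS refines with a winner-take-all selection step (not reproduced here). Function names and sampling sizes are illustrative.

```python
import numpy as np

def cr_sample_matmul(A: np.ndarray, B: np.ndarray, k: int,
                     rng: np.random.Generator) -> np.ndarray:
    """Unbiased estimate of A @ B from k sampled column-row pairs."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()                       # sampling distribution over indices
    idx = rng.choice(A.shape[1], size=k, p=p)
    # Scale each sampled outer product by 1 / (k * p_j) to keep the estimator unbiased.
    return sum(np.outer(A[:, j], B[j, :]) / (k * p[j]) for j in idx)

rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 512)), rng.standard_normal((512, 64))
est = cr_sample_matmul(A, B, k=128, rng=rng)
print(np.linalg.norm(est - A @ B) / np.linalg.norm(A @ B))   # relative error
```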
- Bias-variance decomposition of overparameterized regression with random linear features [0.0]
"Over parameterized models" avoid overfitting even when the number of fit parameters is large enough to perfectly fit the training data.
We show how each transition arises due to small nonzero eigenvalues in the Hessian matrix.
We compare and contrast the phase diagram of the random linear features model to the random nonlinear features model and ordinary regression.
arXiv Detail & Related papers (2022-03-10T16:09:21Z)
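A minimal sketch of the random linear features setting, assuming the standard construction: a fixed random linear map produces more features than samples, and the Hessian of the squared loss then has many zero and small eigenvalues. The sizes below are illustrative, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_input, n_features = 50, 100, 80         # more features than samples
X = rng.standard_normal((n_samples, n_input))
W = rng.standard_normal((n_input, n_features))       # fixed random linear features
Phi = X @ W                                          # design matrix of the fit
y = rng.standard_normal(n_samples)

# Hessian of the squared loss with respect to the fitted weights.
H = Phi.T @ Phi / n_samples
eigvals = np.linalg.eigvalsh(H)
nonzero = eigvals[eigvals > 1e-10]
print(f"{eigvals.size - nonzero.size} zero eigenvalues, "
      f"smallest nonzero = {nonzero.min():.3g}")

# Minimum-norm interpolating solution via the pseudoinverse.
w_fit = np.linalg.pinv(Phi) @ y
print("training residual:", np.linalg.norm(Phi @ w_fit - y))
```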
- How Well Do Sparse Imagenet Models Transfer? [75.98123173154605]
Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" datasets.
In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (CNNs) trained on the ImageNet dataset.
We show that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities.
arXiv Detail & Related papers (2021-11-26T11:58:51Z)
- Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)
- Disentangled Recurrent Wasserstein Autoencoder [17.769077848342334]
The recurrent Wasserstein Autoencoder (R-WAE) is a new framework for generative modeling of sequential data.
R-WAE disentangles the representation of an input sequence into static and dynamic factors.
Our models outperform other baselines with the same settings in terms of disentanglement and unconditional video generation.
arXiv Detail & Related papers (2021-01-19T07:43:25Z)
- Memorizing without overfitting: Bias, variance, and interpolation in over-parameterized models [0.0]
The bias-variance trade-off is a central concept in supervised learning.
Modern Deep Learning methods flout this dogma, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-10-26T22:31:04Z)
- ACDC: Weight Sharing in Atom-Coefficient Decomposed Convolution [57.635467829558664]
We introduce a structural regularization across convolutional kernels in a CNN.
We show that CNNs maintain performance with a dramatic reduction in parameters and computation.
arXiv Detail & Related papers (2020-09-04T20:41:47Z)
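A hedged sketch of an atom-coefficient decomposed convolution, under the assumption that each kernel is composed as a linear combination of a small shared dictionary of kernel atoms; layer sizes, the number of atoms, and the module name are illustrative, not the ACDC implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtomCoefficientConv2d(nn.Module):
    """Kernels are built from a small shared dictionary of k x k atoms, so
    only the atoms and the per-filter coefficients are learned."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, num_atoms: int = 6):
        super().__init__()
        self.atoms = nn.Parameter(torch.randn(num_atoms, k, k) * 0.1)        # shared dictionary
        self.coeff = nn.Parameter(torch.randn(out_ch, in_ch, num_atoms) * 0.1)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Compose full kernels on the fly: (out, in, atoms) x (atoms, k, k) -> (out, in, k, k)
        weight = torch.einsum("oia,akl->oikl", self.coeff, self.atoms)
        return F.conv2d(x, weight, padding=self.k // 2)

layer = AtomCoefficientConv2d(16, 32)
print(layer(torch.randn(1, 16, 8, 8)).shape)   # torch.Size([1, 32, 8, 8])
```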