Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
- URL: http://arxiv.org/abs/2410.06373v1
- Date: Tue, 8 Oct 2024 21:14:23 GMT
- Title: Unveiling the Backbone-Optimizer Coupling Bias in Visual Representation Learning
- Authors: Siyuan Li, Juanxi Tian, Zedong Wang, Luyuan Zhang, Zicheng Liu, Weiyang Jin, Yang Liu, Baigui Sun, Stan Z. Li
- Abstract summary: This paper delves into the interplay between vision backbones and optimizers, unveiling an inter-dependent phenomenon termed backbone-optimizer coupling bias (BOCB).
We observe that canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with SGD families, while recent architectures like ViTs and ConvNeXt share a tight coupling with the adaptive learning rate ones.
- Score: 54.956037293979506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper delves into the interplay between vision backbones and optimizers, unveiling an inter-dependent phenomenon termed \textit{\textbf{b}ackbone-\textbf{o}ptimizer \textbf{c}oupling \textbf{b}ias} (BOCB). We observe that canonical CNNs, such as VGG and ResNet, exhibit a marked co-dependency with SGD families, while recent architectures like ViTs and ConvNeXt share a tight coupling with the adaptive learning rate ones. We further show that BOCB can be introduced by both optimizers and certain backbone designs and may significantly impact the pre-training and downstream fine-tuning of vision models. Through in-depth empirical analysis, we summarize takeaways on recommended optimizers and insights into robust vision backbone architectures. We hope this work can inspire the community to question long-held assumptions on backbones and optimizers, stimulate further explorations, and thereby contribute to more robust vision systems. The source code and models are publicly available at https://bocb-ai.github.io/.
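A minimal sketch of the kind of experiment the abstract describes: crossing backbones with optimizer families and comparing outcomes. This assumes PyTorch and torchvision; the hyperparameters and model choices below are illustrative assumptions, not the paper's actual benchmark settings.

```python
# Probe backbone-optimizer coupling bias (BOCB): train the same backbones
# under different optimizer families and compare validation accuracy.
import torch
import torch.nn as nn
import torchvision

def make_optimizer(name, params):
    # Illustrative settings for each optimizer family, not the paper's.
    if name == "sgd":
        return torch.optim.SGD(params, lr=0.1, momentum=0.9, weight_decay=5e-4)
    if name == "adamw":
        return torch.optim.AdamW(params, lr=1e-3, weight_decay=0.05)
    raise ValueError(name)

def train_one(model, optimizer, loader, device="cuda"):
    criterion = nn.CrossEntropyLoss()
    model.to(device).train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Cross the backbones with the optimizer families; a marked accuracy gap
# between the two optimizers for a given backbone would indicate BOCB.
backbones = {"resnet50": torchvision.models.resnet50,
             "vit_b_16": torchvision.models.vit_b_16}
for bname, ctor in backbones.items():
    for oname in ("sgd", "adamw"):
        model = ctor(num_classes=10)
        opt = make_optimizer(oname, model.parameters())
        # train_one(model, opt, train_loader)  # supply your own DataLoader
```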
Related papers
- Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances [49.631908848868505]
Contrastive Language-Image Pretraining (CLIP) stands out as a prominent method for image representation learning.
We investigate the differences in CLIP performance among various neural architectures.
We propose a simple, yet effective approach to combine predictions from multiple backbones, leading to a notable performance boost of up to 6.34%.
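A minimal sketch of one way to combine predictions from multiple backbones: averaging their class probabilities. The paper's exact fusion rule may differ; uniform averaging is an illustrative assumption here.

```python
import torch

def ensemble_predict(logits_per_backbone):
    """logits_per_backbone: list of (batch, num_classes) tensors,
    one per backbone (e.g., ResNet and ViT CLIP towers)."""
    probs = [torch.softmax(l, dim=-1) for l in logits_per_backbone]
    # Average probabilities across backbones, then take the top class.
    return torch.stack(probs).mean(dim=0).argmax(dim=-1)

# Usage with dummy logits from two hypothetical backbones:
preds = ensemble_predict([torch.randn(4, 10), torch.randn(4, 10)])
```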
arXiv Detail & Related papers (2023-12-22T03:01:41Z)
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized visual prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer [95.71132572688143]
This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks.
Token mixers, such as self-attention in vision transformers (ViTs), are intended to perform information communication between different spatial tokens, but suffer from considerable computational cost and latency.
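A minimal sketch of a "token-mixer-free" block in this spirit: the attention sub-block is replaced by an identity. RIFormer actually uses an affine transform re-parameterized into LayerNorm at inference; the simplified version below is an assumption for illustration.

```python
import torch.nn as nn

class MixerFreeBlock(nn.Module):
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mixer = nn.Identity()   # self-attention removed
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x):                  # x: (batch, tokens, dim)
        x = x + self.token_mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))
```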
arXiv Detail & Related papers (2023-04-12T07:34:13Z)
- Ordinal Graph Gamma Belief Network for Social Recommender Systems [54.9487910312535]
We develop a hierarchical Bayesian model termed ordinal graph factor analysis (OGFA), which jointly models user-item and user-user interactions.
OGFA not only achieves good recommendation performance, but also extracts interpretable latent factors corresponding to representative user preferences.
We extend OGFA to the ordinal graph gamma belief network, a multi-stochastic-layer deep probabilistic model.
arXiv Detail & Related papers (2022-09-12T09:19:22Z)
- Self-Supervised Hypergraph Transformer for Recommender Systems [25.07482350586435]
We propose a Self-Supervised Hypergraph Transformer (SHT) framework for recommender systems.
A cross-view generative self-supervised learning component is proposed for data augmentation over the user-item interaction graph.
arXiv Detail & Related papers (2022-07-28T18:40:30Z)
- Hypergraph Contrastive Collaborative Filtering [44.8586906335262]
We propose a new self-supervised recommendation framework, Hypergraph Contrastive Collaborative Filtering (HCCF).
HCCF captures local and global collaborative relations with a hypergraph-enhanced cross-view contrastive learning architecture.
Our model effectively integrates the hypergraph structure encoding with self-supervised learning to reinforce the representation quality of recommender systems.
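A minimal sketch of the cross-view contrastive objective used by methods like HCCF: embeddings of the same user from a local graph view and a global hypergraph view are pulled together, while other users are pushed apart. The InfoNCE form below is a common choice and an assumption here, not HCCF's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_view_infonce(z_local, z_hyper, temperature=0.2):
    """z_local, z_hyper: (num_users, dim) embeddings from the two views."""
    z1 = F.normalize(z_local, dim=-1)
    z2 = F.normalize(z_hyper, dim=-1)
    logits = z1 @ z2.t() / temperature     # (num_users, num_users)
    targets = torch.arange(z1.size(0))     # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = cross_view_infonce(torch.randn(8, 64), torch.randn(8, 64))
```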
arXiv Detail & Related papers (2022-04-26T10:06:04Z)
- Learning Target-aware Representation for Visual Tracking via Informative Interactions [49.552877881662475]
We introduce a novel backbone architecture to improve the target-perception ability of feature representations for tracking.
The proposed GIM module and InBN mechanism are general and applicable to different backbone types including CNN and Transformer.
arXiv Detail & Related papers (2022-01-07T16:22:27Z)
- Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets [6.09170287691728]
We introduce blueprint separable convolutions (BSConv) as highly efficient building blocks for CNNs.
They are motivated by quantitative analyses of kernel properties from trained models.
Our approach provides a thorough theoretical derivation, interpretation, and justification for the application of depthwise separable convolutions.
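A minimal sketch of an unconstrained blueprint separable convolution (BSConv-U): a pointwise 1x1 convolution followed by a depthwise KxK convolution, i.e., the reverse ordering of a standard depthwise separable convolution. Layer settings below are illustrative assumptions.

```python
import torch.nn as nn

class BSConvU(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Pointwise first: mix channels into the blueprint.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        # Then a per-channel spatial (depthwise) convolution.
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=out_ch, bias=False)

    def forward(self, x):
        return self.depthwise(self.pointwise(x))
```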
arXiv Detail & Related papers (2020-03-30T15:23:27Z)