Training the Untrainable: Introducing Inductive Bias via Representational Alignment
- URL: http://arxiv.org/abs/2410.20035v2
- Date: Thu, 23 Oct 2025 20:40:02 GMT
- Title: Training the Untrainable: Introducing Inductive Bias via Representational Alignment
- Authors: Vighnesh Subramaniam, David Mayo, Colin Conwell, Tomaso Poggio, Boris Katz, Brian Cheung, Andrei Barbu
- Abstract summary: We show that architectures which traditionally are considered to be ill-suited for a task can be trained using inductive biases from another architecture. We show that guidance prevents FCN overfitting on ImageNet, narrows the vanilla RNN-Transformer gap, boosts plain CNNs toward ResNet accuracy, and aids Transformers on RNN-favored tasks.
- Score: 17.222963037741327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We demonstrate that architectures which traditionally are considered to be ill-suited for a task can be trained using inductive biases from another architecture. We call a network untrainable when it overfits, underfits, or converges to poor results even when its hyperparameters are tuned. For example, fully connected networks overfit on object recognition while deep convolutional networks without residual connections underfit. The traditional answer is to change the architecture to impose some inductive bias, although the nature of that bias is unknown. We introduce guidance, where a guide network steers a target network using a neural distance function. The target minimizes its task loss plus a layerwise representational similarity against the frozen guide. If the guide is trained, this transfers over both the architectural prior and the knowledge of the guide to the target. If the guide is untrained, this transfers over only part of the architectural prior of the guide. We show that guidance prevents FCN overfitting on ImageNet, narrows the vanilla RNN-Transformer gap, boosts plain CNNs toward ResNet accuracy, and aids Transformers on RNN-favored tasks. We further identify that guidance-driven initialization alone can mitigate FCN overfitting. Our method provides a mathematical tool to investigate priors and architectures, and in the long term, could automate architecture design.
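As a rough illustration of the guidance objective described above (a task loss plus a layerwise representational-similarity penalty against the frozen guide), here is a minimal PyTorch sketch. Linear CKA as the neural distance, the one-to-one layer pairing, and the weight `beta` are illustrative assumptions rather than the paper's exact choices.

```python
import torch

def linear_cka(x, y):
    """Linear CKA similarity between activation batches x [n, d1] and y [n, d2]."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    hsic = (y.T @ x).norm() ** 2                       # ||Y^T X||_F^2
    return hsic / ((x.T @ x).norm() * (y.T @ y).norm() + 1e-8)

def guided_loss(task_loss, target_acts, guide_acts, beta=1.0):
    """Task loss plus layerwise dissimilarity to the frozen guide; the
    guide's activations are detached so only the target network learns."""
    mismatch = sum(1.0 - linear_cka(t.flatten(1), g.detach().flatten(1))
                   for t, g in zip(target_acts, guide_acts))
    return task_loss + beta * mismatch
```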
Related papers
- Network of Theseus (like the ship) [17.044332179158392]
Network of Theseus (NoT) is a method for converting a trained, or even untrained, guide network architecture part-by-part into an entirely different target network architecture. At each stage, components in the guide network architecture are incrementally replaced with target architecture modules and aligned via representational similarity metrics. NoT expands the space of viable inference-time architectures, opening opportunities for better accuracy-efficiency tradeoffs and enabling more directed exploration of the architectural design space.
arXiv Detail & Related papers (2025-12-03T19:15:18Z)
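A minimal sketch of one NoT stage as described above, assuming mean-squared error as a stand-in alignment metric (the paper speaks of representational similarity metrics more generally):

```python
import torch
import torch.nn.functional as F

def not_stage_loss(x, guide_block, target_module):
    """One part-by-part conversion stage (sketch): train the candidate
    target-architecture module to reproduce the frozen guide component's
    representation."""
    with torch.no_grad():
        reference = guide_block(x)      # representation to be matched
    return F.mse_loss(target_module(x), reference)
```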
- Information-Theoretic Greedy Layer-wise Training for Traffic Sign Recognition [0.5024983453990065]
Greedy layer-wise training eliminates the need for cross-entropy loss and backpropagation. Most existing layer-wise training approaches have been evaluated only on relatively small datasets. We propose a novel layer-wise training approach based on the recently developed deterministic information bottleneck (DIB) and the matrix-based Rényi's $\alpha$-order entropy functional.
arXiv Detail & Related papers (2025-10-31T17:24:58Z)
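For concreteness, the matrix-based Rényi $\alpha$-order entropy mentioned above can be estimated from the eigenvalue spectrum of a trace-normalized kernel Gram matrix. The sketch below uses a Gaussian kernel with an arbitrary width; the full DIB training objective is not shown.

```python
import torch

def matrix_renyi_entropy(z, alpha=1.01, sigma=1.0):
    """Matrix-based Renyi alpha-order entropy of a batch of activations
    z [n, d]: S_alpha = log2(sum(lambda_i^alpha)) / (1 - alpha), where
    lambda_i are eigenvalues of the trace-normalized Gram matrix."""
    sq_dists = torch.cdist(z, z) ** 2
    K = torch.exp(-sq_dists / (2 * sigma ** 2))        # Gaussian Gram matrix
    A = K / K.trace()                                  # normalize to unit trace
    lam = torch.linalg.eigvalsh(A).clamp(min=1e-12)
    return torch.log2((lam ** alpha).sum()) / (1.0 - alpha)
```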
- NN-Former: Rethinking Graph Structure in Neural Architecture Representation [67.3378579108611]
Graph Neural Networks (GNNs) and transformers have shown promising performance in representing neural architectures. We show that sibling nodes are pivotal yet overlooked in previous research. Our approach consistently achieves promising performance in both accuracy and latency prediction.
arXiv Detail & Related papers (2025-07-01T15:46:18Z)
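To make the notion concrete, here is one plausible reading of "sibling nodes" in an architecture DAG, namely nodes that share at least one parent; the paper's exact definition may differ.

```python
from collections import defaultdict

def sibling_sets(adj):
    """Map each node of a DAG (adjacency: node -> list of children) to the
    set of nodes sharing at least one parent with it."""
    sib = defaultdict(set)
    for children in adj.values():
        for v in children:
            sib[v].update(c for c in children if c != v)
    return dict(sib)

# sibling_sets({"a": ["b", "c"], "b": ["d"], "c": ["d"]})
# -> {"b": {"c"}, "c": {"b"}, "d": set()}
```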
- Find A Winning Sign: Sign Is All We Need to Win the Lottery [52.63674911541416]
We show that a sparse network trained by an existing iterative pruning (IP) method can retain its basin of attraction if its parameter signs and normalization layer parameters are preserved. To take a step closer to finding a winning ticket, we alleviate this reliance on normalization layer parameters by preventing high error barriers along the linear path between the sparse network trained by our method and its counterpart with normalization layer parameters.
arXiv Detail & Related papers (2025-04-07T09:30:38Z)
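The role of parameter signs can be illustrated with a tiny sketch: redraw the magnitudes from a fresh initialization but keep the trained signs. This only conveys the sign-preservation idea, not the paper's full procedure.

```python
import torch

def sign_preserving_init(trained_w: torch.Tensor, fresh_w: torch.Tensor) -> torch.Tensor:
    """Combine the magnitudes of a fresh random init with the signs learned
    by iterative pruning, keeping what is arguably the ticket's key content."""
    return fresh_w.abs() * trained_w.sign()
```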
- Gradient Rewiring for Editable Graph Neural Network Training [84.77778876113099]
We propose a simple yet effective Gradient Rewiring method for Editable graph neural network training, named GRE.
arXiv Detail & Related papers (2024-10-21T01:01:50Z)
- Stitching for Neuroevolution: Recombining Deep Neural Networks without Breaking Them [0.0]
Traditional approaches to neuroevolution often start from scratch.
Recombining trained networks is non-trivial because architectures and feature representations typically differ.
We employ stitching, which merges the networks by introducing new layers at crossover points.
arXiv Detail & Related papers (2024-03-21T08:30:44Z)
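A minimal sketch of a stitching layer under the description above, assuming a 1x1 convolution as the new layer at the crossover point (the mapping type and which parts are trained are illustrative choices):

```python
import torch.nn as nn

class Stitch(nn.Module):
    """Maps the channel space at network A's cut point to the channel
    space expected by network B's remaining layers."""
    def __init__(self, ch_a: int, ch_b: int):
        super().__init__()
        self.proj = nn.Conv2d(ch_a, ch_b, kernel_size=1)

    def forward(self, x):
        return self.proj(x)

# Usage sketch: front half of parent A, stitch, back half of parent B.
# child = nn.Sequential(a_front, Stitch(256, 512), b_back)
```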
- Rotation Equivariant Proximal Operator for Deep Unfolding Methods in Image Restoration [62.41329042683779]
We propose a high-accuracy rotation equivariant proximal network that effectively embeds rotation symmetry priors into the deep unfolding framework.
arXiv Detail & Related papers (2023-12-25T11:53:06Z)
- Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization [1.3999481573773072]
We introduce and evaluate the batch-entropy, which quantifies the flow of information through each layer of a neural network. We show that we can train a "vanilla" fully connected network and a convolutional neural network with 500 layers simply by adding the batch-entropy regularization term to the loss function.
arXiv Detail & Related papers (2022-08-01T20:31:58Z)
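As a rough stand-in for the batch-entropy idea, the sketch below scores each layer by the average Gaussian differential entropy of its units' activations across the batch; the paper's exact estimator and regularizer differ in detail.

```python
import torch

def batch_entropy(acts: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Average per-unit Gaussian entropy of activations across the batch,
    a rough proxy for how much information flows through the layer."""
    var = acts.flatten(1).var(dim=0) + eps             # per-unit batch variance
    return (0.5 * torch.log(2 * torch.pi * torch.e * var)).mean()

# Regularized objective (sketch): discourage any layer's batch-entropy
# from collapsing, so gradients keep flowing through very deep stacks.
# loss = task_loss + lbe * sum((batch_entropy(a) - target) ** 2 for a in layer_acts)
```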
- Evolving Architectures with Gradient Misalignment toward Low Adversarial Transferability [4.415977307120616]
We propose an architecture search framework that employs neuroevolution to evolve network architectures. Our experiments show that the proposed framework successfully discovers architectures that reduce transferability from four standard networks. In addition, the evolved networks trained with gradient misalignment exhibit significantly lower transferability than standard networks trained with gradient misalignment.
arXiv Detail & Related papers (2021-09-13T12:41:53Z)
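Gradient misalignment between two networks is commonly expressed as low cosine similarity between their input gradients; a hedged sketch of such a penalty follows (the paper's exact objective may differ).

```python
import torch
import torch.nn.functional as F

def gradient_alignment(model_a, model_b, x, y):
    """Cosine similarity between the two models' input gradients; training
    to minimize this encourages adversarial examples crafted on one model
    to transfer poorly to the other."""
    x = x.clone().requires_grad_(True)
    ga = torch.autograd.grad(F.cross_entropy(model_a(x), y), x, create_graph=True)[0]
    gb = torch.autograd.grad(F.cross_entropy(model_b(x), y), x, create_graph=True)[0]
    return F.cosine_similarity(ga.flatten(1), gb.flatten(1), dim=1).mean()
```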
- Sifting out the features by pruning: Are convolutional networks the winning lottery ticket of fully connected ones? [16.5745082442791]
We study the inductive bias that pruning imprints in such "winning lottery tickets." We show that the surviving node connectivity is local in input space and organized in patterns reminiscent of those found in convolutional networks (CNNs).
arXiv Detail & Related papers (2021-04-27T17:25:54Z)
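The locality of surviving connections can be quantified directly; the sketch below measures the spatial spread of each hidden unit's surviving first-layer weights in a pruned fully connected network (an illustrative metric, not necessarily the paper's).

```python
import numpy as np

def receptive_spread(mask: np.ndarray, img_side: int) -> float:
    """Mean spatial standard deviation of surviving input connections per
    hidden unit; `mask` is [hidden, img_side**2] with 1 where a weight
    survived pruning. Small values indicate local, CNN-like connectivity."""
    ys, xs = np.divmod(np.arange(img_side * img_side), img_side)
    spreads = []
    for row in mask:
        keep = row.astype(bool)
        if keep.sum() < 2:
            continue
        cy, cx = ys[keep].mean(), xs[keep].mean()
        spreads.append(np.sqrt(((ys[keep] - cy) ** 2 + (xs[keep] - cx) ** 2).mean()))
    return float(np.mean(spreads))
```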
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases [16.308432111311195]
Vision Transformers (ViTs) rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification.
We introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a "soft" convolutional inductive bias.
The resulting convolutional-like ViT architecture, ConViT, outperforms DeiT on ImageNet while offering much improved sample efficiency.
arXiv Detail & Related papers (2021-03-19T09:11:20Z)
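A single-head sketch of the gating idea behind GPSA: a learned gate blends content-based attention with a fixed positional attention pattern that plays the role of the soft convolutional bias. Real GPSA is multi-headed with learned positional projections; the distance-based positional logits here are an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPSASketch(nn.Module):
    def __init__(self, dim: int, n_tokens: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.gate = nn.Parameter(torch.zeros(1))       # gating parameter (lambda)
        idx = torch.arange(n_tokens)
        # Fixed positional logits favoring nearby tokens (illustrative).
        self.register_buffer("pos_logits", -(idx[None] - idx[:, None]).abs().float())

    def forward(self, x):                              # x: [B, N, D]
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        content = F.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        positional = F.softmax(self.pos_logits, dim=-1)
        g = torch.sigmoid(self.gate)                   # learned blend factor
        return ((1 - g) * content + g * positional) @ v
```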
- Firefly Neural Architecture Descent: a General Approach for Growing Neural Networks [50.684661759340145]
Firefly neural architecture descent is a general framework for progressively and dynamically growing neural networks.
We show that firefly descent can flexibly grow networks both wider and deeper, and can be applied to learn accurate but resource-efficient neural architectures.
In particular, it learns networks that are smaller in size but have higher average accuracy than those learned by state-of-the-art methods.
arXiv Detail & Related papers (2021-02-17T04:47:18Z)
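Growth in this spirit can be made function-preserving: new hidden units enter with zero outgoing weights, so the network's output is unchanged at birth and gradients decide which new units become useful. A sketch for a pair of linear layers, with illustrative initialization scales:

```python
import torch
import torch.nn as nn

def widen(fc1: nn.Linear, fc2: nn.Linear, n_new: int):
    """Return widened copies of fc1/fc2 with n_new extra hidden units that
    initially leave the network's function unchanged."""
    d_in, d_h, d_out = fc1.in_features, fc1.out_features, fc2.out_features
    new_fc1, new_fc2 = nn.Linear(d_in, d_h + n_new), nn.Linear(d_h + n_new, d_out)
    with torch.no_grad():
        new_fc1.weight[:d_h] = fc1.weight
        new_fc1.bias[:d_h] = fc1.bias
        new_fc1.weight[d_h:].mul_(1e-2)        # small random incoming weights
        new_fc2.weight[:, :d_h] = fc2.weight
        new_fc2.weight[:, d_h:].zero_()        # zero outgoing: function preserved
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2
```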
- Overcoming Catastrophic Forgetting in Graph Neural Networks [50.900153089330175]
Catastrophic forgetting refers to the tendency of a neural network to "forget" previously learned knowledge upon learning new tasks. We propose a novel scheme dedicated to overcoming this problem and hence strengthening continual learning in graph neural networks (GNNs). At the heart of our approach is a generic module, termed topology-aware weight preserving (TWP).
arXiv Detail & Related papers (2020-12-10T22:30:25Z)
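TWP's importance scores are topology-aware, but the weight-preserving term itself can be pictured as an importance-weighted quadratic penalty on parameter drift, as in the sketch below (the importance computation is omitted and assumed given).

```python
def weight_preserving_penalty(model, old_params, importance):
    """Penalize drift of parameters that were important for earlier tasks.
    `old_params` and `importance` map parameter names to tensors saved
    after the previous task."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (importance[name] * (p - old_params[name]) ** 2).sum()
    return penalty
```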
- Cream of the Crop: Distilling Prioritized Paths For One-Shot Neural Architecture Search [60.965024145243596]
One-shot weight sharing methods have recently drawn great attention in neural architecture search due to high efficiency and competitive performance.
To alleviate the shortcomings of weight sharing, we present a simple yet effective architecture distillation method.
We introduce the concept of prioritized path, which refers to the architecture candidates exhibiting superior performance during training.
Since the prioritized paths are changed on the fly depending on their performance and complexity, the final obtained paths are the cream of the crop.
arXiv Detail & Related papers (2020-10-29T17:55:05Z)
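The prioritized-path pool can be pictured as a bounded top-k buffer whose best member serves as the distillation teacher; the sketch below is illustrative and omits the complexity term used in the selection.

```python
import heapq
import itertools

class PriorityPool:
    """Keep the top-k candidate paths by validation score."""
    def __init__(self, k: int = 10):
        self.k, self.pool = k, []                  # min-heap of (score, tie, path)
        self._tie = itertools.count()              # breaks ties between equal scores

    def update(self, path, score):
        item = (score, next(self._tie), path)
        if len(self.pool) < self.k:
            heapq.heappush(self.pool, item)
        else:
            heapq.heappushpop(self.pool, item)     # drop the worst member

    def teacher(self):
        """Best path so far, used as the distillation teacher."""
        return max(self.pool)[2] if self.pool else None
```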
- Dynamic Graph: Learning Instance-aware Connectivity for Neural Networks [78.65792427542672]
Dynamic Graph Network (DG-Net) is a complete directed acyclic graph, where the nodes represent convolutional blocks and the edges represent connection paths. Instead of routing every input through the same path, DG-Net aggregates features dynamically at each node, which gives the network greater representational ability.
arXiv Detail & Related papers (2020-10-02T16:50:26Z)
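Instance-aware aggregation at one node can be sketched as a tiny router that predicts per-sample weights over incoming edges, so different inputs take different effective paths; the real DG-Net block is richer than this.

```python
import torch
import torch.nn as nn

class DynamicNode(nn.Module):
    def __init__(self, n_in: int, channels: int):
        super().__init__()
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, n_in))

    def forward(self, inputs):                     # list of [B, C, H, W] tensors
        x = torch.stack(inputs, dim=1)             # [B, n_in, C, H, W]
        ctx = x.mean(dim=1)                        # pooled summary for the router
        w = torch.softmax(self.router(ctx), dim=-1)            # [B, n_in]
        return (w[:, :, None, None, None] * x).sum(dim=1)      # weighted fusion
```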
- Learning Connectivity of Neural Networks from a Topological Perspective [80.35103711638548]
We propose a topological perspective that represents a network as a complete graph for analysis. By assigning learnable parameters to the edges, which reflect the magnitude of connections, the learning process can be performed in a differentiable manner. This learning process is compatible with existing networks and adapts to larger search spaces and different tasks.
arXiv Detail & Related papers (2020-08-19T04:53:31Z)
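A sketch of differentiable connectivity: every potential edge gets a learnable scalar gate, so training itself decides how strongly each connection is used. Shape-preserving blocks are assumed so the aggregation type-checks.

```python
import torch
import torch.nn as nn

class WeightedDAGCell(nn.Module):
    """Node j aggregates the outputs of all earlier nodes through learnable
    edge gates (complete-graph connectivity, trained end to end)."""
    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)             # shape-preserving ops
        n = len(blocks) + 1
        self.edge_logits = nn.Parameter(torch.zeros(n, n))

    def forward(self, x):
        feats = [x]
        for j, block in enumerate(self.blocks, start=1):
            gates = torch.sigmoid(self.edge_logits[j, :j])   # edges into node j
            agg = sum(g * f for g, f in zip(gates, feats))
            feats.append(block(agg))
        return feats[-1]
```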
- Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives [73.15276998621582]
We propose a generic feature learning mechanism to advance CNN training with enhanced generalization ability. Partially inspired by deeply supervised nets (DSN), we fork delicately designed side branches from the intermediate layers of a given neural network. Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method.
arXiv Detail & Related papers (2020-03-24T09:56:13Z)
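A side branch in this style is just a small auxiliary head forked from an intermediate feature map with its own classification loss; the mutual-mimicking part of DHM is omitted here.

```python
import torch.nn as nn
import torch.nn.functional as F

class SideBranch(nn.Module):
    """Auxiliary classifier forked from an intermediate feature map."""
    def __init__(self, channels: int, n_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, n_classes))

    def forward(self, feat):
        return self.head(feat)

# Usage sketch: the total loss sums the main and side-branch objectives.
# loss = F.cross_entropy(main_logits, y) + sum(
#     F.cross_entropy(branch(f), y) for branch, f in zip(branches, feats))
```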
- Curriculum By Smoothing [52.08553521577014]
Convolutional Neural Networks (CNNs) have shown impressive performance in computer vision tasks such as image classification, detection, and segmentation.
We propose an elegant curriculum-based scheme that smooths the feature embedding of a CNN using anti-aliasing or low-pass filters.
As the amount of information in the feature maps increases during training, the network is able to progressively learn better representations of the data.
arXiv Detail & Related papers (2020-03-03T07:27:44Z)
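The smoothing curriculum can be sketched as a depthwise Gaussian blur applied to feature maps with a kernel width that is annealed over training; the schedule and kernel construction below are illustrative.

```python
import torch
import torch.nn.functional as F

def gaussian_blur(feat: torch.Tensor, sigma: float) -> torch.Tensor:
    """Depthwise Gaussian smoothing of a feature map [B, C, H, W]."""
    k = int(2 * round(3 * sigma) + 1)                  # odd kernel size
    xs = torch.arange(k, dtype=torch.float32) - k // 2
    g1d = torch.exp(-xs ** 2 / (2 * sigma ** 2))
    kern = g1d[:, None] * g1d[None, :]
    kern = (kern / kern.sum()).expand(feat.shape[1], 1, k, k).contiguous()
    return F.conv2d(feat, kern.to(feat), padding=k // 2, groups=feat.shape[1])

# Curriculum sketch: strong smoothing early, detail introduced gradually.
# for epoch in range(epochs):
#     sigma = max(0.1, sigma0 * (1 - epoch / epochs))
#     smoothed = gaussian_blur(feat, sigma)
```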
This list is automatically generated from the titles and abstracts of the papers in this site.