No Free Lunch From Random Feature Ensembles
- URL: http://arxiv.org/abs/2412.05418v1
- Date: Fri, 06 Dec 2024 20:55:27 GMT
- Title: No Free Lunch From Random Feature Ensembles
- Authors: Benjamin S. Ruben, William L. Tong, Hamza Tahir Chaudhry, Cengiz Pehlevan
- Abstract summary: Given a budget on total model size, one must decide whether to train a single, large neural network or to combine the predictions of many smaller networks. We prove that when a fixed number of trainable parameters are partitioned among $K$ independently trained models, $K=1$ achieves optimal performance. We identify conditions on the kernel and task eigenstructure under which ensembles can achieve near-optimal scaling laws.
- Score: 23.661623767100384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a budget on total model size, one must decide whether to train a single, large neural network or to combine the predictions of many smaller networks. We study this trade-off for ensembles of random-feature ridge regression models. We prove that when a fixed number of trainable parameters are partitioned among $K$ independently trained models, $K=1$ achieves optimal performance, provided the ridge parameter is optimally tuned. We then derive scaling laws which describe how the test risk of an ensemble of regression models decays with its total size. We identify conditions on the kernel and task eigenstructure under which ensembles can achieve near-optimal scaling laws. Training ensembles of deep convolutional neural networks on CIFAR-10 and a transformer architecture on C4, we find that a single large network outperforms any ensemble of networks with the same total number of parameters, provided the weight decay and feature-learning strength are tuned to their optimal values.
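The trade-off described in the abstract can be explored with a small numerical sketch. The code below is not the authors' experiment: the synthetic tanh target, ReLU random features, problem sizes, ridge grid, and the function names `rf_ensemble_test_mse` and `sweep` are all assumptions chosen for illustration. It splits a fixed budget of random features among $K$ independently drawn ridge regressors, averages their predictions, and reports the best test error over the ridge grid for each $K$.

```python
import numpy as np


def rf_ensemble_test_mse(K, total_features, ridge, X_tr, y_tr, X_te, y_te, rng):
    """Test MSE of K averaged random-feature ridge regressors sharing one feature budget."""
    n_per_model = total_features // K  # each member gets an equal share of the budget
    d = X_tr.shape[1]
    preds = np.zeros(len(y_te))
    for _ in range(K):
        # Independent random projection per ensemble member (assumed Gaussian weights).
        W = rng.standard_normal((d, n_per_model)) / np.sqrt(d)
        F_tr = np.maximum(X_tr @ W, 0.0)  # ReLU random features (an assumption)
        F_te = np.maximum(X_te @ W, 0.0)
        # Ridge regression on the readout weights only.
        A = F_tr.T @ F_tr + ridge * np.eye(n_per_model)
        w = np.linalg.solve(A, F_tr.T @ y_tr)
        preds += F_te @ w / K  # ensemble prediction is the mean over members
    return float(np.mean((preds - y_te) ** 2))


def sweep(seed=0, total_features=256, Ks=(1, 2, 4, 8),
          ridges=(1e-3, 1e-2, 1e-1, 1.0, 10.0)):
    """Best test MSE over a ridge grid for each ensemble size K (toy synthetic task)."""
    rng = np.random.default_rng(seed)
    d, n_tr, n_te = 20, 200, 1000
    beta = rng.standard_normal(d) / np.sqrt(d)
    X_tr = rng.standard_normal((n_tr, d))
    X_te = rng.standard_normal((n_te, d))
    y_tr = np.tanh(X_tr @ beta) + 0.1 * rng.standard_normal(n_tr)  # noisy labels
    y_te = np.tanh(X_te @ beta)                                    # clean targets
    results = {}
    for K in Ks:
        # Oracle tuning on the test set: a simplification, not a proper validation split.
        results[K] = min(
            rf_ensemble_test_mse(K, total_features, lam, X_tr, y_tr, X_te, y_te, rng)
            for lam in ridges
        )
    return results


if __name__ == "__main__":
    for K, mse in sweep().items():
        print(f"K={K}: best test MSE over ridge grid = {mse:.4f}")
```

The abstract's result concerns risk under an optimally tuned ridge parameter, so a single finite-sample run with a coarse ridge grid like this one may not reproduce the ordering exactly; averaging over several seeds and refining the grid would give a fairer comparison.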
Related papers
- The Law of Multi-Model Collaboration: Scaling Limits of Model Ensembling for Large Language Models [54.51795784459866]
We propose a theoretical framework of performance scaling for multi-model collaboration. We show that multi-model systems follow a power-law scaling with respect to the total parameter count. Ensembles of heterogeneous model families achieve better performance scaling than those formed within a single model family.
arXiv Detail & Related papers (2025-12-29T09:55:12Z) - Towards a Comprehensive Scaling Law of Mixture-of-Experts [54.117786590884776]
We propose a comprehensive and precise joint MoE scaling law that considers all essential factors. Our results demonstrate that the optimal settings for $G$ and $S$ are independent of both the model architecture and data size. Our proposed MoE scaling law could function as an accurate and insightful guidance to facilitate future MoE model design and training.
arXiv Detail & Related papers (2025-09-28T06:35:34Z) - Complexity Scaling Laws for Neural Models using Combinatorial Optimization [5.291101237151254]
We develop scaling laws based on problem complexity. We analyze two fundamental complexity measures: solution space size and representation space size. We show that optimization promotes smooth cost trends, and therefore meaningful scaling laws can be obtained even in the absence of an interpretable loss.
arXiv Detail & Related papers (2025-06-15T18:20:35Z) - Combining Local Symmetry Exploitation and Reinforcement Learning for Optimised Probabilistic Inference -- A Work In Progress [2.2164989053903805]
Efficient probabilistic inference by variable elimination in graphical models requires an optimal elimination order. We adapt a reinforcement learning approach to find efficient contraction orders in tensor networks. We show that leveraging specific structures during inference allows for introducing compact encodings of intermediate results.
arXiv Detail & Related papers (2025-03-11T18:00:23Z) - MPruner: Optimizing Neural Network Size with CKA-Based Mutual Information Pruning [7.262751938473306]
Pruning is a well-established technique that reduces the size of neural networks while mathematically guaranteeing accuracy preservation.
We develop a new pruning algorithm, MPruner, that leverages mutual information through vector similarity.
MPruner achieved up to a 50% reduction in parameters and memory usage for CNN and transformer-based models, with minimal to no loss in accuracy.
arXiv Detail & Related papers (2024-08-24T05:54:47Z) - Scaling Exponents Across Parameterizations and Optimizers [94.54718325264218]
We propose a new perspective on parameterization by investigating a key assumption in prior work.
Our empirical investigation includes tens of thousands of models trained with all combinations of three optimizers.
We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work.
arXiv Detail & Related papers (2024-07-08T12:32:51Z) - Just How Flexible are Neural Networks in Practice? [89.80474583606242]
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters.
In practice, however, we only find solutions via our training procedure, including the gradient and regularizers, limiting flexibility.
arXiv Detail & Related papers (2024-06-17T12:24:45Z) - Robustly estimating heterogeneity in factorial data using Rashomon Partitions [4.76518127830168]
We propose a novel framework for model uncertainty called Rashomon Partition Sets (RPS). RPS consists of all models that have posterior density close to the maximum a posteriori (MAP) model. We give simulation evidence along with three empirical examples: price effects on charitable giving, heterogeneity in chromosomal structure, and the introduction of microfinance.
arXiv Detail & Related papers (2024-04-02T17:53:28Z) - Kronecker-Factored Approximate Curvature for Modern Neural Network Architectures [85.76673783330334]
Two different settings of linear weight-sharing layers motivate two flavours of Kronecker-Factored Approximate Curvature (K-FAC).
We show they are exact for deep linear networks with weight-sharing in their respective setting.
We observe little difference between these two K-FAC variations when using them to train both a graph neural network and a vision transformer.
arXiv Detail & Related papers (2023-11-01T16:37:00Z) - Transfer-Once-For-All: AI Model Optimization for Edge [0.0]
We propose Transfer-Once-For-All (TOFA) for supernet-style training on small data sets with constant computational training cost.
To overcome the challenges arising from small data, TOFA utilizes a unified semi-supervised training loss to simultaneously train all subnets within the supernet.
arXiv Detail & Related papers (2023-03-27T04:14:30Z) - Variable Importance Matching for Causal Inference [73.25504313552516]
We describe a general framework called Model-to-Match that achieves these goals.
Model-to-Match uses variable importance measurements to construct a distance metric.
We operationalize the Model-to-Match framework with LASSO.
arXiv Detail & Related papers (2023-02-23T00:43:03Z) - Autoselection of the Ensemble of Convolutional Neural Networks with Second-Order Cone Programming [0.8029049649310213]
This study proposes a mathematical model which prunes the ensemble of Convolutional Neural Networks (CNN).
The proposed model is tested on CIFAR-10, CIFAR-100 and MNIST data sets.
arXiv Detail & Related papers (2023-02-12T16:18:06Z) - Robust Binary Models by Pruning Randomly-initialized Networks [57.03100916030444]
We propose ways to obtain robust models against adversarial attacks from randomly-initialized binary networks.
We learn the structure of the robust model by pruning a randomly-initialized binary network.
Our method confirms the strong lottery ticket hypothesis in the presence of adversarial attacks.
arXiv Detail & Related papers (2022-02-03T00:05:08Z) - AutoDEUQ: Automated Deep Ensemble with Uncertainty Quantification [0.9449650062296824]
We propose AutoDEUQ, an automated approach for generating an ensemble of deep neural networks.
We show that AutoDEUQ outperforms probabilistic backpropagation, Monte Carlo dropout, deep ensemble, distribution-free ensembles, and hyper ensemble methods on a number of regression benchmarks.
arXiv Detail & Related papers (2021-10-26T09:12:23Z) - Optimizing model-agnostic Random Subspace ensembles [5.680512932725364]
We present a model-agnostic ensemble approach for supervised learning.
The proposed approach alternates between learning an ensemble of models using a parametric version of the Random Subspace approach.
We show the good performance of the proposed approach, both in terms of prediction and feature ranking, on simulated and real-world datasets.
arXiv Detail & Related papers (2021-09-07T13:58:23Z) - Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly-available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: where "scale" metrics perform well overall but perform poorly on sub partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z) - A Fully Tensorized Recurrent Neural Network [48.50376453324581]
We introduce a "fully tensorized" RNN architecture which jointly encodes the separate weight matrices within each recurrent cell.
This approach reduces model size by several orders of magnitude, while still maintaining similar or better performance compared to standard RNNs.
arXiv Detail & Related papers (2020-10-08T18:24:12Z) - ACDC: Weight Sharing in Atom-Coefficient Decomposed Convolution [57.635467829558664]
We introduce a structural regularization across convolutional kernels in a CNN.
We show that CNNs now maintain performance with dramatic reduction in parameters and computations.
arXiv Detail & Related papers (2020-09-04T20:41:47Z) - Pre-Trained Models for Heterogeneous Information Networks [57.78194356302626]
We propose a self-supervised pre-training and fine-tuning framework, PF-HIN, to capture the features of a heterogeneous information network.
PF-HIN consistently and significantly outperforms state-of-the-art alternatives on each of these tasks, on four datasets.
arXiv Detail & Related papers (2020-07-07T03:36:28Z) - Slice Sampling for General Completely Random Measures [74.24975039689893]
We present a novel Markov chain Monte Carlo algorithm for posterior inference that adaptively sets the truncation level using auxiliary slice variables.
The efficacy of the proposed algorithm is evaluated on several popular nonparametric models.
arXiv Detail & Related papers (2020-06-24T17:53:53Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)