Adjusting Model Size in Continual Gaussian Processes: How Big is Big Enough?
- URL: http://arxiv.org/abs/2408.07588v2
- Date: Fri, 13 Dec 2024 19:36:43 GMT
- Title: Adjusting Model Size in Continual Gaussian Processes: How Big is Big Enough?
- Authors: Guiomar Pescador-Barrios, Sarah Filippi, Mark van der Wilk
- Abstract summary: Many machine learning models require setting a parameter that controls their size before training.
This leads to the question "How big is big enough?"
Here, data becomes available incrementally, and the final dataset size will therefore not be known before training.
We develop a method to automatically adjust model size while maintaining near-optimal performance.
- Score: 11.43983519639935
- License:
- Abstract: Many machine learning models require setting a parameter that controls their size before training, e.g. number of neurons in DNNs, or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties. For our method, a single hyperparameter setting works well across diverse datasets, showing that it requires less tuning compared to others.
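As a rough illustration of the kind of capacity control the abstract describes, the Python sketch below grows the inducing set of a sparse GP as data arrives in batches, stopping once additional inducing points no longer reduce a simple residual-variance proxy. This is a minimal sketch under assumed choices (RBF kernel, a Nystrom-residual stopping rule, and illustrative names such as `grow_inducing_set` and `tol`); it is not the paper's actual criterion or algorithm.

```python
# Illustrative sketch only (NOT the paper's method): grow a sparse GP's
# inducing set as batches stream in, stopping when extra capacity no
# longer reduces a residual-variance proxy below a tolerance `tol`.
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def unexplained_variance(X, Z, jitter=1e-6):
    """Mean Nystrom residual diag(Kxx - Kxz Kzz^-1 Kzx): capacity Z misses on X."""
    Kzz = rbf(Z, Z) + jitter * np.eye(len(Z))
    Kxz = rbf(X, Z)
    resid = 1.0 - np.einsum("ij,ij->i", Kxz, np.linalg.solve(Kzz, Kxz.T).T)
    return float(np.mean(np.clip(resid, 0.0, None)))

def grow_inducing_set(Z, X_new, tol=1e-3, max_add=50):
    """Greedily add batch points to Z until extra capacity stops helping."""
    Z = Z.copy()
    for _ in range(max_add):
        if unexplained_variance(X_new, Z) < tol:
            break  # "big enough": residual capacity gap is below tolerance
        # pick the batch point that the current inducing set explains worst
        Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
        Kxz = rbf(X_new, Z)
        per_point = 1.0 - np.einsum("ij,ij->i", Kxz, np.linalg.solve(Kzz, Kxz.T).T)
        Z = np.vstack([Z, X_new[np.argmax(per_point)]])
    return Z

# Usage: stream batches whose distribution drifts; the model size adapts.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 1))  # small initial inducing set
for step in range(3):
    X_batch = rng.normal(size=(200, 1)) * (step + 1)
    Z = grow_inducing_set(Z, X_batch)
    print(f"batch {step}: {len(Z)} inducing points")
```

The design point the sketch is meant to convey: in continual learning the stopping rule must be data-driven, since the final dataset size is unknown, so capacity is added only while a measurable gap remains.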
Related papers
- Computation-Aware Gaussian Processes: Model Selection And Linear-Time Inference [55.150117654242706]
We show that model selection for computation-aware GPs trained on 1.8 million data points can be done within a few hours on a single GPU.
As a result of this work, Gaussian processes can be trained on large-scale datasets without significantly compromising their ability to quantify uncertainty.
arXiv Detail & Related papers (2024-11-01T21:11:48Z) - Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler [34.416299887009195]
We study the correlation between optimal learning rate, batch size, and number of training tokens for the recently proposed WSD scheduler.
We propose a new learning rate scheduler, Power scheduler, that is agnostic about the number of training tokens and batch size.
Our 3B dense and MoE models trained with the Power scheduler achieve comparable performance as state-of-the-art small language models.
arXiv Detail & Related papers (2024-08-23T20:22:20Z) - Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z) - RepCNN: Micro-sized, Mighty Models for Wakeword Detection [3.4888176891918654]
Always-on machine learning models require a very low memory and compute footprint.
We show that a small convolutional model can be better trained by first refactoring its computation into a larger multi-branched architecture.
We show that our always-on wake-word detector model, RepCNN, provides a good trade-off between latency and accuracy during inference.
arXiv Detail & Related papers (2024-06-04T16:14:19Z) - A Dynamical Model of Neural Scaling Laws [79.59705237659547]
We analyze a random feature model trained with gradient descent as a solvable model of network training and generalization.
Our theory shows how the gap between training and test loss can gradually build up over time due to repeated reuse of data.
arXiv Detail & Related papers (2024-02-02T01:41:38Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Hyperparameter-free Continual Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU).
Most existing continual learning approaches suffer from low accuracy and performance fluctuation.
We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z) - Parameter-Efficient Transfer from Sequential Behaviors for User Modeling and Recommendation [111.44445634272235]
In this paper, we develop a parameter-efficient transfer learning architecture, termed PeterRec.
PeterRec allows the pre-trained parameters to remain unaltered during fine-tuning by injecting a series of re-learned neural networks.
We perform extensive experimental ablation to show the effectiveness of the learned user representation in five downstream tasks.
arXiv Detail & Related papers (2020-01-13T14:09:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.