Evaluating natural language processing models with generalization
metrics that do not need access to any training or testing data
- URL: http://arxiv.org/abs/2202.02842v3
- Date: Sun, 4 Jun 2023 21:59:47 GMT
- Title: Evaluating natural language processing models with generalization
metrics that do not need access to any training or testing data
- Authors: Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez,
Kannan Ramchandran, Charles H. Martin, Michael W. Mahoney
- Abstract summary: We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
- Score: 66.11139091362078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting suitable architecture parameters and training hyperparameters is
essential for enhancing machine learning (ML) model performance. Several recent
empirical studies conduct large-scale correlational analysis on neural networks
(NNs) to search for effective \emph{generalization metrics} that can guide this
type of model selection. Effective metrics are typically expected to correlate
strongly with test performance. In this paper, we expand on prior analyses by
examining generalization-metric-based model selection with the following
objectives: (i) focusing on natural language processing (NLP) tasks, as prior
work primarily concentrates on computer vision (CV) tasks; (ii) considering
metrics that directly predict \emph{test error} instead of the
\emph{generalization gap}; (iii) exploring metrics that can be computed without
access to any training or testing data. Guided by these objectives, we provide the first model
selection results on large pretrained Transformers from Huggingface using
generalization metrics. Our analyses consider (I) hundreds of Transformers
trained in different settings, in which we systematically vary the amount of
data, the model size and the optimization hyperparameters, (II) a total of 51
pretrained Transformers from eight families of Huggingface NLP models,
including GPT2, BERT, etc., and (III) a total of 28 existing and novel
generalization metrics. Despite their niche status, we find that metrics
derived from the heavy-tail (HT) perspective are particularly useful in NLP
tasks, exhibiting stronger correlations than other, more popular metrics. To
further examine these metrics, we extend prior formulations relying on power
law (PL) spectral distributions to exponential (EXP) and
exponentially-truncated power law (E-TPL) families.
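The abstract refers to data-free heavy-tailed metrics derived from the spectra of trained weight matrices. As a rough illustration only, the sketch below computes one such metric: a layer-averaged power-law (PL) exponent fitted to each weight matrix's empirical spectral density with a Hill-style estimator. The function names, the tail fraction, and the layer filter are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal, hypothetical sketch (not the paper's own code) of a data-free
# heavy-tailed (HT) generalization metric: fit a power-law (PL) exponent to the
# eigenvalue spectrum of each weight matrix and average over layers.
import numpy as np
import torch


def pl_alpha(weight: torch.Tensor, tail_frac: float = 0.5) -> float:
    """Hill-style estimate of the PL exponent of a weight matrix's eigenvalue
    spectrum, computed from the weights alone (no training or testing data)."""
    w = weight.detach().float()
    if w.ndim > 2:                          # e.g. conv kernels: flatten to 2-D
        w = w.reshape(w.shape[0], -1)
    eigs = torch.linalg.svdvals(w) ** 2     # eigenvalues of W^T W, descending
    k = max(2, int(tail_frac * len(eigs)))  # fit only the largest eigenvalues
    tail, lam_min = eigs[:k], eigs[k - 1]
    # Hill estimator: alpha = 1 + k / sum_i log(lambda_i / lambda_min)
    return float(1.0 + k / torch.log(tail / lam_min).sum())


def model_alpha(model: torch.nn.Module, min_dim: int = 32) -> float:
    """Average PL exponent over all sufficiently large weight matrices."""
    alphas = [pl_alpha(p) for p in model.parameters()
              if p.ndim >= 2 and min(p.shape) >= min_dim]
    return float(np.mean(alphas))
```

In a correlational analysis of the kind described above, one would compute such a value for each trained Transformer and check its rank correlation (e.g., scipy.stats.spearmanr) against the models' test errors; the metric itself never touches training or testing data.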
Related papers
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Scaling Parameter-Constrained Language Models with Quality Data [32.35610029333478]
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters.
We extend the conventional understanding of scaling laws by offering a microscopic view of data quality within the original formulation.
arXiv Detail & Related papers (2024-10-04T02:07:17Z)
- Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
- Few-Shot Load Forecasting Under Data Scarcity in Smart Grids: A Meta-Learning Approach [0.18641315013048293]
This paper proposes adapting an established model-agnostic meta-learning algorithm for short-term load forecasting.
The proposed method can rapidly adapt and generalize to any unknown load time series of arbitrary length.
The proposed model is evaluated using a dataset of historical load consumption data from real-world consumers.
arXiv Detail & Related papers (2024-06-09T18:59:08Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- MARS: Meta-Learning as Score Matching in the Function Space [79.73213540203389]
We present a novel approach to extracting inductive biases from a set of related datasets.
We use functional Bayesian neural network inference, which views the prior as a process and performs inference in the function space.
Our approach can seamlessly acquire and represent complex prior knowledge by metalearning the score function of the data-generating process.
arXiv Detail & Related papers (2022-10-24T15:14:26Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text by utilizing an external datastore.
We show how to achieve up to a 6x inference speed-up while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly-available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox, where "scale" metrics perform well overall but poorly on sub-partitions of the data.
We present two novel shape metrics, one data-independent and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.