Evaluating natural language processing models with generalization
metrics that do not need access to any training or testing data
- URL: http://arxiv.org/abs/2202.02842v3
- Date: Sun, 4 Jun 2023 21:59:47 GMT
- Title: Evaluating natural language processing models with generalization
metrics that do not need access to any training or testing data
- Authors: Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez,
Kannan Ramchandran, Charles H. Martin, Michael W. Mahoney
- Abstract summary: We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
- Score: 66.11139091362078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting suitable architecture parameters and training hyperparameters is
essential for enhancing machine learning (ML) model performance. Several recent
empirical studies conduct large-scale correlational analysis on neural networks
(NNs) to search for effective \emph{generalization metrics} that can guide this
type of model selection. Effective metrics are typically expected to correlate
strongly with test performance. In this paper, we expand on prior analyses by
examining generalization-metric-based model selection with the following
objectives: (i) focusing on natural language processing (NLP) tasks, as prior
work primarily concentrates on computer vision (CV) tasks; (ii) considering
metrics that directly predict \emph{test error} instead of the
\emph{generalization gap}; (iii) exploring metrics that do not need access to
data to compute. From these objectives, we are able to provide the first model
selection results on large pretrained Transformers from Huggingface using
generalization metrics. Our analyses consider (I) hundreds of Transformers
trained in different settings, in which we systematically vary the amount of
data, the model size and the optimization hyperparameters, (II) a total of 51
pretrained Transformers from eight families of Huggingface NLP models,
including GPT2, BERT, etc., and (III) a total of 28 existing and novel
generalization metrics. Despite their niche status, we find that metrics
derived from the heavy-tail (HT) perspective are particularly useful in NLP
tasks, exhibiting stronger correlations than other, more popular metrics. To
further examine these metrics, we extend prior formulations relying on power
law (PL) spectral distributions to exponential (EXP) and
exponentially-truncated power law (E-TPL) families.
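The heavy-tailed metrics discussed above are computed from the eigenvalue spectra of a model's weight matrices alone, so neither training nor test data is required. The snippet below is a minimal sketch of that idea under stated assumptions, not the paper's exact fitting procedure: the checkpoint name, the layer-selection rule, and the median-based tail cutoff in the hypothetical pl_alpha helper are illustrative choices.

```python
import numpy as np
import torch
from transformers import AutoModel

def pl_alpha(eigs, xmin=None):
    """Crude MLE of a power-law (PL) exponent for the tail of an empirical
    spectral density; the paper fits xmin and the PL/E-TPL/EXP families far
    more carefully."""
    lam = np.sort(np.asarray(eigs))
    if xmin is None:
        xmin = np.median(lam)          # rough tail cutoff (illustrative assumption)
    tail = lam[lam >= xmin]
    return 1.0 + len(tail) / np.sum(np.log(tail / xmin))

# Data-free "shape" metric: average PL exponent over the large weight matrices
# of a Huggingface checkpoint (the model name here is only an example).
model = AutoModel.from_pretrained("gpt2")
alphas = []
for name, W in model.named_parameters():
    if W.dim() == 2 and min(W.shape) >= 64:            # skip biases and tiny matrices
        sv = torch.linalg.svdvals(W.detach()).numpy()  # singular values of W
        alphas.append(pl_alpha(sv ** 2))               # eigenvalues of W^T W
print("mean PL alpha:", float(np.mean(alphas)))        # smaller alpha ~ heavier tail
```

In the heavy-tailed self-regularization line of work this paper builds on, smaller exponents (heavier tails) have generally been reported to correlate with better test performance; rank correlations of this kind, measured against test error rather than the generalization gap, are what the analyses above examine.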
Related papers
- Meta-Statistical Learning: Supervised Learning of Statistical Inference [59.463430294611626]
This work demonstrates that the tools and principles driving the success of large language models (LLMs) can be repurposed to tackle distribution-level tasks.
We propose meta-statistical learning, a framework inspired by multi-instance learning that reformulates statistical inference tasks as supervised learning problems.
arXiv Detail & Related papers (2025-02-17T18:04:39Z)
- UNEM: UNrolled Generalized EM for Transductive Few-Shot Learning [35.62208317531141]
We advocate and introduce the unrolling paradigm, also referred to as "learning to optimize."
Our unrolling approach covers various statistical feature distributions and pre-training paradigms.
We report comprehensive experiments, which cover a breadth of fine-grained downstream image classification tasks.
arXiv Detail & Related papers (2024-12-21T19:01:57Z)
- What Do Learning Dynamics Reveal About Generalization in LLM Reasoning? [83.83230167222852]
We find that a model's generalization behavior can be effectively characterized by a training metric we call pre-memorization train accuracy.
By connecting a model's learning behavior to its generalization, pre-memorization train accuracy can guide targeted improvements to training strategies.
arXiv Detail & Related papers (2024-11-12T09:52:40Z)
- Scaling Parameter-Constrained Language Models with Quality Data [32.35610029333478]
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters.
We extend the conventional understanding of scaling laws (the standard parametric form is sketched below) by offering a microscopic view of data quality within the original formulation.
arXiv Detail & Related papers (2024-10-04T02:07:17Z)
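For context, the "conventional understanding" referenced in the entry above is usually a Chinchilla-style parametric fit of training loss against parameter count N and token count D; the form below is that standard convention, stated here as an assumption rather than as that paper's own formulation.

```latex
% Standard parametric scaling-law fit: E is the irreducible loss;
% A, B, \alpha, \beta are constants fitted to training runs.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```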
- Transformers are Minimax Optimal Nonparametric In-Context Learners [36.291980654891496]
In-context learning of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples.
We develop approximation and generalization error bounds for a transformer composed of a deep neural network and one linear attention layer.
We show that sufficiently trained transformers can achieve -- and even improve upon -- the minimax optimal estimation risk in context.
arXiv Detail & Related papers (2024-08-22T08:02:10Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
arXiv Detail & Related papers (2021-09-09T12:32:28Z)
- Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: "scale" metrics perform well overall but poorly on sub-partitions of the data (a minimal data-free scale-metric sketch follows this list).
We present two novel shape metrics, one data-independent and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z)
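To make the scale-versus-shape distinction in the last entry concrete, a typical data-independent "scale" metric is an average log spectral norm, which can be computed from a checkpoint in the same data-free way as the PL-exponent sketch above. This is a hypothetical illustration, not the contest paper's code, and the checkpoint name is again only an example.

```python
import numpy as np
import torch
from transformers import AutoModel

# Hypothetical "scale" metric: mean log spectral norm over large weight
# matrices, computed without any training or test data.
model = AutoModel.from_pretrained("gpt2")
log_norms = []
for name, W in model.named_parameters():
    if W.dim() == 2 and min(W.shape) >= 64:
        sigma_max = torch.linalg.matrix_norm(W.detach(), ord=2)  # largest singular value
        log_norms.append(float(torch.log(sigma_max)))
print("mean log spectral norm:", float(np.mean(log_norms)))
```

Scale metrics of this kind depend on weight magnitudes, whereas shape metrics such as the PL exponent depend on the form of the spectral distribution; the Simpson's paradox noted above is that the former can look strong in aggregate while performing poorly within sub-partitions of the data.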
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.