Evaluating natural language processing models with generalization
metrics that do not need access to any training or testing data
- URL: http://arxiv.org/abs/2202.02842v3
- Date: Sun, 4 Jun 2023 21:59:47 GMT
- Authors: Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez,
Kannan Ramchandran, Charles H. Martin, Michael W. Mahoney
- Abstract summary: We provide the first model selection results on large pretrained Transformers from Huggingface using generalization metrics.
Despite their niche status, we find that metrics derived from the heavy-tail (HT) perspective are particularly useful in NLP tasks.
- Score: 66.11139091362078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Selecting suitable architecture parameters and training hyperparameters is
essential for enhancing machine learning (ML) model performance. Several recent
empirical studies conduct large-scale correlational analysis on neural networks
(NNs) to search for effective generalization metrics that can guide this
type of model selection. Effective metrics are typically expected to correlate
strongly with test performance. In this paper, we expand on prior analyses by
examining generalization-metric-based model selection with the following
objectives: (i) focusing on natural language processing (NLP) tasks, as prior
work primarily concentrates on computer vision (CV) tasks; (ii) considering
metrics that directly predict test error instead of the
generalization gap; (iii) exploring metrics that do not need access to
data to compute. From these objectives, we are able to provide the first model
selection results on large pretrained Transformers from Huggingface using
generalization metrics. Our analyses consider (I) hundreds of Transformers
trained in different settings, in which we systematically vary the amount of
data, the model size and the optimization hyperparameters, (II) a total of 51
pretrained Transformers from eight families of Huggingface NLP models,
including GPT2, BERT, etc., and (III) a total of 28 existing and novel
generalization metrics. Despite their niche status, we find that metrics
derived from the heavy-tail (HT) perspective are particularly useful in NLP
tasks, exhibiting stronger correlations than other, more popular metrics. To
further examine these metrics, we extend prior formulations relying on power
law (PL) spectral distributions to exponential (EXP) and
exponentially-truncated power law (E-TPL) families.
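As a rough illustration of the kind of data-free, heavy-tail metric the abstract describes, the Python sketch below fits a power-law exponent to the eigenvalue spectrum of a single weight matrix and then rank-correlates that exponent with test error across several models. It is not the authors' implementation. The function names (`esd`, `pl_alpha`, `spearman`), the 1/N normalization, the choice of the median eigenvalue as the tail cutoff `xmin`, the random Student-t matrices standing in for trained layers, and the test-error values are all assumptions made for the example.

```python
import numpy as np


def esd(weight):
    """Eigenvalues of (1/N) W^T W for one weight matrix (no data needed)."""
    w = np.asarray(weight, dtype=float)
    n = max(w.shape)  # assumed normalization: the larger dimension
    singular_values = np.linalg.svd(w, compute_uv=False)
    return singular_values ** 2 / n


def pl_alpha(eigs, xmin):
    """Continuous power-law MLE for p(x) ~ x^(-alpha) on the tail x >= xmin."""
    eigs = np.asarray(eigs, dtype=float)
    tail = eigs[eigs >= xmin]
    return 1.0 + tail.size / np.sum(np.log(tail / xmin))


def spearman(a, b):
    """Spearman rank correlation, assuming no tied values."""
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return float(np.corrcoef(rank_a, rank_b)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical stand-ins for trained layers: Student-t entries with
    # different degrees of freedom give spectra with different tails.
    alphas = []
    for dof in (2.5, 3.0, 4.0, 6.0):
        w = rng.standard_t(dof, size=(300, 200))
        eigs = esd(w)
        # Assumed cutoff: fit only the upper half of the spectrum.
        alphas.append(pl_alpha(eigs, xmin=np.median(eigs)))
    # Made-up test errors, for illustration only (not results from the paper).
    fake_test_errors = [0.10, 0.14, 0.19, 0.25]
    print("alpha per model:", np.round(alphas, 2))
    print("rank correlation:", spearman(alphas, fake_test_errors))
```

A fixed-quantile `xmin` keeps the sketch short; a more careful fit would select the cutoff from the data, and a whole-model metric would aggregate the exponent over many layers rather than use one matrix per model.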
Related papers
- Few-Shot Load Forecasting Under Data Scarcity in Smart Grids: A Meta-Learning Approach [0.18641315013048293]
This paper proposes adapting an established model-agnostic meta-learning algorithm for short-term load forecasting.
The proposed method can rapidly adapt and generalize within any unknown load time series of arbitrary length.
The proposed model is evaluated using a dataset of historical load consumption data from real-world consumers.
(2024-06-09T18:59:08Z)
- The Fine Line: Navigating Large Language Model Pretraining with Down-streaming Capability Analysis [60.52921835351632]
This paper undertakes a comprehensive comparison of model capabilities at various pretraining intermediate checkpoints.
We confirm that specific downstream metrics exhibit similar training dynamics across models of different sizes.
In addition to our core findings, we've reproduced Amber and OpenLLaMA, releasing their intermediate checkpoints.
(2024-04-01T16:00:01Z)
- Few-shot learning for automated content analysis: Efficient coding of arguments and claims in the debate on arms deliveries to Ukraine [0.9576975587953563]
Pre-trained language models (PLM) based on transformer neural networks offer great opportunities to improve automatic content analysis in communication science.
Three characteristics so far impeded the widespread adoption of the methods in the applying disciplines: the dominance of English language models in NLP research, the necessary computing resources, and the effort required to produce training data to fine-tune PLMs.
We test our approach on a realistic use case from communication science to automatically detect claims and arguments together with their stance in the German news debate on arms deliveries to Ukraine.
(2023-12-28T11:39:08Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
(2023-09-20T10:31:17Z)
- MARS: Meta-Learning as Score Matching in the Function Space [79.73213540203389]
We present a novel approach to extracting inductive biases from a set of related datasets.
We use functional Bayesian neural network inference, which views the prior as a process and performs inference in the function space.
Our approach can seamlessly acquire and represent complex prior knowledge by metalearning the score function of the data-generating process.
(2022-10-24T15:14:26Z)
- Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference speed while retaining comparable performance.
(2021-09-09T12:32:28Z)
- Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly-available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: where "scale" metrics perform well overall but perform poorly on sub partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
(2021-06-01T19:19:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.