Evaluating Large Language Models for Generalization and Robustness via Data Compression
- URL: http://arxiv.org/abs/2402.00861v2
- Date: Sun, 4 Feb 2024 01:16:25 GMT
- Title: Evaluating Large Language Models for Generalization and Robustness via Data Compression
- Authors: Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin
- Abstract summary: We propose a data compression based evaluation approach that tests how models' predictive abilities generalize after their training cutoff.
Specifically, we collect comprehensive test data spanning 83 months from 2017 to 2023 and split the data into training and testing periods according to models' training data cutoff.
Our experiments test 14 representative large language models of various sizes on sources including Wikipedia, news articles, code, arXiv papers, and multi-modal data.
- Score: 19.17779153163157
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for evaluating large language models face challenges such as
data contamination, sensitivity to prompts, and the high cost of benchmark
creation. To address this, we propose a lossless data compression based
evaluation approach that tests how models' predictive abilities generalize
after their training cutoff. Specifically, we collect comprehensive test data
spanning 83 months from 2017 to 2023 and split the data into training and
testing periods according to models' training data cutoff. We measure: 1) the
compression performance on the testing period as a measure of generalization on
unseen data; and 2) the performance gap between the training and testing period
as a measure of robustness. Our experiments test 14 representative large
language models of various sizes on sources including Wikipedia, news
articles, code, arXiv papers, and multi-modal data. We find that the
compression performance of many models degrades significantly after their cutoff date,
but models such as Mistral and Llama-2 demonstrate a good balance between
performance and robustness. Results also suggest that models struggle to
generalize on news and code data, but work especially well on arXiv papers. We
also find that context size and tokenization implementation have a big impact on the overall compression performance.
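To make the metric concrete, below is a minimal sketch (not the authors' released code) of how a language model's likelihood becomes a lossless compression rate: under arithmetic coding, the negative log-likelihood of a text equals its code length, so bits-per-byte doubles as compression performance. The model name and sample strings are illustrative assumptions; the paper evaluates 14 larger models on monthly data slices.

```python
# Minimal sketch of compression-based LM evaluation (assumptions noted inline).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bits_per_byte(model, tokenizer, text: str, device: str = "cpu") -> float:
    """Code length in bits per raw byte of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt").to(device)
    with torch.no_grad():
        # With labels = inputs, HF returns the mean cross-entropy (in nats)
        # over the n - 1 predicted tokens.
        out = model(**enc, labels=enc["input_ids"])
    n_predicted = enc["input_ids"].numel() - 1  # the first token is never predicted
    total_bits = out.loss.item() * n_predicted / math.log(2)
    return total_bits / len(text.encode("utf-8"))

model_name = "gpt2"  # placeholder; the paper tests 14 larger models
tok = AutoTokenizer.from_pretrained(model_name)
lm = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical pre- and post-cutoff samples standing in for monthly corpora.
rate_train = bits_per_byte(lm, tok, "Text drawn from before the training cutoff.")
rate_test = bits_per_byte(lm, tok, "Text drawn from after the training cutoff.")
print(f"train {rate_train:.3f} bpb | test {rate_test:.3f} bpb | gap {rate_test - rate_train:+.3f}")
```

Evaluating this rate on pre-cutoff and post-cutoff months yields the two quantities above: the test-period rate measures generalization to unseen data, and the train/test gap measures robustness.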
Related papers
- Self-calibration for Language Model Quantization and Pruning [38.00221764773372]
Quantization and pruning are fundamental approaches for model compression.
In a post-training setting, state-of-the-art quantization and pruning methods require calibration data.
We propose self-calibration as a solution.
arXiv Detail & Related papers (2024-10-22T16:50:00Z) - Ranking LLMs by compression [13.801767671391604]
We use five large language models as priors for compression, then compare their performance on challenging natural language processing tasks.
Experimental results show that compression ratio and model performance are positively correlated, so compression ratio can serve as a general metric to evaluate large language models.
arXiv Detail & Related papers (2024-06-20T10:23:38Z) - The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes [30.30769701138665]
We introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data.
Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem.
We introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point.
arXiv Detail & Related papers (2024-02-14T03:43:05Z) - Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z) - Learning Defect Prediction from Unrealistic Data [57.53586547895278]
Pretrained models of code have become popular choices for code understanding and generation tasks.
Such models tend to be large and require commensurate volumes of training data.
It has become popular to train models with far larger but less realistic datasets, such as functions with artificially injected bugs.
Models trained on such data tend to only perform well on similar data, while underperforming on real world programs.
arXiv Detail & Related papers (2023-11-02T01:51:43Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Test-Time Training on Nearest Neighbors for Large Language Models [25.365366617508663]
We build a large-scale distributed index based on text embeddings of the Pile dataset.
For each test input, our system retrieves its neighbors and fine-tunes the model on their text.
Surprisingly, retrieving and training on as few as 20 neighbors drastically improves performance across more than 20 language modeling tasks; a rough sketch of this recipe appears after this list.
arXiv Detail & Related papers (2023-05-29T08:03:28Z) - Preserving Knowledge Invariance: Rethinking Robustness Evaluation of
Open Information Extraction [50.62245481416744]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world.
We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique.
Under this elaborated robustness metric, a model is judged robust only if its performance is consistently accurate across each entire clique.
arXiv Detail & Related papers (2023-05-23T12:05:09Z) - Impact of Pretraining Term Frequencies on Few-Shot Reasoning [51.990349528930125]
We investigate how well pretrained language models reason with terms that are less frequent in the pretraining data.
We measure the strength of the correlation between pretraining term frequency and few-shot performance for a number of GPT-based language models on various numerical deduction tasks.
Although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.
arXiv Detail & Related papers (2022-02-15T05:43:54Z) - How much pretraining data do language models need to learn syntax? [12.668478784932878]
Transformer-based pretrained language models achieve outstanding results in many well-known NLU benchmarks.
We study the impact of pretraining data size on the knowledge of the models using RoBERTa.
arXiv Detail & Related papers (2021-09-07T15:51:39Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.